Evaluating LLM outputs at scale on AWS, The Cloud Ledger

We shipped a summarization feature, and within a week support tickets came in about summaries that "sounded right but left out the refund amount." Eyeballing a dozen outputs in a notebook had told us the model was great. It was not. The gap was that we had no way to evaluate thousands of outputs the way real traffic exercised them.

Evaluating LLM outputs at scale is a data pipeline problem as much as a modeling one. Here is how I build evaluation harnesses on AWS that run over tens of thousands of examples and give a number I can put in a dashboard.

Pick the right kind of metric per task

There is no single LLM quality score. I bucket tasks and choose accordingly:

Task	Metric type	Tooling
Extraction / classification	Exact match, F1 against labels	Deterministic Python
Summarization / RAG answers	LLM-as-judge on faithfulness	Bedrock model call
Free-form generation	Pairwise preference vs baseline	LLM-as-judge, A/B
Retrieval quality	Recall@k, MRR	Deterministic over ground truth

Use deterministic checks wherever a ground truth exists. They cost almost nothing to run, are fast, and are not subject to a judge's own errors. Reserve LLM-as-judge for the genuinely subjective dimensions.
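As a rough sketch of the deterministic rows in the table, the functions below compute exact match, per-label F1, Recall@k, and reciprocal rank (MRR is the mean of reciprocal rank over queries). The function names and the normalization in exact_match are my own assumptions, not a fixed convention.

```python
from __future__ import annotations

from typing import Sequence


def exact_match(pred: str, gold: str) -> bool:
    """Exact match after trimming whitespace and lowercasing (an assumed normalization)."""
    return pred.strip().lower() == gold.strip().lower()


def f1_for_label(preds: Sequence[str], golds: Sequence[str], label: str) -> float:
    """F1 for one label, treating that label as the positive class."""
    pairs = list(zip(preds, golds))
    tp = sum(p == label and g == label for p, g in pairs)
    fp = sum(p == label and g != label for p, g in pairs)
    fn = sum(p != label and g == label for p, g in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging reciprocal_rank over every query in the set gives MRR; averaging recall_at_k gives the dashboard number for retrieval.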

LLM-as-judge that you can trust

An LLM judge is only useful if it is calibrated against human labels. I always validate the judge against a few hundred human-rated examples before trusting it on the full set, and I force structured output so scoring is parseable.

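import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Doubled braces are literal braces; single braces are str.format placeholders.
JUDGE_PROMPT = """You are grading a summary for FAITHFULNESS to the source.
Source: {source}
Summary: {summary}
Return only JSON: {{"faithful": true|false, "missing_facts": [..], "score": 1-5}}"""

def judge(source: str, summary: str) -> dict:
    """Ask the judge model for a faithfulness verdict and return it as a dict."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "temperature": 0,  # keep verdicts as repeatable as possible
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary),
        }],
    }
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        body=json.dumps(body),
    )
    # The response body is a stream; the verdict is in the first content block.
    text = json.loads(resp["body"].read())["content"][0]["text"]
    # Raises json.JSONDecodeError if the judge returns anything other than bare JSON.
    return json.loads(text)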

Set the judge temperature to 0 and force JSON output. A judge that phrases its verdict differently each run is unmeasurable, and you cannot improve what you cannot measure consistently.
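To make the calibration step concrete, here is a minimal sketch that compares the judge's binary faithful verdicts with human labels on the same examples. It reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Kappa is my choice of statistic, not something the pipeline requires, and the function name is hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def agreement_and_kappa(judge: Sequence[bool], human: Sequence[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two aligned lists of binary labels."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("judge and human must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share of examples the judge marked faithful
    p_human = sum(human) / n   # share of examples humans marked faithful
    # Agreement expected if both raters labeled independently at those rates.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Kappa is undefined when both raters use a single label; reported as 0.0 here.
        return observed, 0.0
    return observed, (observed - expected) / (1 - expected)
```

High raw agreement with low kappa usually means one label dominates the sample, so the judge can look accurate while rarely catching the unfaithful cases.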

Running it at scale, cheaply

Calling the judge synchronously over 50,000 examples is slow and expensive. Two AWS levers help:

Bedrock batch inference. Submit a JSONL file of prompts to S3 and let Bedrock process them asynchronously at roughly half the on-demand token price. Latency is minutes to hours, which is fine for offline eval.
Step Functions Map state. Fan out deterministic scoring across the dataset with a concurrency cap so you control throughput and cost.
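For the batch route, the input file can be produced along these lines. This sketch assumes the batch input is one JSON object per line with recordId and modelInput fields, which is my understanding of the Bedrock batch format; check it against the current documentation before relying on it. The example field names (id, source, summary) and the shortened prompt template are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Shortened stand-in for the judge prompt; placeholders are filled per example.
PROMPT_TEMPLATE = (
    "You are grading a summary for FAITHFULNESS to the source.\n"
    "Source: {source}\n"
    "Summary: {summary}\n"
    "Return only JSON."
)


def to_batch_lines(examples: Iterable[dict], max_tokens: int = 512) -> Iterator[str]:
    """Yield one JSONL line per example, keyed by recordId so outputs can be joined back."""
    for ex in examples:
        record = {
            "recordId": ex["id"],  # assumed field name in the eval dataset
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "temperature": 0,
                "messages": [{
                    "role": "user",
                    "content": PROMPT_TEMPLATE.format(
                        source=ex["source"], summary=ex["summary"]
                    ),
                }],
            },
        }
        yield json.dumps(record)
```

Writing "\n".join(to_batch_lines(examples)) to a file and uploading it to S3 gives the input that the batch job reads. The recordId is what lets you match each judged output back to its example afterwards.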

I store every raw output, its prompt, the model version, and the eval scores in a partitioned table so I can query trends in Athena. Tracking the model version is non-negotiable; otherwise a regression after a model update is invisible.
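One way to lay that table out, as a sketch: each score is a JSON line that carries the model version, and files are written under a Hive-style dt= prefix so Athena can prune by run date. Partitioning by date, the record fields, and the file name are all assumptions for illustration; the actual schema may differ.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Iterable


@dataclass
class EvalRecord:
    """One scored output. Field names are illustrative."""
    example_id: str
    model_version: str  # logged on every row so regressions can be traced to a version
    prompt: str
    output: str
    faithful: bool
    score: int


def partition_key(prefix: str, run_date: date) -> str:
    """Object key under a Hive-style dt=YYYY-MM-DD partition."""
    return f"{prefix}/dt={run_date.isoformat()}/scores.jsonl"


def to_jsonl(records: Iterable[EvalRecord]) -> str:
    """Serialize records as newline-delimited JSON, one row per line."""
    return "".join(json.dumps(asdict(r)) + "\n" for r in records)
```

Because model_version is a regular column rather than a partition key, a trend query can group by it directly while still filtering on dt to limit the data scanned.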

Close the loop with regression gates

Evaluation that does not gate a deploy is just a report nobody reads. I wire the harness into CI: a candidate prompt or model must beat the current production baseline on the held-out set before it ships.

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:llm-eval \
  --input '{"candidate":"prompt-v7","baseline":"prod","dataset":"s3://evals/golden-500.jsonl"}'
# start-execution returns immediately; CI polls describe-execution for the result
# and fails the build if faithfulness drops below the baseline threshold

The held-out "golden" set is curated by hand from real failure cases, including that missing-refund-amount example that started all this. Real failures make the best regression tests.
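The gate decision itself can be very small. The sketch below assumes both runs produce per-example faithfulness scores on the same golden set and compares their means; the function names and the min_gain parameter are hypothetical, and with the default of 0.0 a tie passes, which you may want to tighten.

```python
from statistics import mean
from typing import Sequence


def passes_gate(
    candidate: Sequence[float],
    baseline: Sequence[float],
    min_gain: float = 0.0,
) -> bool:
    """True if the candidate's mean score is at least the baseline's plus min_gain."""
    return mean(candidate) >= mean(baseline) + min_gain


def gate_exit_code(
    candidate: Sequence[float],
    baseline: Sequence[float],
    min_gain: float = 0.0,
) -> int:
    """Process exit code for CI: 0 lets the build continue, 1 fails it."""
    ok = passes_gate(candidate, baseline, min_gain)
    print(
        f"candidate={mean(candidate):.3f} baseline={mean(baseline):.3f} "
        f"min_gain={min_gain} -> {'PASS' if ok else 'FAIL'}"
    )
    return 0 if ok else 1
```

A CI step would pass the result of gate_exit_code to sys.exit. On a set of 500 examples a small difference in means can be noise, so a non-zero min_gain or a per-example comparison is worth considering.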

Takeaways

Use deterministic metrics wherever ground truth exists; reserve LLM-as-judge for subjective dimensions only.
Calibrate any judge against human labels first, run it at temperature 0, and force structured JSON output.
Use Bedrock batch inference and Step Functions Map to score large sets at controlled cost and concurrency.
Gate deploys on a curated golden set built from real failures, and always log the model version with each score.
