Evaluating LLM outputs at scale on AWS
Building an evaluation harness so you can ship prompt and model changes with confidence.
We shipped a summarization feature, and within a week support tickets came in about summaries that "sounded right but left out the refund amount." Eyeballing a dozen outputs in a notebook had told us the model was great. It was not. The gap was that we had no way to evaluate thousands of outputs the way real traffic exercised them.
Evaluating LLM outputs at scale is a data pipeline problem as much as a modeling one. Here is how I build evaluation harnesses on AWS that run over tens of thousands of examples and give a number I can put in a dashboard.
Pick the right kind of metric per task
There is no single LLM quality score. I bucket tasks and choose accordingly:
| Task | Metric type | Tooling |
|---|---|---|
| Extraction / classification | Exact match, F1 against labels | Deterministic Python |
| Summarization / RAG answers | LLM-as-judge on faithfulness | Bedrock model call |
| Free-form generation | Pairwise preference vs baseline | LLM-as-judge, A/B |
| Retrieval quality | Recall@k, MRR | Deterministic over ground truth |
Use deterministic checks wherever a ground truth exists. They are free, fast, and not subject to a judge's own errors. Reserve LLM-as-judge for the genuinely subjective dimensions.
LLM-as-judge that you can trust
An LLM judge is only useful if it is calibrated against human labels. I always validate the judge against a few hundred human-rated examples before trusting it on the full set, and I force structured output so scoring is parseable.
import json, boto3
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_PROMPT = """You are grading a summary for FAITHFULNESS to the source.
Source: {source}
Summary: {summary}
Return only JSON: {{"faithful": true|false, "missing_facts": [..], "score": 1-5}}"""
def judge(source: str, summary: str) -> dict:
body = {
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 512,
"temperature": 0,
"messages": [{
"role": "user",
"content": JUDGE_PROMPT.format(source=source, summary=summary),
}],
}
resp = bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=json.dumps(body),
)
text = json.loads(resp["body"].read())["content"][0]["text"]
return json.loads(text)
Set the judge temperature to 0 and force JSON output. A judge that phrases its verdict differently each run is unmeasurable, and you cannot improve what you cannot measure consistently.
Running it at scale, cheaply
Calling the judge synchronously over 50,000 examples is slow and expensive. Two AWS levers help:
- Bedrock batch inference. Submit a JSONL file of prompts to S3 and let Bedrock process them asynchronously at roughly half the on-demand token price. Latency is minutes to hours, which is fine for offline eval.
- Step Functions Map state. Fan out deterministic scoring across the dataset with a concurrency cap so you control throughput and cost.
I store every raw output, its prompt, the model version, and the eval scores in a partitioned table so I can query trends in Athena. Tracking the model version is non-negotiable; otherwise a regression after a model update is invisible.
Close the loop with regression gates
Evaluation that does not gate a deploy is just a report nobody reads. I wire the harness into CI: a candidate prompt or model must beat the current production baseline on the held-out set before it ships.
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:llm-eval \
--input '{"candidate":"prompt-v7","baseline":"prod","dataset":"s3://evals/golden-500.jsonl"}'
# CI fails the build if faithfulness drops below the baseline threshold
The held-out "golden" set is curated by hand from real failure cases, including that missing-refund-amount example that started all this. Real failures make the best regression tests.
Takeaways
- Use deterministic metrics wherever ground truth exists; reserve LLM-as-judge for subjective dimensions only.
- Calibrate any judge against human labels first, run it at temperature 0, and force structured JSON output.
- Use Bedrock batch inference and Step Functions Map to score large sets at controlled cost and concurrency.
- Gate deploys on a curated golden set built from real failures, and always log the model version with each score.