
What LLM Benchmarks Don't Tell You (And How to Evaluate AI Models Yourself)

Benchmark scores are useful but widely misunderstood. This guide explains contamination, benchmark gaming, and the real-world gap — then shows you how to evaluate models for your specific tasks.

Travis Johnson

Founder, Deepest

July 27, 2025 · 11 min read

A model with a 90% MMLU score isn't necessarily twice as useful as one with a 45% score. Benchmarks measure specific things in specific conditions, and the gap between benchmark performance and real-world utility is wider than most people realize. Here's how to think about benchmarks correctly — and how to evaluate models for your actual needs.

The Three Core Problems with AI Benchmarks

1. Benchmark Contamination

Contamination happens when a model's training data includes the benchmark test set — meaning the model has effectively "seen the answers" before taking the test. This inflates scores without reflecting genuine capability improvement.

Contamination is widespread and poorly policed. Many benchmark datasets are publicly available, and since AI models are trained on internet data, training sets inevitably overlap with test sets. The models with the highest scores on MMLU are often the ones with the most internet training data — which includes MMLU questions and answers.

Researchers have documented clear contamination in multiple major models. In one study, a model that scored 87% on MMLU fell to 72% when tested on a "secret" variant of MMLU with rephrased but equivalent questions. The 15-point gap represents contamination, not capability.
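The rephrased-variant check described above can be sketched in a few lines. This is a minimal illustration, not the methodology of any particular study: it assumes you have already graded the model's answers on both the public questions and the rephrased equivalents, and simply measures the accuracy gap.

```python
# Sketch: estimate a contamination gap by comparing accuracy on the public
# benchmark questions vs. a rephrased-but-equivalent variant.
# `results_public` and `results_rephrased` are lists of booleans
# (True = answered correctly); how you obtain them is up to you.

def accuracy(results):
    return sum(results) / len(results)

def contamination_gap(results_public, results_rephrased):
    """A positive gap suggests the model memorized the public phrasing."""
    return accuracy(results_public) - accuracy(results_rephrased)

# Made-up grading results matching the 87% -> 72% example above:
public = [True] * 87 + [False] * 13
rephrased = [True] * 72 + [False] * 28
print(f"gap: {contamination_gap(public, rephrased):.0%}")  # gap: 15%
```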

2. Benchmark Gaming

Benchmark gaming is different from contamination. It refers to deliberate training choices that optimize for benchmark performance without improving actual capability. Models can be fine-tuned specifically on benchmark-adjacent tasks to boost scores without becoming more useful in practice.

Signs of potential gaming include:

  • A model that dramatically outperforms its apparent capability tier on one specific benchmark while underperforming on others
  • Scores that improve rapidly across releases without corresponding improvements in real-world use
  • Discrepancies between provider-reported scores and independent reproductions

3. The Real-World Gap

Benchmarks test performance in controlled conditions. Real use involves:

  • Ambiguous prompts with implicit context
  • Domain-specific terminology and conventions
  • Multi-turn conversations where context compounds
  • Tasks that don't map neatly to benchmark categories
  • Quality dimensions benchmarks don't measure (tone, brevity, format)

A model that scores 90% on MMLU (multiple-choice academic questions) might underperform an 85% model on your actual task if that task involves creative writing, code review, or nuanced instruction following — none of which MMLU tests.

Key Finding: In our testing, benchmark ranking order matches real-world task performance only about 60% of the time. For tasks at the extremes (very hard or very easy), benchmark rankings are more predictive. For tasks in the middle range, the correlation is weak.
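One way to quantify "how often benchmark rankings match your task rankings" is a Spearman rank correlation between the two orderings. The sketch below uses stdlib only and illustrative scores (no tie handling) — the numbers are placeholders, not measurements from our testing.

```python
# Sketch: Spearman rank correlation between benchmark scores and your own
# task scores for the same set of models. Illustrative data only.

def spearman(xs, ys):
    """Spearman rho for two score lists with no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

benchmark_scores = [90, 87, 85, 82]      # e.g. MMLU, one entry per model
your_task_scores = [7.1, 8.4, 6.9, 7.5]  # your own rubric, same models

print(spearman(benchmark_scores, your_task_scores))  # 0.0 in this toy case
```

A value near 1 means the benchmark ordering predicts your task ordering; near 0 means it tells you little for your use case.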

What Benchmarks Actually Measure

Benchmark     What It Measures                       Contamination Risk      Reliability
MMLU          Broad knowledge across 57 subjects     High (public dataset)   Medium
HumanEval     Python function generation             Medium                  High for coding
MATH          Competition math problems              Medium                  High for math
GPQA          Expert science questions               Low (proprietary)       High
MT-Bench      Multi-turn instruction following       Low                     Medium-High
HellaSwag     Common sense completion                High                    Low (near-saturated)
TruthfulQA    Resistance to common misconceptions    Medium                  Medium

How to Evaluate Models Yourself

The most reliable evaluation is testing models on your actual tasks. This doesn't require formal rigor — even informal A/B testing of 10–20 prompts reveals meaningful differences.

Step 1: Define Your Use Case Precisely

Don't evaluate "AI models generally" — evaluate them on the specific tasks you need. A model evaluation for a customer service chatbot should test customer service tasks, not general knowledge. Common mistake: using general benchmark rankings as a proxy for specific-task rankings.

Step 2: Create a Test Set of Representative Prompts

Write 15–25 prompts that represent your real-world tasks. Include:

  • Easy cases (both models should handle these correctly)
  • Medium cases (where quality differences will show)
  • Hard cases (edge cases and adversarial inputs)
  • Representative examples of the actual prompts you'll send in production
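A test set can be as simple as a list of prompts tagged by difficulty. The sketch below is one possible structure, with placeholder customer-service prompts — substitute your own production prompts and categories.

```python
# Sketch: a minimal test-set structure for Step 2. All prompts here are
# placeholders for your real production prompts.

test_set = [
    {"prompt": "Summarize this refund policy in two sentences: ...",
     "difficulty": "easy"},
    {"prompt": "Draft a reply to a customer whose order arrived late.",
     "difficulty": "medium"},
    {"prompt": "The customer demands a refund AND a discount AND threatens "
               "a chargeback. Respond within our no-discount policy.",
     "difficulty": "hard"},
]

# Sanity check: every difficulty tier is represented.
tiers = {case["difficulty"] for case in test_set}
assert tiers == {"easy", "medium", "hard"}
print(f"{len(test_set)} prompts across {len(tiers)} tiers")
```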

Step 3: Evaluate Outputs Blindly

Evaluating your own prompts while knowing which model produced which output introduces bias. A simple way to reduce this: have a colleague evaluate outputs without knowing the source, or randomize the order before reviewing.
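The randomize-and-relabel step can be automated so the reviewer only ever sees neutral labels. A minimal sketch, assuming outputs are collected per model ahead of time; the model names are examples:

```python
# Sketch: anonymize model outputs before review (Step 3), so the evaluator
# can't tell which model produced which answer.
import random

def blind_pairs(outputs_by_model):
    """outputs_by_model: {"model_name": [texts, ...], ...}.
    Returns shuffled (label, text) pairs plus a key for unblinding later."""
    labeled = [(name, text)
               for name, texts in outputs_by_model.items()
               for text in texts]
    random.shuffle(labeled)
    key = {f"output_{i}": name for i, (name, _) in enumerate(labeled)}
    blinded = [(f"output_{i}", text) for i, (_, text) in enumerate(labeled)]
    return blinded, key

blinded, key = blind_pairs({"gpt-4o": ["answer 1"], "claude": ["answer 2"]})
for label, text in blinded:
    print(label, "->", text)  # reviewer sees only neutral labels
# keep `key` aside until all scoring is done, then unblind
```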

Step 4: Score on Dimensions That Matter for Your Use Case

Different use cases have different quality dimensions. Define yours before evaluating:

  • Accuracy: Is the information correct?
  • Completeness: Did it address everything in the prompt?
  • Format: Is the output structured appropriately?
  • Tone: Does it match the required register?
  • Conciseness: Is it appropriately brief?
  • Instruction adherence: Did it follow all the constraints?
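One way to combine the dimensions above into a single comparable number is a weighted rubric. The weights below are purely illustrative — pick your own to reflect what matters for your use case.

```python
# Sketch: a weighted rubric over the dimensions listed above (Step 4).
# Weights are illustrative and should sum to 1.

WEIGHTS = {
    "accuracy": 0.3,
    "completeness": 0.2,
    "format": 0.1,
    "tone": 0.1,
    "conciseness": 0.1,
    "instruction_adherence": 0.2,
}

def rubric_score(scores):
    """scores: {dimension: rating on a 1-5 scale}. Returns a weighted 1-5 score."""
    assert set(scores) == set(WEIGHTS), "rate every dimension"
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

example = {"accuracy": 5, "completeness": 4, "format": 5,
           "tone": 3, "conciseness": 4, "instruction_adherence": 5}
print(round(rubric_score(example), 2))  # 4.5
```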

Step 5: Run Multiple Trials

AI outputs are stochastic — the same prompt can produce different results. Run each prompt 3–5 times and average your scores. Single-trial evaluations are misleading.
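Averaging over trials, and tracking the spread, takes only a few lines. In this sketch `run_and_score` is a stand-in for "send the prompt to the model, then rate the output"; it is stubbed with canned scores so the example runs as written.

```python
# Sketch: average scores over repeated trials of the same prompt (Step 5).
from statistics import mean, stdev

canned = iter([4.0, 4.5, 3.5, 4.5, 4.0])  # stand-in scores for the stub

def run_and_score(prompt):
    return next(canned)  # stand-in for: score(model(prompt))

def evaluate(prompt, trials=5):
    scores = [run_and_score(prompt) for _ in range(trials)]
    return mean(scores), stdev(scores)

avg, spread = evaluate("Summarize this support ticket: ...")
print(f"mean={avg:.2f} stdev={spread:.2f}")  # mean=4.10 stdev=0.42
```

A large standard deviation relative to the gap between two models is itself a finding: it means a single trial could easily have ranked them the other way.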

The Case for Multi-Model Comparison

The most practical way to escape benchmark confusion is to run multiple models on your actual prompts simultaneously. Rather than asking "which model is best," ask "which model produces the best output for this specific prompt?"

When you compare GPT-4o and Claude side-by-side on your own tasks, you'll find that the better model varies by prompt type — and the variation will match your needs rather than a benchmark committee's choices.
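The fan-out itself is trivial to script. Here `call_model` is a hypothetical wrapper around whatever provider SDKs or gateway you use — it is stubbed below so the sketch runs standalone:

```python
# Sketch: run one prompt across several models side by side.

def call_model(name, prompt):
    """Hypothetical wrapper around your provider SDKs; stubbed here."""
    return f"[{name}] response to: {prompt}"

def side_by_side(prompt, models):
    return {name: call_model(name, prompt) for name in models}

results = side_by_side("Rewrite this error message for end users: ...",
                       ["gpt-4o", "claude", "gemini"])
for name, output in results.items():
    print(f"--- {name} ---\n{output}\n")
```

In practice you would also run the blinding and rubric-scoring steps from above over `results` rather than eyeballing raw outputs.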

Red Flags When Reading Model Benchmarks

  • Provider-only results without independent reproduction: Self-reported scores without third-party validation are less reliable
  • Narrow benchmark selection: Reporting only the benchmarks where the model excels
  • Sudden dramatic improvements on legacy benchmarks: HellaSwag, WinoGrande, and similar older benchmarks are near-saturated — improvements there may reflect contamination
  • Scores without error bars: Statistical uncertainty is real; point estimates without confidence intervals overstate precision

Frequently Asked Questions

Are MMLU scores meaningful at all?

Yes, with caveats. MMLU is a useful rough signal of general knowledge breadth. The relative ordering of models is broadly predictive. But the absolute numbers are inflated by contamination, and the small differences between models (87% vs 88%) shouldn't be treated as meaningful.

Which benchmarks are most trustworthy?

GPQA has lower contamination risk because it uses proprietary, non-public questions. HumanEval is reliable for coding tasks because it tests functional correctness (code that runs vs. code that doesn't). MATH is relatively trustworthy. MT-Bench is useful for conversational quality.

How do I know if a model was trained on a benchmark?

You usually can't know for certain. Signs include: unusually high scores on public benchmarks compared to newer private benchmarks; performance that drops significantly on rephrased versions of benchmark questions; dramatic score improvements without apparent architectural changes.

Is Chatbot Arena a better benchmark?

Chatbot Arena (lmsys.org) uses human preference judgments from real conversations — which has better ecological validity than multiple-choice tests. It's less susceptible to contamination. The limitation is that it measures average preference, which may not predict performance on your specific task.

LLM benchmarks · evaluation · AI testing · methodology

See it for yourself

Run any prompt across ChatGPT, Claude, Gemini, and 300+ other models simultaneously. Free to try, no credit card required.

Try Deepest free →
