Chapter 33: Evaluation - Why Accuracy Is Not Enough

A model achieves 95% accuracy on the test set. Is it good? It depends. If the task is classifying cats vs dogs, 95% is excellent. If the task is diagnosing cancer, 95% might be catastrophic—missing 5% of cases could mean thousands of deaths.

Accuracy is the most common metric in machine learning papers, and it is one of the least informative. It collapses all failure modes into a single number, hiding what matters: which errors the model makes, how often, and at what cost. Production systems require richer evaluation that captures real-world constraints.

This chapter explains why accuracy misleads, why benchmarks are poor proxies for deployment performance, and how to evaluate models in ways that matter.


Train vs Test: Why Validation Matters

The purpose of evaluation is to measure how a model generalizes to unseen data. If you evaluate on training data, the model gets 100% accuracy by memorizing. To measure generalization, you must evaluate on held-out test data.

Train/validation/test split:

Training set (70-80%): Used to train the model. The model sees these examples and adjusts weights to minimize loss.

Validation set (10-15%): Used to tune hyperparameters (learning rate, regularization). The model does not train on validation data, but you use validation performance to make decisions (which hyperparameters to use, when to stop training).

Test set (10-20%): Used only once, at the end, to report final performance. The model never sees test data during training or tuning. Test accuracy is your estimate of real-world performance.

Why three sets? If you tune hyperparameters using test data, you are implicitly fitting the test set—information leaks from test to training. The model appears to generalize, but you have overfit to the test distribution. A separate validation set lets you tune without contaminating the test set.
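The three-way split above can be sketched in a few lines. This is a minimal illustration with hypothetical proportions, assuming independent and identically distributed examples; real pipelines typically use library helpers (and stratified variants for imbalanced labels).

```python
import numpy as np

def three_way_split(n, train=0.8, val=0.1, seed=0):
    """Shuffle indices once, then carve out disjoint train/val/test slices.
    Illustrative sketch; not a substitute for stratified or temporal splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(1000)
# The three sets are disjoint and together cover every example exactly once.
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == 1000
```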

Overfitting to the test set happens when you evaluate repeatedly. Imagine you train 100 models, test them all, and report the best. You are selecting for test performance, which means you are fitting the test set. The reported accuracy is optimistic—it does not reflect performance on truly unseen data.

Academic benchmarks suffer from this. ImageNet has been used for a decade. Thousands of papers report results. Researchers tune architectures, hyperparameters, and data augmentation until ImageNet accuracy is maximized. The test set is no longer held-out—it has been implicitly fit through repeated evaluation. Reported “SOTA” results are partially artifacts of overfitting.

Cross-validation reduces overfitting to a single test split. K-fold cross-validation divides data into K folds (e.g., 5). Train on K-1 folds, test on the remaining fold. Repeat K times, rotating which fold is the test set. Average the results.

Cross-validation gives a more robust estimate of performance, but it is K times more expensive (train K models instead of 1). For large datasets or large models, this is prohibitive. For small datasets, cross-validation is essential.
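As a sketch of the rotation described above, the fold indices can be generated by hand (in practice `sklearn.model_selection.KFold` does this; the function name here is illustrative):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each fold serves as the test set exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Across the 5 rotations, every example lands in exactly one test fold.
test_sets = [set(te) for _, te in kfold_indices(100, k=5)]
assert sum(len(s) for s in test_sets) == 100
```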

Temporal splits are critical for time-series data. If you randomly split time-series data, the model sees future data points in training and past data points in testing. This is leakage—the model learns from the future. Correct splits respect time: train on data before time T, test on data after time T.

Example: A stock price prediction model must be trained on 2020-2022 data and tested on 2023 data. Randomly mixing 2020-2023 data across train/test is cheating. The model will learn patterns from 2023 and appear to generalize when it does not.
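A temporal split for this example reduces to a boolean mask on timestamps — no shuffling, because shuffling is exactly what leaks the future into training. The yearly data here is hypothetical:

```python
import numpy as np

# Hypothetical timestamps: 250 observations per year, 2020-2023.
years = np.repeat([2020, 2021, 2022, 2023], 250)

# Temporal split: everything strictly before the cutoff trains, the rest tests.
cutoff = 2023
train_mask = years < cutoff
test_mask = ~train_mask

# Sanity check: no training example postdates any test example.
assert years[train_mask].max() < years[test_mask].min()
```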


Accuracy Is Not Enough: Cost-Sensitive Errors

Accuracy measures the fraction of predictions that are correct:

\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}}


Accuracy treats all errors equally. But not all errors have equal cost.

Example: Medical diagnosis

A cancer detection model classifies scans as “cancer” or “no cancer.” The test set has 1,000 scans: 950 healthy, 50 cancerous (5% prevalence).

Model A predicts “no cancer” for all scans. Accuracy: 950/1000 = 95%. The model is 95% accurate but useless—it missed every cancer case.

Model B predicts “cancer” for 100 scans: 45 true positives, 55 false positives (healthy patients flagged as cancer), 5 false negatives (cancers missed). Accuracy: (45 + 895) / 1000 = 94%.

Model B has lower accuracy than Model A, but it is far better. It catches 45/50 cancers (90% sensitivity). Model A catches 0/50 (0% sensitivity).

Accuracy is misleading because the classes are imbalanced (5% positive, 95% negative). A model that always predicts the majority class achieves high accuracy while being worthless.
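The chapter's numbers can be verified directly. This sketch encodes Model A (always "no cancer") and Model B (45 true positives, 55 false positives, 5 false negatives) as label arrays:

```python
import numpy as np

# Test set from the text: 1,000 scans, 50 cancerous (label 1), 950 healthy.
y_true = np.array([1] * 50 + [0] * 950)

# Model A: predicts "no cancer" for every scan.
pred_a = np.zeros_like(y_true)
# Model B: catches 45 of 50 cancers, but also flags 55 healthy patients.
pred_b = np.concatenate(
    [np.ones(45), np.zeros(5), np.ones(55), np.zeros(895)]
).astype(int)

acc_a = (pred_a == y_true).mean()      # 0.95 -- yet useless
acc_b = (pred_b == y_true).mean()      # 0.94 -- yet far better
recall_a = pred_a[y_true == 1].mean()  # 0.0: misses every cancer
recall_b = pred_b[y_true == 1].mean()  # 0.90: catches 45 of 50
```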

Precision and recall capture different error modes:

Precision: Of the examples the model predicted positive, how many are actually positive?

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

For Model B: \frac{45}{45 + 55} = 0.45 (45% of flagged scans are actual cancers).

Recall (Sensitivity): Of the actual positive examples, how many did the model catch?

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

For Model B: \frac{45}{45 + 5} = 0.90 (90% of cancers detected).

Precision-recall trade-off: You can increase recall by predicting “cancer” more often, but this reduces precision (more false positives). You can increase precision by predicting “cancer” only when very confident, but this reduces recall (more false negatives).

In medical diagnosis, false negatives (missing cancer) are catastrophic. False positives (flagging healthy patients) are costly but not deadly. You prioritize high recall, accepting lower precision. The model should err on the side of caution—flag more, miss less.

In spam filtering, false positives (marking legitimate email as spam) are more costly than false negatives (letting spam through). You prioritize precision over recall. Better to let some spam through than to lose important emails.
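The trade-off comes from where you place the decision threshold on the model's score. The sketch below uses synthetic scores (positives drawn with a higher mean than negatives — an assumption for illustration) and sweeps the threshold: lowering it raises recall at the expense of precision.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores: positives (label 1) score higher on average.
y = np.array([1] * 50 + [0] * 950)
scores = np.where(y == 1,
                  rng.normal(0.7, 0.15, y.size),
                  rng.normal(0.3, 0.15, y.size))

for t in (0.3, 0.5, 0.7):
    pred = (scores >= t).astype(int)
    tp = ((pred == 1) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    fn = ((pred == 0) & (y == 1)).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

A cancer detector would pick a low threshold (high recall); a spam filter would pick a high one (high precision).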

F1 score is the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

F1 balances precision and recall, but it assumes equal importance. For imbalanced classes or cost-sensitive errors, precision and recall individually are more informative than a single score.
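All three metrics follow from the confusion-matrix counts. A minimal helper, applied to Model B's counts from the text:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Model B from the text: 45 TP, 55 FP, 5 FN.
p, r, f1 = precision_recall_f1(45, 55, 5)
print(p, r, round(f1, 2))  # precision 0.45, recall 0.90, F1 0.6
```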


Distribution Shift: When Metrics Become Meaningless

Models are trained on one dataset and deployed on another. If the deployment distribution differs from the training distribution, test accuracy does not predict deployment performance.

Domain shift occurs when training and deployment come from different domains:

Example: Face recognition

A face recognition model trained on Flickr images (mostly well-lit, frontal faces, high resolution) achieves 99% accuracy on a Flickr test set. Deployed in surveillance cameras (varied lighting, angles, low resolution), accuracy drops to 85%. The test set does not match deployment.

Covariate shift (Chapter 31): The input distribution changes but the relationship between input and output stays the same. Test metrics become unreliable because the test set does not represent the deployment input distribution.

Concept drift (Chapter 31): The relationship between input and output changes. Test metrics are meaningless because the ground truth has changed.

Adversarial distribution shift occurs when adversaries actively manipulate inputs to fool the model:

Example: Fraud detection

A fraud detection model trained on 2022 fraud patterns achieves 98% accuracy on 2022 test data. Deployed in 2023, fraudsters have adapted. They use new tactics (account takeover instead of carding, synthetic identities instead of stolen cards). The model, trained on old patterns, misses new fraud. Test accuracy: 98%. Deployment accuracy: 70%.

The test set does not account for adversarial adaptation. Fraudsters are not random—they optimize to evade detection. Test sets cannot capture this without adversarial red teaming.

Detecting distribution shift requires monitoring production data (Chapter 31): compare feature distributions, track model confidence, measure performance on labeled samples. If distributions drift, retrain. Test accuracy is a snapshot, not a guarantee.
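One simple way to compare feature distributions, as suggested above, is a two-sample Kolmogorov-Smirnov statistic between training-time and production samples of a feature. This is an illustrative sketch with synthetic data (the 0.8-standard-deviation shift is an assumption); `scipy.stats.ks_2samp` provides the same statistic with a p-value in practice.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    all_vals = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), all_vals, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), all_vals, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)   # feature at training time
prod_same = rng.normal(0.0, 1.0, 5000)       # production, no drift
prod_shifted = rng.normal(0.8, 1.0, 5000)    # production, mean has drifted

# Small statistic when distributions match, large when they drift.
assert ks_statistic(train_feature, prod_same) < 0.08
assert ks_statistic(train_feature, prod_shifted) > 0.2
```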


Human Evaluation: Why Humans Stay in the Loop

For many tasks, automated metrics are insufficient. The metric is not well-defined, ground truth is subjective, or what matters cannot be captured by a formula. Human evaluation is the only way to measure quality.

When human evaluation is necessary:

Open-ended generation: Text generation (stories, summaries, dialogue), image generation (DALL-E, Midjourney). Metrics like BLEU (translation), ROUGE (summarization), or perceptual similarity (images) correlate poorly with human judgment. Only humans can judge fluency, coherence, and quality.

Subjective tasks: Content moderation (is this toxic?), sentiment analysis (is this positive?), humor detection (is this funny?). Ground truth is subjective. Inter-human agreement is low. Metrics are noisy proxies.

Safety and alignment: Does the model refuse harmful requests? Does it follow instructions? Does it exhibit bias? These are qualitative judgments requiring human review.

Human evaluation methods:

Pairwise comparison: Show humans two model outputs (Model A and Model B) and ask: “Which is better?” Humans are better at relative judgments than absolute ratings.

Likert scale ratings: “Rate this response from 1 (very bad) to 5 (very good).” Easier to collect than pairwise comparisons but noisier (rating scales are subjective).

Red teaming: Hire experts to adversarially test the model—try to make it fail, generate harmful content, expose biases. Red teamers find edge cases automated metrics miss.

RLHF (Reinforcement Learning from Human Feedback) (Chapter 24): Humans rate model outputs, and these ratings train a reward model that guides further training. Human evaluation is not just measurement—it is the training signal.
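Pairwise judgments are typically aggregated into a win rate with an uncertainty estimate before drawing conclusions. A sketch with hypothetical votes, using a normal-approximation confidence interval (more careful analyses use exact binomial or Bradley-Terry models):

```python
import math

# Hypothetical pairwise judgments: 1 = rater preferred Model A, 0 = Model B.
votes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

n = len(votes)
wins = sum(votes)
win_rate = wins / n
# Normal-approximation 95% confidence interval on the win rate.
se = math.sqrt(win_rate * (1 - win_rate) / n)
lo, hi = win_rate - 1.96 * se, win_rate + 1.96 * se
print(f"A preferred in {win_rate:.0%} of comparisons (95% CI {lo:.2f}-{hi:.2f})")
```

If the interval's lower bound clears 0.5, the preference for Model A is unlikely to be rater noise.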

Challenges of human evaluation:

Cost: Humans are expensive. Evaluating 10,000 model outputs at $0.10 per rating costs $1,000. For large-scale evaluation, this adds up.

Noise: Inter-rater agreement is low for subjective tasks. Different humans give different ratings. Aggregate ratings over multiple raters to reduce noise.

Bias: Human raters have biases (linguistic, cultural, demographic). Ratings reflect rater preferences, not objective quality.

Scale: Automated metrics scale to millions of examples. Human evaluation scales to thousands. You cannot human-evaluate every query in production.
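Inter-rater noise can be quantified before trusting the ratings. Cohen's kappa measures agreement between two raters corrected for the agreement expected by chance; a sketch on hypothetical binary toxicity labels:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance agreement.
    1.0 = perfect agreement, 0.0 = no better than chance."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical toxicity labels from two raters on ten comments.
r1 = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
r2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(r1, r2), 2))  # 0.6: 80% raw agreement, kappa 0.6
```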

Despite these challenges, human evaluation is necessary for tasks where automated metrics fail. The best approach combines both: automated metrics for fast iteration, human evaluation for final quality checks.


Hidden Failure Modes: Rare but Deadly Errors

Aggregate metrics (accuracy, F1, AUC) hide rare failures. A model can have 99% accuracy overall but fail catastrophically on specific subgroups or edge cases.

Long-tail failures are rare cases that matter:

Example: Self-driving cars

A self-driving perception model achieves 99.99% accuracy on pedestrian detection. Sounds excellent. But 0.01% failure on 1 million frames = 100 failures. If even one failure causes a crash, the model is unsafe.

Long-tail events—unusual clothing, occlusions, rare weather, edge-case scenarios—are underrepresented in test sets but critical in deployment. Aggregate accuracy does not capture tail risk.

Subgroup disparities: A model can have high overall accuracy but low accuracy on specific demographic subgroups:

Example: Face recognition

A face recognition model achieves 95% accuracy overall. Broken down by demographics:

  • Light-skinned males: 99% accuracy
  • Dark-skinned females: 65% accuracy

The aggregate metric hides that the model fails on underrepresented subgroups. Deployment causes harm to specific populations while appearing successful on average.

Adversarial examples (Chapter 34): Tiny perturbations that humans do not notice fool the model into wildly wrong predictions. Adversarial accuracy is near 0% for most models, even if standard accuracy is 99%. Test sets do not include adversarial examples unless explicitly constructed.

Stress testing probes for hidden failures:

Checklist-style evaluation (Ribeiro et al., 2020): Manually design test cases covering capabilities (negation, coreference, robustness to typos). Example: Sentiment analysis should correctly handle “not bad” (positive), “not good” (negative), “pretty bad” (negative). Simple accuracy does not catch these.

Counterfactual evaluation: Change one word in the input and measure if the prediction changes correctly. Example: “He is a doctor” → “She is a doctor” should not change predictions in gender-neutral tasks. If it does, the model has learned spurious gender correlations.
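A checklist-style harness is just a list of (input, expected) cases run against the model. The sketch below uses a toy lexicon classifier as a stand-in for a real model (the `toy_sentiment` function and word lists are hypothetical; the structure of the test cases is the point):

```python
# Toy stand-in for a real sentiment model: lexicon lookup with
# polarity flipped when a negator is present.
NEGATORS = {"not", "never"}
POSITIVE = {"good", "great"}

def toy_sentiment(text):
    """Return 'pos' or 'neg' for a short phrase (illustrative only)."""
    words = text.lower().split()
    polarity = "pos" if any(w in POSITIVE for w in words) else "neg"
    if any(w in NEGATORS for w in words):
        polarity = "neg" if polarity == "pos" else "pos"
    return polarity

# Checklist-style capability tests for negation handling.
cases = [("not bad", "pos"), ("not good", "neg"), ("great movie", "pos")]
failures = [(t, e) for t, e in cases if toy_sentiment(t) != e]
assert not failures  # every capability test passes
```

Aggregate accuracy would never surface a model that fails all of these while getting easy cases right.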

Worst-group performance: Report accuracy not just overall but on the worst-performing subgroup. A model with 95% average accuracy and 60% worst-group accuracy is biased. Deploy with caution.
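Computing worst-group accuracy is a per-group aggregation. A sketch with hypothetical group labels and per-example correctness, showing how a healthy-looking overall number can hide a failing subgroup:

```python
import numpy as np

# Hypothetical demographic group labels and per-example correctness.
groups = np.array(["A"] * 800 + ["B"] * 200)
correct = np.concatenate([
    np.ones(790), np.zeros(10),   # group A: 790/800 correct
    np.ones(130), np.zeros(70),   # group B: 130/200 correct
])

overall = correct.mean()
per_group = {g: correct[groups == g].mean() for g in np.unique(groups)}
worst = min(per_group.values())
# Overall 92% looks fine; worst-group 65% reveals the disparity.
assert overall > 0.9 and worst < 0.7
```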


Figure 33.1: Precision-recall trade-off curve. As recall increases (catching more positives), precision decreases (more false positives). The operating point depends on the cost of errors: spam filtering prioritizes precision (avoid flagging legitimate email), cancer detection prioritizes recall (catch all cases, accept false alarms).


Benchmarks Are Proxies, Not Goals

Benchmarks (ImageNet, GLUE, SuperGLUE, SQuAD) are standard datasets used to compare models. They enable fair comparison: same data, same splits, same metrics. But they are proxies for real-world performance, and proxies can mislead.

Benchmark saturation: When many researchers optimize for the same benchmark, performance plateaus. ImageNet top-5 accuracy has surpassed 99%, exceeding estimated human performance. Does this mean computer vision is solved? No. Models fail on:

  • Domain shift: Natural images (ImageNet) vs medical scans, satellite imagery
  • Robustness: Small perturbations, occlusions, adversarial attacks
  • Generalization: Long-tail categories, rare objects, unusual viewpoints

SOTA (State-of-the-Art) does not mean deployment-ready. Achieving SOTA on SuperGLUE does not mean the model understands language—it means the model is good at SuperGLUE tasks. The benchmark is a proxy for understanding, but a noisy one.

Gaming benchmarks: Researchers tune models specifically for benchmark performance, sometimes learning shortcuts that do not generalize. Example: SQuAD (question answering) models learned to exploit biases in question phrasing rather than understanding context. When evaluated on adversarial versions of SQuAD, performance dropped 40%.

Benchmark artifacts: Datasets have biases and shortcuts. Models exploit these rather than learning the intended skill. The benchmark appears solved, but the model has not learned the underlying capability.

Production metrics differ from benchmark metrics:

  • Latency: Benchmarks ignore latency. Production requires <100ms.
  • Robustness: Benchmarks use clean test sets. Production faces noisy, adversarial, out-of-distribution inputs.
  • User satisfaction: Benchmarks measure accuracy. Production cares about engagement, retention, revenue.

Real-world evaluation requires business metrics, not just ML metrics. A model with 95% accuracy that increases revenue by 10% is more valuable than a model with 99% accuracy that increases revenue by 1%. Accuracy is a proxy. Revenue is the goal.


Engineering Takeaway

Accuracy alone is meaningless without context—class imbalance and cost-sensitive errors require precision, recall, and domain-specific metrics. A model can have 95% accuracy and be useless (predicting majority class) or catastrophic (missing critical cases). Precision and recall capture different failure modes. F1 balances them but assumes equal cost. For imbalanced or cost-sensitive problems, report precision, recall, and confusion matrices—not just accuracy.

Validation strategy must match deployment—use temporal splits for time-series, stratified splits for imbalanced data. Randomly splitting time-series data leaks future information into training. Randomly splitting imbalanced data may leave rare classes out of the test set. The test set must resemble deployment. If deployment is time-ordered (fraud detection, stock prediction), test sets must be time-ordered. If deployment has rare but critical cases (medical diagnosis, safety systems), test sets must include them.

Distribution shift invalidates test metrics—monitor production performance continuously, retrain when drift is detected. Test accuracy measures generalization to the test distribution, not the deployment distribution. When deployment distribution shifts (covariate shift, concept drift, adversarial adaptation), test metrics become stale. Production monitoring is the real test. Log predictions, sample labels, measure accuracy on recent data. When accuracy degrades, retrain or adapt. Test metrics are predictions, production metrics are ground truth.

Rare cases matter most in high-stakes domains—measure worst-group performance, not just average performance. A model with 95% average accuracy and 60% accuracy on a minority group is biased. Aggregate metrics hide disparities. Report accuracy broken down by subgroups (demographics, geographies, edge cases). For safety-critical systems, worst-case performance matters more than average. One catastrophic failure outweighs 1,000 successes.

Human evaluation is essential for open-ended tasks—automated metrics correlate poorly with quality for generation and subjective tasks. BLEU score does not capture fluency. Perceptual similarity does not capture artistic quality. Sentiment scores do not capture nuance. For generation (text, images, music), content moderation, and alignment, humans must judge quality. Human evaluation is expensive and noisy, but it is the ground truth automated metrics approximate. Combine automated metrics (cheap, scalable) with human evaluation (expensive, accurate) for robust assessment.

Benchmarks are proxies, not goals—SOTA on ImageNet does not mean vision is solved. Benchmarks enable comparison but do not capture deployment constraints. A model that achieves SOTA on GLUE but fails on out-of-domain text is not useful. A model that achieves 99% accuracy on ImageNet but takes 10 seconds to run is not deployable. Benchmark performance is a starting point, not an end goal. Production evaluation requires latency, robustness, fairness, and business metrics—not just benchmark scores.

Real-world evaluation requires business metrics—accuracy is a proxy for what actually matters (engagement, revenue, safety). ML models are not ends in themselves—they serve business goals. A search ranking model is valuable if it increases clicks and revenue, not if it improves NDCG. A recommendation model is valuable if it increases watch time and retention, not if it improves AUC. Business metrics are noisy and hard to measure, but they are what matter. Optimize ML metrics as proxies, but validate with business metrics.


References and Further Reading

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., & Gardner, M. (2021). EMNLP 2021

Why it matters: This paper examines the C4 dataset (used to train T5 and many other models) and documents biases, quality issues, and artifacts. It shows that even “clean” datasets contain offensive content, misinformation, and spam. The paper argues that dataset documentation is essential for understanding model behavior—without knowing what data the model trained on, you cannot explain its failures. It introduced a framework for dataset documentation that has influenced how researchers report data provenance.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). ACL 2020

Why it matters: CheckList is a methodology for testing NLP models through capability-focused test cases (negation, coreference, robustness to typos). Rather than relying on aggregate accuracy, CheckList measures performance on specific linguistic phenomena. The paper shows that models with high accuracy on standard benchmarks fail simple CheckList tests, revealing hidden weaknesses. CheckList has been widely adopted for systematic testing of language models, showing that accuracy is insufficient for measuring model quality.

A Closer Look at Accuracy vs. Robustness Taori, R., Katariya, V., Yaghmaie, A., Recht, B., & Schmidt, L. (2020). arXiv:2004.06524

Why it matters: This paper investigates the trade-off between standard accuracy and robustness to distribution shift. It shows that models with high ImageNet accuracy often have low accuracy on shifted distributions (ImageNet-C, ImageNet-A). The paper challenges the assumption that higher benchmark accuracy means better models—robustness matters as much as accuracy, but benchmarks do not measure it. This work motivated research into robustness as a first-class evaluation criterion, not an afterthought.


The next chapter examines why models fail in scary ways: hallucinations that confidently generate false information, biases that amplify discrimination, adversarial examples that break perception, and brittleness that causes catastrophic failures on edge cases.