Evaluating Models Honestly
A hospital trains a model to detect a rare cancer that affects 1% of the screened population. The model predicts 'no cancer' for every patient. Its accuracy is 99%. Is this a good model? Obviously not — it misses every cancer case. Accuracy, the most commonly reported metric, is systematically misleading whenever classes are imbalanced or the costs of different mistakes are unequal. Honest model evaluation requires a toolkit of metrics and a discipline around how evaluation data is used.
The Confusion Matrix
For a binary classifier (positive vs. negative), every prediction falls into one of four cells:
- True Positive (TP): model predicts positive, true label is positive. Correct.
- True Negative (TN): model predicts negative, true label is negative. Correct.
- False Positive (FP): model predicts positive, true label is negative. Also called a Type I error or false alarm.
- False Negative (FN): model predicts negative, true label is positive. Also called a Type II error or miss.

Accuracy = (TP + TN) / (TP + TN + FP + FN). In the cancer example, with 10,000 patients screened: TN=9900, TP=0, FP=0, FN=100. Accuracy = 9900/10000 = 99%, but the model has never correctly identified a single cancer case.

Precision measures how trustworthy positive predictions are: Precision = TP / (TP + FP). Of all patients the model flags as positive, what fraction truly have cancer? A low-precision model raises many false alarms; in a screening context this means unnecessary biopsies and patient anxiety.

Recall (also called sensitivity or true positive rate) measures how many actual positives were found: Recall = TP / (TP + FN). Of all patients who truly have cancer, what fraction did the model catch? A low-recall model misses cases, which is potentially fatal in a screening context.

There is a fundamental trade-off: adjusting the decision threshold generally moves precision and recall in opposite directions. If you lower the threshold for calling something positive (making the model more aggressive), you catch more true positives (recall increases) but also more false positives (precision usually decreases). If you raise the threshold, precision usually increases but recall falls.

F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean punishes extreme imbalance between precision and recall more than the arithmetic mean would: a model with precision=1.0 and recall=0.01 gets F1≈0.02, not the arithmetic mean of about 0.5. This makes F1 a useful single summary when both precision and recall matter.
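The formulas above can be checked with a few lines of Python. This is a minimal sketch, not a library API: the function names `confusion_metrics` and `f1_from` are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention (precision is strictly undefined for a model that never predicts positive).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero, by convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# The "always predict no cancer" model on 10,000 screened patients.
cancer = confusion_metrics(tp=0, tn=9900, fp=0, fn=100)
print(cancer)  # accuracy is 0.99 while recall and F1 are 0.0

# Extreme imbalance between precision and recall.
print(round(f1_from(1.0, 0.01), 4))  # 0.0198
```

Seen side by side, the 99% accuracy and the zero recall describe the same model, which is why accuracy alone is not enough here.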
For multiclass problems, precision, recall, and F1 are computed per class, then averaged. Macro averaging treats all classes equally (average of per-class scores). Weighted averaging weights each class's score by its frequency in the dataset.
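The difference between the two averaging schemes is easiest to see on an imbalanced example. The sketch below uses invented per-class F1 scores and class counts; `average_scores` is a hypothetical helper, not a library function.

```python
def average_scores(per_class_score, support):
    """Return (macro, weighted) averages of a per-class metric.

    per_class_score: dict mapping class name -> score for that class
    support: dict mapping class name -> number of true examples of that class
    """
    classes = list(per_class_score)
    # Macro: every class counts equally, regardless of size.
    macro = sum(per_class_score[c] for c in classes) / len(classes)
    # Weighted: each class counts in proportion to its frequency.
    total = sum(support[c] for c in classes)
    weighted = sum(per_class_score[c] * support[c] for c in classes) / total
    return macro, weighted


# Hypothetical three-class problem where the rare class is handled badly.
f1_by_class = {"common": 0.95, "uncommon": 0.80, "rare": 0.20}
support = {"common": 900, "uncommon": 90, "rare": 10}

macro_f1, weighted_f1 = average_scores(f1_by_class, support)
print(round(macro_f1, 3), round(weighted_f1, 3))  # 0.65 0.929
```

The weighted average barely registers the poor score on the rare class, while the macro average drops sharply. Which one to report depends on whether rare classes matter as much as common ones in the application.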
Precision and recall are not symmetrically important in every application. In cancer screening, a false negative (missed cancer) is catastrophic; optimize for recall. In spam filtering, a false positive (legitimate email flagged as spam) is highly disruptive; optimize for precision. Always ask: which error is more costly in this specific context?
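One way to act on this is to choose the decision threshold from the error you can least afford. The sketch below is illustrative only: the scores and labels are invented, both function names are hypothetical, and reporting precision as 0.0 when nothing is flagged is an assumed convention.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Assumed convention: 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def highest_threshold_with_recall(min_recall, scores, labels):
    """Highest observed score that, used as threshold, reaches min_recall (or None)."""
    for threshold in sorted(set(scores), reverse=True):
        _, recall = precision_recall_at(threshold, scores, labels)
        if recall >= min_recall:
            return threshold
    return None


# Invented model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for t in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")

# A screening-style requirement: never miss a positive.
screening_threshold = highest_threshold_with_recall(1.0, scores, labels)
print(screening_threshold)  # 0.4
```

On this toy data, lowering the threshold from 0.85 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.57. A spam filter would run the same search with a minimum precision instead.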
Additional metrics for specific contexts.

ROC curve (Receiver Operating Characteristic): plots True Positive Rate (recall) against False Positive Rate at every possible threshold. A perfect model's curve hugs the top-left corner. AUC (Area Under the Curve) summarizes the entire curve as a single number: 0.5 = random guessing, 1.0 = perfect. AUC measures how well the model ranks positives above negatives, independent of any specific threshold choice.

Precision-Recall curve: more informative than the ROC curve for highly imbalanced datasets, because neither precision nor recall counts true negatives. When negatives are abundant, true negatives dominate accuracy and keep the false positive rate low, which can make the ROC curve look better than the model's positive predictions deserve.

Regression metrics: for continuous-output models, Mean Absolute Error (MAE = average of |y_pred - y_true|) and Root Mean Squared Error (RMSE = sqrt(average of (y_pred - y_true)^2)) are standard. RMSE penalizes large errors more heavily due to squaring, making it sensitive to outliers.

Calibration: a model is well-calibrated if, among all predictions made with probability p, approximately a fraction p are actually correct. A model that says '90% confidence' for things that are only right 60% of the time is poorly calibrated, which is dangerous in high-stakes settings. Calibration can be visualized with reliability diagrams.

Test-set discipline is as important as metric choice. The fundamental rule: the test set must never influence any decision made during training or validation. If you evaluate on the test set and then modify your model based on the result, even once, the test set has leaked into your model selection process and the reported performance is optimistic. The correct protocol: split data into train, validation, and test before any modeling. Use validation for hyperparameter tuning and architecture selection. Report test set performance exactly once, at the end, as your honest estimate of real-world performance.
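Returning to the metric definitions in this section, two of them are short enough to sketch directly. The code below computes AUC through its ranking interpretation (the fraction of positive/negative pairs in which the positive gets the higher score, with ties counted as half) and the two regression errors. The function names and all data are invented for illustration; the pairwise loop is quadratic and meant for understanding, not for large datasets.

```python
import math


def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def mae(y_pred, y_true):
    """Mean Absolute Error."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def rmse(y_pred, y_true):
    """Root Mean Squared Error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))


# One positive (0.7) is ranked below one negative (0.8): 3 of 4 pairs are correct.
print(auc_by_ranking([0.9, 0.8, 0.7, 0.3], [1, 0, 1, 0]))  # 0.75

# Three small errors and one outlier of size 8.
y_true = [10.0, 10.0, 10.0, 10.0]
y_pred = [11.0, 9.0, 10.0, 18.0]
print(mae(y_pred, y_true))             # 2.5
print(round(rmse(y_pred, y_true), 3))  # 4.062
```

The single outlier pulls RMSE well above MAE, which is the sensitivity to large errors described above.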
Data leakage — any pathway by which information about the test labels influences training or model selection — is one of the most common sources of inflated published results.
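A minimal sketch of the split-first protocol follows, using only the standard library. The 70/15/15 proportions, the function names, and the toy data are assumptions for illustration. The sketch also fits a preprocessing step (standardization) on the training split only; this extends the same discipline to test-set features, a commonly recommended precaution rather than something the definition above strictly requires.

```python
import random


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve out test and validation before any modeling."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_standardizer(values):
    """Compute mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics; never refit on validation or test."""
    return [(v - mean) / std for v in values]


data = [float(i) for i in range(100)]  # toy single-feature dataset
train, val, test = three_way_split(data)

mean, std = fit_standardizer(train)    # statistics come from train only
train_z = standardize(train, mean, std)
val_z = standardize(val, mean, std)    # used for tuning decisions
test_z = standardize(test, mean, std)  # touched once, for the final report
```

The key design choice is ordering: the split happens before any statistic is computed, so nothing derived from the validation or test rows can reach training.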
Evaluation in the Wild
Published benchmark results are not the same as real-world performance. Several gaps must be understood.

Distribution shift: benchmark test sets reflect the distribution of data at collection time. Real-world data changes: a news sentiment model trained in 2020 may struggle with new vocabulary and events in 2026. Evaluation on a static benchmark cannot capture this.

Shortcuts and spurious correlations: models sometimes achieve high benchmark accuracy by learning features that correlate with labels in the training and test data but are not genuinely predictive in deployment. A famous example: a skin cancer classifier that achieved high accuracy partly by learning to detect the presence of rulers, which clinicians often photograph alongside lesions they find concerning, making the ruler a spurious correlate of the 'malignant' label. The model learned the wrong thing but evaluated well because the spurious feature appeared in both train and test.

Benchmark saturation: when models achieve near-human performance on a benchmark, it may indicate the benchmark is too easy rather than that the problem is solved. Researchers periodically retire saturated benchmarks and create harder ones.

Fairness audits: aggregate metrics may hide disparate performance across demographic subgroups. A facial recognition model might achieve 99% accuracy overall but 65% accuracy on dark-skinned women, a serious equity problem invisible in aggregate reporting. Disaggregated evaluation, meaning reporting metrics separately by subgroup, is an ethical requirement for deployed systems that affect people.
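Disaggregated evaluation needs very little machinery. The sketch below computes accuracy per subgroup next to the aggregate; the group names "A" and "B", the data, and the function name `accuracy_by_group` are all invented for illustration, and the same pattern applies to precision, recall, or any other metric.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} so subgroup gaps are visible, not averaged away."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical evaluation set: group B is only 10% of the data.
groups = ["A"] * 90 + ["B"] * 10
y_true = [1] * 100
y_pred = [1] * 88 + [0] * 2 + [1] * 6 + [0] * 4

overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(overall)    # 0.94
print(per_group)  # group A is near 0.98, group B is 0.6
```

The aggregate of 94% looks acceptable, yet the smaller group sees 60% accuracy. Because small groups contribute little to the aggregate, the gap only appears when the metric is reported per group.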
Evaluating only on the benchmark test set tells you how well the model performs on that specific distribution. It tells you almost nothing about performance on user data from a different geography, time period, or demographic. Always evaluate on data that matches your actual deployment population before shipping.
A fraud detection model flags 100 transactions as fraudulent. 80 are genuine fraud; 20 are legitimate transactions incorrectly flagged. What is the model's precision on the fraud class?
A researcher tunes 50 hyperparameter combinations by evaluating each on the test set, then reports the best result as the model's performance. What is wrong with this approach?
Audit a Confusion Matrix
- Step 1. A model screens applicants for a scholarship. In a validation set of 1000 applicants: 200 were truly qualified; 800 were not. The model predicted 'qualified' for 150 truly qualified and 100 unqualified applicants.
- Step 2. Fill in the confusion matrix: compute TP, TN, FP, FN.
- Step 3. Compute accuracy, precision, recall, and F1 for the 'qualified' class.
- Step 4. The foundation argues that missing a qualified applicant (FN) is worse than a false alarm (FP). Given this, which metric should they optimize?
- Step 5. Suppose the model performs significantly worse for applicants from rural areas than urban areas. The overall metrics look fine. What should the foundation demand before deploying this model?
- Step 6. Write one paragraph summarizing what the metrics tell you and one concrete recommendation for the foundation.