AI Foundations


Evaluation: Metrics That Matter

A model that achieves 99% accuracy on a cancer screening dataset might be worthless — or even harmful. How? If only 1% of patients in the dataset have cancer, a model that simply predicts 'no cancer' for every single patient achieves 99% accuracy without ever detecting a single case. Accuracy, the most intuitive metric, is also the most misleading on imbalanced datasets. Choosing the right evaluation metric is not a technicality — it is the difference between a model that helps people and one that creates dangerous false confidence.
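The arithmetic behind this claim can be checked with a short sketch. The dataset below is hypothetical: 1,000 patients with a 1% cancer rate, chosen only to match the proportions described above, and the "model" is simply a constant prediction.

```python
# Hypothetical screening dataset: 1,000 patients, 1% of whom have cancer.
labels = [1] * 10 + [0] * 990        # 1 = cancer, 0 = no cancer
predictions = [0] * len(labels)      # the "model" always predicts no cancer

# Accuracy: fraction of predictions that match the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Number of actual cancer cases the model flagged.
detected = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"accuracy = {accuracy:.0%}")                            # accuracy = 99%
print(f"cancer cases detected = {detected} of {sum(labels)}")  # 0 of 10
```

The accuracy figure looks excellent while the count of detected cases is zero, which is the gap the rest of this lesson addresses.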

The Confusion Matrix

For a binary classification problem, every prediction falls into one of four categories, organized into a confusion matrix.

True Positive (TP): the model predicted positive and the label is positive. A correct detection.
False Positive (FP): the model predicted positive but the label is negative. A false alarm.
True Negative (TN): the model predicted negative and the label is negative. A correct rejection.
False Negative (FN): the model predicted negative but the label is positive. A missed detection.

These four numbers are the foundation of every binary classification metric. They capture not just how often the model is right (accuracy) but the character of its errors: which types of mistakes it makes most often.

Example: a model is evaluated on 1,000 patients. 50 have a disease; 950 do not. Results: TP=40, FP=30, TN=920, FN=10. Total correct: 960/1000 = 96% accuracy. But 10 of the 50 disease cases were missed, a 20% miss rate, and 30 healthy patients received a false alarm.
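As a minimal sketch of how the four counts come out of raw predictions, the snippet below rebuilds the 1,000-patient example. The function name `confusion_counts` and the 1/0 label encoding are illustrative assumptions, not part of any particular library.

```python
def confusion_counts(labels, predictions):
    """Return (tp, fp, tn, fn) for binary data where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if p == 1 and y == 1:
            tp += 1      # correct detection
        elif p == 1 and y == 0:
            fp += 1      # false alarm
        elif p == 0 and y == 0:
            tn += 1      # correct rejection
        else:
            fn += 1      # missed detection (assumes values are only 0 or 1)
    return tp, fp, tn, fn

# Rebuild the example: 50 patients with the disease, 950 without.
labels = [1] * 50 + [0] * 950
predictions = [1] * 40 + [0] * 10 + [1] * 30 + [0] * 920

tp, fp, tn, fn = confusion_counts(labels, predictions)
print(tp, fp, tn, fn)                 # 40 30 920 10
print((tp + tn) / len(labels))        # 0.96
```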

Reading a Confusion Matrix

Always look at the full confusion matrix before reporting any single metric. A high overall accuracy can hide terrible performance on the minority class. The confusion matrix shows you exactly where your model is failing and what kind of errors it is making.

From the confusion matrix, four key metrics are derived.

Accuracy: (TP + TN) / (TP + FP + TN + FN). Fraction of all predictions that are correct. Easy to understand but misleading on imbalanced datasets.

Precision: TP / (TP + FP). Of all the cases the model flagged as positive, what fraction actually were? High precision means few false alarms. In the disease example: 40 / (40+30) = 0.571, or 57.1%.

Recall (also called Sensitivity or True Positive Rate): TP / (TP + FN). Of all actual positives, what fraction did the model detect? High recall means few missed cases. In the example: 40 / (40+10) = 0.80, or 80%.

F1 Score: the harmonic mean of precision and recall, computed as 2 * (Precision * Recall) / (Precision + Recall). F1 is a single number that balances both. In the example: 2 * (0.571 * 0.80) / (0.571 + 0.80) = 0.667. The harmonic mean penalizes large imbalances between precision and recall more strongly than the arithmetic mean would, making F1 a better summary when both matter.
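The four formulas can be sketched directly from the counts in the disease example. The function name `classification_metrics` is illustrative, and returning 0.0 when a denominator is zero is an assumed convention rather than something the lesson specifies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the disease example: TP=40, FP=30, TN=920, FN=10.
m = classification_metrics(tp=40, fp=30, tn=920, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.960
# precision: 0.571
# recall: 0.800
# f1: 0.667

# For comparison, the arithmetic mean of precision and recall is higher than F1.
arithmetic_mean = (m["precision"] + m["recall"]) / 2
print(f"arithmetic mean: {arithmetic_mean:.3f}")   # arithmetic mean: 0.686
```

The harmonic mean (0.667) sits closer to the weaker of the two values than the arithmetic mean (0.686) does, which is the penalty for imbalance described above.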

Choosing the Right Metric

Which metric should you optimize? The answer comes from understanding the cost of each error type in your specific problem.

For a disease screening test: a false negative (missed diagnosis) might mean a patient goes untreated and deteriorates. A false positive means an unnecessary follow-up test, which is stressful and expensive but not dangerous. Here, recall is paramount. You want to catch as many true positives as possible, even at the cost of more false alarms.

For a spam filter: a false positive (legitimate email flagged as spam) means the user misses an important message. A false negative (spam reaching the inbox) is annoying but recoverable. Here, precision is more important, because false alarms are the costlier error.

For a content moderation system reviewing posts for violence: both types of errors have real costs. False positives suppress legitimate speech. False negatives allow harm. The F1 score, or a weighted average of precision and recall, balances the trade-off.

For regression problems, different metrics apply. Mean Absolute Error (MAE) averages the magnitude of the errors without squaring them, so each error counts in proportion to its size. Mean Squared Error (MSE) squares the errors before averaging, penalizing large mistakes more heavily. Root Mean Squared Error (RMSE) is the square root of MSE, which returns the metric to the same units as the target variable. Which to use depends on whether large errors are disproportionately costly; if so, use MSE or RMSE.
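The difference between the regression metrics is easiest to see on two sets of predictions that have the same total absolute error. The values below are made up for illustration, and the helper names `mae`, `mse`, and `rmse` are simply plain-Python versions of the definitions above.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |error|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of error squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical targets and two sets of predictions with the same total error.
y_true = [10.0, 12.0, 14.0, 16.0]
many_small = [11.0, 13.0, 15.0, 17.0]   # every prediction is off by 1
one_large = [10.0, 12.0, 14.0, 20.0]    # a single prediction is off by 4

print(mae(y_true, many_small), mae(y_true, one_large))    # 1.0 1.0
print(mse(y_true, many_small), mse(y_true, one_large))    # 1.0 4.0
print(rmse(y_true, many_small), rmse(y_true, one_large))  # 1.0 2.0
```

MAE cannot tell the two models apart, while MSE and RMSE both rank the model with the single large miss as worse.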

Metric Hacking

Any metric can be gamed. A binary classifier can achieve perfect recall by predicting positive for every example: it will catch every true positive, but its precision will fall to the share of positives in the data. Always report multiple metrics together, and always connect your chosen metric back to the actual cost of each error type in your problem.
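As a quick illustration of gaming recall, the sketch below applies an always-positive "model" to data with the same class balance as the earlier disease example (50 positives out of 1,000). The data is synthetic and used only to show the arithmetic.

```python
# Same class balance as the disease example: 50 positives, 950 negatives.
labels = [1] * 50 + [0] * 950
predictions = [1] * len(labels)      # predict positive for every example

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"recall = {recall:.2f}")        # recall = 1.00
print(f"precision = {precision:.2f}")  # precision = 0.05
```

Reported alone, the recall of 1.00 would look ideal; reported next to a precision of 0.05, it is clearly the result of flagging everything.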

Match each metric to what it directly measures.

Terms

Accuracy
Precision
Recall
F1 Score
RMSE

Definitions

Fraction of positive predictions that are genuinely positive
Fraction of all predictions that are correct
Fraction of actual positives that the model detected
Harmonic mean of precision and recall
Square root of the average squared prediction error for regression


A fraud detection model reviews 10,000 transactions. 100 are fraudulent. The model flags 90 fraudulent and 200 non-fraudulent transactions as fraud. What is the model's precision on fraud detection?

Why is the F1 score computed as the harmonic mean of precision and recall, rather than the simple arithmetic mean?

Choose and Justify Your Metric

  1. Read each scenario below. For each one: (1) compute the confusion matrix numbers from the information given, (2) compute accuracy, precision, and recall, (3) state which metric you would use to evaluate this model and why, and (4) state whether the model's performance is acceptable given the stakes.
  2. Scenario A: A wildfire early-warning system monitors satellite data for 500 grid cells per day. On a given day, 10 cells are actually on fire. The model flags 9 of them (correctly) and also flags 15 non-fire cells as fire.
  3. Scenario B: An email provider's spam filter processes 10,000 emails per day. 500 are spam. The filter correctly identifies 480 as spam and incorrectly flags 50 legitimate emails as spam.
  4. For each scenario, write your reasoning in 3-5 sentences, connecting the metric choice to the real-world cost of each error type.