AI Foundations

⏱ About 15 min · 15 XP

Accuracy: How Good Is It?

Imagine a doctor who always tells patients they are healthy — for every single patient, no matter what. If 95% of patients actually are healthy, this doctor would have 95% accuracy. That sounds impressive until you realize the doctor never catches any illness at all. This is the problem with measuring a model's performance using only accuracy. It is a useful starting point, but it is rarely the whole story.

Accuracy as a Starting Point

Accuracy is the simplest performance metric for a classification model. It answers a straightforward question: out of all the predictions the model made, what fraction were correct?

Accuracy = (number of correct predictions) ÷ (total number of predictions)

If a model correctly labels 87 out of 100 images, its accuracy is 87/100 = 0.87, or 87%. Higher is better, and 100% would mean every prediction was right.

For many balanced problems, where each category appears roughly equally often, accuracy is a reasonable metric. A model that correctly identifies 90% of dog photos and 90% of cat photos in a mixed dataset with equal numbers of each is genuinely performing well, and accuracy captures that.
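The accuracy formula can be sketched in a few lines of Python. This is a minimal illustration, not a library function: the name `accuracy` and the cat/dog labels are made up for the example, and the data is arranged so that 87 of 100 predictions match, mirroring the 87% figure in the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical data: 50 cat photos followed by 50 dog photos.
true_labels = ["cat"] * 50 + ["dog"] * 50
# The model gets 45 cats and 42 dogs right, so 87 of 100 are correct.
predicted = ["cat"] * 45 + ["dog"] * 5 + ["dog"] * 42 + ["cat"] * 8

print(accuracy(true_labels, predicted))  # 0.87
```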

Definition: Accuracy

Accuracy is the proportion of predictions a model got correct. It equals the number of correct predictions divided by the total number of predictions, and is often expressed as a percentage.

The doctor problem from the introduction reveals when accuracy breaks down: imbalanced datasets. Imbalanced means one category is much more common than the others. In medical screening, healthy patients vastly outnumber sick ones. In fraud detection, legitimate transactions vastly outnumber fraudulent ones. In these cases, a model that always predicts the majority class will have high accuracy while being useless for the actual task.

To handle this, machine learning practitioners use additional metrics that break down the model's performance more carefully. Two key concepts are:

Precision: of all the times the model predicted 'positive' (e.g., disease present), how often was it actually right? High precision means few false alarms.

Recall: of all the cases that were actually positive, how many did the model catch? High recall means few cases missed.

A perfect model would have both high precision and high recall. In practice, there is often a trade-off: a very cautious model that only predicts 'positive' when it is almost certain will have high precision but low recall. An aggressive model that flags anything suspicious will have high recall but low precision.
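To make precision and recall concrete, here is a small Python sketch that computes both for three imagined screening models. Everything in it is an assumption for illustration: the helper names, the dataset of 100 patients (5 sick, 95 healthy), and the three prediction lists. It is not how any particular library does it.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of positive predictions that were right (None if there were none)."""
    return tp / (tp + fp) if (tp + fp) > 0 else None


def recall(tp, fn):
    """Share of actual positives the model caught (None if there were none)."""
    return tp / (tp + fn) if (tp + fn) > 0 else None


# Hypothetical dataset: 1 = sick, 0 = healthy. 5 sick patients, 95 healthy.
y_true = [1] * 5 + [0] * 95

# Three hypothetical models, each described by its predictions.
models = {
    "always healthy": [0] * 100,                # the doctor from the introduction
    "cautious": [1, 1, 0, 0, 0] + [0] * 95,     # flags only 2 very clear cases
    "aggressive": [1] * 20 + [0] * 80,          # flags all 5 sick plus 15 healthy
}

results = {}
for name, y_pred in models.items():
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    results[name] = {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision(tp, fp),
        "recall": recall(tp, fn),
    }
    print(name, results[name])
```

On this made-up data, the always-healthy model reaches 95% accuracy with a recall of 0, the cautious model has perfect precision but catches only 2 of the 5 sick patients, and the aggressive model catches all 5 at the price of 15 false alarms. Returning `None` when a ratio has a zero denominator is a design choice of this sketch; other tools may report 0 or raise a warning instead.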

Choosing the Right Metric

Which metric matters most depends on the stakes of the task.

For a spam filter, a false positive means a legitimate email gets blocked, which is annoying but manageable. A false negative means spam reaches the inbox, which is also annoying but bearable. Giving roughly equal weight to precision and recall is appropriate.

For cancer screening, a false negative means a real cancer is missed, which is potentially fatal. A false positive means a healthy person gets extra tests, which is worrying but survivable. Here, recall matters far more than precision. You would rather over-screen than miss a case.

For an automated loan approval system, a false positive (approving a loan for someone who cannot repay it) costs the lender money. A false negative (rejecting a creditworthy applicant) costs the customer an opportunity. The trade-off has both financial and fairness implications.

This is why responsible AI development requires asking not just 'what is the accuracy?' but 'what kinds of mistakes is this model making, and what are the consequences?'
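One rough way to connect mistakes to consequences is to put a cost on each kind of error and add them up. The sketch below does that for two imagined models. The function name, the error counts, and especially the cost numbers are invented for illustration; real costs are hard to estimate and are rarely a single number.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a cost per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical error counts for two models evaluated on the same data.
cautious = {"fp": 0, "fn": 3}      # high precision, misses some cases
aggressive = {"fp": 15, "fn": 0}   # high recall, many false alarms

# Hypothetical cost settings, in arbitrary units.
spam_like = {"cost_fp": 1, "cost_fn": 1}         # both mistakes are about equally bad
screening_like = {"cost_fp": 1, "cost_fn": 100}  # a missed case is far worse

for setting_name, costs in [("spam-like", spam_like), ("screening-like", screening_like)]:
    cautious_cost = total_error_cost(**cautious, **costs)
    aggressive_cost = total_error_cost(**aggressive, **costs)
    print(setting_name, "cautious:", cautious_cost, "aggressive:", aggressive_cost)
```

With equal costs the cautious model comes out cheaper (3 versus 15), but once a missed case is assumed to cost 100 times more than a false alarm, the aggressive model wins (15 versus 300). The same two models rank differently depending on the stakes, which is the point of the examples above.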

Accuracy Can Lie

On imbalanced datasets, a model with high accuracy can be completely useless. Always check what the majority-class baseline accuracy would be: if always predicting the majority class gives 95% accuracy, a model at 96% is barely better than that trivial baseline. Ask about precision, recall, and what kinds of errors the model actually makes.
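The baseline check described here takes only a few lines. In this sketch the function name `majority_baseline_accuracy`, the label list, and the 96% model score are all hypothetical values chosen to match the numbers in the text.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy of a 'model' that always predicts the most common label."""
    if len(y_true) == 0:
        raise ValueError("cannot compute a baseline on an empty dataset")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical dataset: 95 healthy patients, 5 sick.
labels = ["healthy"] * 95 + ["sick"] * 5

baseline = majority_baseline_accuracy(labels)
model_accuracy = 0.96  # hypothetical reported accuracy of a trained model
improvement = model_accuracy - baseline

print(f"baseline accuracy: {baseline:.2f}")
print(f"model accuracy:    {model_accuracy:.2f}")
print(f"improvement:       {improvement:.2f}")
```

An improvement of about one percentage point over a model that has learned nothing is a signal to look at precision and recall before trusting the headline number.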

Fill in the blanks with the correct metric names.

The proportion of all predictions that were correct is called ____. The proportion of positive predictions that were actually correct is called ____. The proportion of actual positives the model caught is called ____.

A disease-screening model has 98% accuracy on a dataset where 98% of patients are healthy. Why is this result unimpressive?

For a wildfire detection system that should never miss a real fire, which metric is most critical?

Design Your Own Evaluation

Pick one of these machine learning tasks: (a) detecting whether food has spoiled, (b) identifying whether a student needs extra help in math, (c) deciding whether to flag a social media post for review.

  1. Define what a 'positive' prediction means for your chosen task.
  2. Describe what a false positive would look like in practice. What is the consequence?
  3. Describe what a false negative would look like in practice. What is the consequence?
  4. Based on those consequences, write a recommendation: should this system prioritize precision, recall, or balance them equally? Explain your reasoning.

This kind of thinking, connecting metrics to real-world consequences, is what separates thoughtful AI development from naive 'just maximize accuracy' engineering.