AI Safety, Alignment & Ethics


Interpretability: Opening the Box

If opacity is the disease, interpretability research is the medicine — or at least the attempt at medicine. Researchers have developed a growing toolkit of techniques designed to illuminate what happens inside a model: which inputs mattered, what concepts the model encoded, where in the network a capability lives, and when the model is likely to be wrong. This lesson surveys the major categories of that toolkit, what each can and cannot tell you, and why interpretability matters for AI safety.

Feature Attribution: What Drove This Prediction?

The most practical interpretability question is local: for this particular input, which features most influenced the output? Feature attribution methods assign a score, sometimes called an importance weight, to each input feature.

LIME (Local Interpretable Model-agnostic Explanations) answers this by perturbing an input slightly in many ways, observing how the output changes, and fitting a simple linear model to those perturbed samples, weighted by how close each one is to the original input. The linear model's coefficients become the feature importances. LIME is model-agnostic: it treats the underlying model as a black box and only observes input-output pairs.

SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. It computes how much each feature contributed to pushing the prediction away from the average prediction, using Shapley values: a feature's attribution is its marginal contribution, averaged over all the orders in which features could be added. Shapley values satisfy several mathematical fairness axioms (for example, the attributions sum exactly to the gap between the prediction and the average prediction), which makes SHAP's attributions more theoretically principled than LIME's.

Gradient-based methods, such as integrated gradients, GradCAM, and Guided Backpropagation, take a different approach: they propagate signal backward through the network to see which input dimensions the output is most sensitive to. These methods require access to the model's internals (its gradients), so they are not model-agnostic.
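To make the Shapley-value idea behind SHAP concrete, here is a minimal sketch that computes exact Shapley values for a three-feature toy model by averaging each feature's marginal contribution over every ordering. The model `toy_model`, the input, and the all-zeros baseline (standing in for "feature absent") are assumptions chosen for illustration. Practical SHAP implementations typically rely on approximations or model-specific algorithms, because this brute-force loop grows factorially with the number of features.

```python
from itertools import permutations

def toy_model(x):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature "absent"
        previous_output = model(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its real value
            new_output = model(current)
            phi[i] += new_output - previous_output  # marginal contribution
            previous_output = new_output
    return [total / len(orderings) for total in phi]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]            # assumed reference input
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

In this toy case the additive term is credited entirely to feature 0 and the interaction `x[1] * x[2]` is split evenly between features 1 and 2, giving `[2.0, 3.0, 3.0]`. The attributions sum to 8.0, the gap between the model's output on `x` and on the baseline.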

Saliency Maps: Where Did the Model Look?

For image models, gradient-based attribution produces a saliency map: a heatmap overlaid on the image showing which pixels most influenced the prediction. In a tumor classifier, for example, a saliency map that highlights the tumor itself is reassuring; one that highlights the ruler or the hospital watermark reveals a shortcut. Saliency maps have become standard diagnostic tools in medical AI, though they come with limitations: they show how sensitive the output is to small changes in each pixel, not whether that pixel causally contributed to the decision.
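A rough sketch of the saliency idea, under simplifying assumptions: the "classifier" below is a hypothetical logistic model over a flattened 3x3 image, and gradients are approximated with central finite differences rather than real backpropagation. The weights are invented so that the model leans on the top-left pixel, which plays the role of a watermark shortcut.

```python
import math

# Hypothetical weights over a 3x3 image, flattened row by row.
# Index 0 (top-left corner) stands in for a watermark; index 4 is the center.
WEIGHTS = [3.0, 0.0, 0.0,
           0.0, 1.0, 0.0,
           0.0, 0.0, 0.0]

def score(pixels):
    # Logistic output of a weighted sum of pixel intensities.
    z = sum(w * p for w, p in zip(WEIGHTS, pixels))
    return 1.0 / (1.0 + math.exp(-z))

def saliency(model, pixels, eps=1e-5):
    """Absolute sensitivity of the output to each pixel (central differences)."""
    sensitivities = []
    for i in range(len(pixels)):
        up = list(pixels)
        up[i] += eps
        down = list(pixels)
        down[i] -= eps
        sensitivities.append(abs(model(up) - model(down)) / (2 * eps))
    return sensitivities

image = [0.2, 0.5, 0.1,
         0.4, 0.9, 0.3,
         0.0, 0.6, 0.7]
sal = saliency(score, image)
top_pixel = max(range(len(sal)), key=lambda i: sal[i])
print(top_pixel, [round(s, 3) for s in sal])
```

The brightest entry in `sal` is the corner pixel rather than the center, which is the kind of pattern that would prompt a closer look for a shortcut. Note that this is still only a sensitivity measure, with the limitation described above.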

Probing and Concept Activation Vectors

Feature attribution tells you about inputs. Probing tells you about the model's internal representations: the activations at each layer. In a probing experiment, a researcher extracts the hidden-layer activations of a trained model and asks: does any linear transformation of these activations predict some interpretable property? For example, do a language model's 12th-layer activations encode grammatical subject-verb agreement? If a simple linear classifier trained on those activations predicts agreement with 95% accuracy, we have evidence that the model has encoded this grammatical concept internally, even though it was never explicitly trained to do so.

Testing with Concept Activation Vectors (TCAV) extends this idea to image classifiers. A researcher defines a concept, say 'striped texture', using a set of example images. TCAV finds the linear direction in the model's activation space corresponding to that concept and measures how much predictions in a class (e.g., 'zebra') depend on that direction. High TCAV scores for 'striped' in the 'zebra' class indicate that the model uses stripes as a key feature, a sanity check that builds confidence.

Mechanistic interpretability goes further: instead of finding what concepts a layer encodes, it attempts to reverse-engineer the entire computational algorithm the model implements. Researchers at Anthropic and elsewhere have identified individual attention heads and circuits within transformers that perform specific operations, such as induction heads that copy repeated sequences and name-retrieval circuits that look up facts about entities. This line of work aspires to provide a full account of a model's computation.
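The probing recipe can be sketched in a few lines. Everything below is synthetic: `fake_activation` is a hypothetical stand-in for real hidden-layer activations, built so that a binary property is encoded along one assumed direction, and the probe is a plain logistic-regression classifier trained by gradient descent. With a real model, the activations would be extracted from a chosen layer instead.

```python
import math
import random

random.seed(0)
DIM = 4
CONCEPT_DIRECTION = [1.0, -1.0, 0.0, 0.0]  # assumed; defines the synthetic data only

def fake_activation(label):
    # Gaussian noise, shifted along the concept direction by -2 or +2.
    shift = 2.0 * (2 * label - 1)
    return [random.gauss(0.0, 1.0) + shift * d for d in CONCEPT_DIRECTION]

def make_dataset(n):
    labels = [random.randint(0, 1) for _ in range(n)]
    return [fake_activation(y) for y in labels], labels

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(xs, ys, epochs=300, lr=0.1):
    """Fit a linear probe (logistic regression) with batch gradient descent."""
    w, b, n = [0.0] * DIM, 0.0, len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(DIM):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, xs, ys):
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

train_x, train_y = make_dataset(200)
test_x, test_y = make_dataset(200)   # held-out data for evaluating the probe
w, b = train_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"probe accuracy on held-out activations: {probe_accuracy:.2f}")
```

High held-out accuracy here shows only that the property is linearly decodable from the activations, which is by construction in this synthetic setup.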

Match each interpretability technique to its defining characteristic.

Terms

LIME
SHAP
GradCAM
Probing classifier
TCAV

Definitions

Backpropagates gradients through a CNN to produce a pixel-level heatmap
Uses Shapley values from game theory to fairly distribute prediction credit
Tests whether a linear model can predict an interpretable property from hidden activations
Measures how much a human-defined concept influences a class prediction
Fits a simple model to local perturbations to approximate feature importance


Limits and Failure Modes of Interpretability Tools

Every interpretability method has known failure modes, and safety-conscious practitioners must know them.

Post-hoc explanations can be unfaithful. Research by Adebayo et al. showed that some saliency map methods produce visually plausible heatmaps even when the model's weights are randomized, meaning the maps look like they're highlighting meaningful features but contain no information about what the model actually computed. If an explanation looks reasonable regardless of what the model did, it is not a reliable window into the model.

SHAP and LIME can disagree substantially on the same model and input. Because they use different approximation strategies, they can produce contradictory attributions. There is no single ground truth about feature importance, and different tools embody different assumptions.

Probing measures correlation, not causation. A probing classifier showing that a layer encodes some concept does not prove the model uses that concept in its computation, only that the information is present. The model might use an entirely different path.

Interpretability tools can be gamed. Models can be constructed to produce friendly-looking explanations while still making decisions on hidden discriminatory features. A technically competent adversary can make a biased model appear to explain itself fairly.
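The weight-randomization finding suggests a simple check that can be sketched in code: compute an explanation for a model, replace the model's weights with random values, recompute, and compare. The sketch below is a toy illustration under stated assumptions, not the procedure from the cited research. It uses a hypothetical linear model, a finite-difference gradient map, and `input_only_map`, a deliberately model-independent stand-in for an explanation method that fails the check.

```python
import math
import random

rng = random.Random(1)

def make_model(weights):
    # Hypothetical linear scorer over a flattened input.
    return lambda x: sum(w * xi for w, xi in zip(weights, x))

def gradient_map(model, x, eps=1e-4):
    # Finite-difference sensitivity; depends on the model's weights.
    out = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        out.append(abs(model(up) - model(down)) / (2 * eps))
    return out

def input_only_map(model, x):
    # Ignores the model entirely and highlights unusual input values.
    mean = sum(x) / len(x)
    return [abs(xi - mean) for xi in x]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

x = [rng.gauss(0.0, 1.0) for _ in range(16)]
trained = make_model([5.0, 4.0] + [0.0] * 14)                 # assumed "trained" weights
randomized = make_model([rng.gauss(0.0, 1.0) for _ in range(16)])

sim_gradient = pearson(gradient_map(trained, x), gradient_map(randomized, x))
sim_input_only = pearson(input_only_map(trained, x), input_only_map(randomized, x))
print(f"gradient map similarity after randomization:   {sim_gradient:.3f}")
print(f"input-only map similarity after randomization: {sim_input_only:.3f}")
```

A map that stays essentially unchanged after randomization, as `input_only_map` does here, cannot be telling you anything about what the model learned.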

Explanations Can Create False Confidence

One of the subtler dangers of interpretability tools is that they make humans feel confident in a model's behavior even when the explanation is incomplete or wrong. A radiologist shown a saliency map highlighting tumor tissue may trust a model's diagnosis far more than warranted — because the explanation looks right, they stop scrutinizing. Explanations must be treated as hypotheses to verify, not certificates of correctness.

A researcher trains a probing classifier on layer 15 of a large language model and finds it predicts whether the next word is a noun with 91% accuracy. What does this tell us?

Research found that some saliency map methods produce nearly identical heatmaps even when model weights are replaced with random values. This finding implies:

Evaluate a Real Explanation

  1. Access a public AI demo that provides explanations. Many text sentiment classifiers, image classifiers, and SHAP-based tabular models are available online through tools like Hugging Face or Google's What-If Tool.
  2. Submit five different inputs and record the explanation the system provides for each.
  3. For two of the explanations, try to generate a modified input that changes the attribution without changing the prediction. For example: if the system says the word 'terrible' drove a negative sentiment prediction, remove that word and see whether the sentiment changes or the model finds another driver.
  4. Form a hypothesis about whether the explanation is faithful to the model's actual behavior based on your experiments.
  5. Write a one-paragraph evaluation: 'Based on my experiments, I believe the explanations provided by this system are / are not trustworthy because...'
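If the demo you pick can be called from code, the word-removal experiment in the exercise can be scripted. The sketch below assumes a hypothetical lexicon-based `classify` function in place of the real system, and it assumes the system's explanation named 'terrible' as the top driver. Swap in the actual model call and the actual reported word when you run your own test.

```python
# Hypothetical stand-in for the demo's sentiment model.
LEXICON = {"terrible": -2.0, "awful": -1.5, "great": 2.0, "friendly": 1.0}

def classify(text):
    score = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    label = "negative" if score < 0 else "positive"
    return label, score

def deletion_test(text, word):
    """Remove the word the explanation blamed and re-run the model."""
    before_label, before_score = classify(text)
    reduced = " ".join(w for w in text.split() if w.lower() != word)
    after_label, after_score = classify(reduced)
    return {
        "reduced_text": reduced,
        "label_flipped": before_label != after_label,
        "score_change": after_score - before_score,
    }

# Case A: 'terrible' is the only negative driver, so removing it flips the label.
result_a = deletion_test("the food was terrible but the staff were friendly", "terrible")
# Case B: the model finds another driver ('awful'), so the label stays negative.
result_b = deletion_test("the food was terrible and the service was awful", "terrible")
print(result_a)
print(result_b)
```

Case B is the informative one for the exercise: the prediction survives removal of the attributed word, so the explanation by itself did not account for everything driving the output.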