AI Safety, Alignment & Ethics

⏱ About 20 min · 20 XP

Monitoring Deployed AI

Deploying an AI system is not the end of the engineering process — it is the beginning of a new and more consequential phase. In production, a model encounters the full messy complexity of the real world: edge cases developers never imagined, inputs from populations not represented in training data, behaviors that emerge at scale, and gradual drift as the world changes around a static model. Monitoring is the practice of continuously watching deployed systems so that failures are caught early, causes are understood, and corrective action can be taken before harm accumulates.

What to Monitor and Why

Effective monitoring requires choosing what to measure. The right signals depend on the system, but several categories apply broadly.

Prediction distribution monitoring tracks the statistical properties of model outputs over time. If a classifier that normally outputs 30% positive predictions suddenly outputs 60% positive predictions without a corresponding change in the real world, something has changed: possibly the input distribution, possibly the model, possibly both. Sudden shifts in output distributions often indicate data pipeline failures, sensor malfunctions, or unexpected population changes.

Input distribution monitoring tracks the statistical properties of inputs arriving at the model. If the feature distribution of incoming data diverges significantly from the training distribution (measured using techniques like population stability index, maximum mean discrepancy, or KL divergence), this signals possible covariate shift before performance degrades.

Ground-truth performance monitoring compares predictions to actual outcomes when ground truth is eventually available. For a loan default model, the ground truth (whether the borrower defaulted) arrives months after the prediction. Systems should continuously backfill ground truth and compute updated accuracy, precision, recall, and fairness metrics on recent data.

User behavior signals can serve as proxies for model quality when ground truth is delayed or unavailable. If users of a recommendation system are clicking on fewer and fewer recommendations over time, this may indicate the model is drifting from their preferences. If a medical AI's flagged cases are being overridden by physicians at increasing rates, physicians may be detecting degraded quality before metrics do.
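As a rough illustration of prediction distribution monitoring, the sketch below compares the positive-prediction rate in a recent window against a baseline rate, using the 30% and 60% figures from the example above. The function name, the z-score approach, and the threshold of 3 standard errors are illustrative assumptions, not a prescribed method.

```python
import math

def positive_rate_shift(baseline_rate, window_predictions, z_threshold=3.0):
    """Flag a window whose positive-prediction rate departs from the baseline.

    baseline_rate: expected fraction of positive predictions (e.g. 0.30).
    window_predictions: iterable of 0/1 model outputs from a recent window.
    z_threshold: hypothetical alert threshold, in standard errors.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    preds = list(window_predictions)
    n = len(preds)
    if n == 0:
        raise ValueError("window is empty")
    observed = sum(preds) / n
    # Standard error of a proportion if the baseline rate still held.
    std_err = math.sqrt(baseline_rate * (1 - baseline_rate) / n)
    z = (observed - baseline_rate) / std_err
    return {"observed_rate": observed, "z_score": z, "alert": abs(z) > z_threshold}

# A window of 1,000 predictions with 60% positives against a 30% baseline.
shifted = positive_rate_shift(0.30, [1] * 600 + [0] * 400)
# A window that stays close to the baseline.
stable = positive_rate_shift(0.30, [1] * 310 + [0] * 690)
print(shifted["alert"], stable["alert"])
```

An alert here only says the output rate moved; it does not say whether the inputs, the pipeline, or the real world changed, so it is a prompt for investigation rather than a diagnosis.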

The Label Delay Problem

Many high-stakes ML systems make predictions whose outcomes cannot be verified immediately. A credit model predicts default, but default happens months later. A medical model diagnoses disease, but biopsy results come days later. A recidivism model predicts reoffending, but that outcome unfolds over years. This label delay means that by the time ground-truth performance degradation is detected, thousands of potentially wrong predictions have already been acted upon. Proxy metrics and input distribution monitoring are critical precisely because they can signal problems before ground truth arrives.
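One way to make label delay visible is to report accuracy only over predictions old enough to have an outcome, alongside a count of predictions still pending. The sketch below assumes a hypothetical 90-day maturity window and simple in-memory dictionaries; a real system would read from its own prediction and outcome stores.

```python
from datetime import date, timedelta

def matured_accuracy(predictions, outcomes, today, maturity_days):
    """Accuracy over predictions old enough for their outcome to be known.

    predictions: dict of id -> (prediction_date, predicted_label)
    outcomes: dict of id -> actual_label, backfilled as ground truth arrives
    maturity_days: hypothetical delay after which an outcome is expected
    """
    cutoff = today - timedelta(days=maturity_days)
    matured = {k: v for k, v in predictions.items() if v[0] <= cutoff}
    labeled = [(v[1], outcomes[k]) for k, v in matured.items() if k in outcomes]
    correct = sum(1 for predicted, actual in labeled if predicted == actual)
    return {
        "matured": len(matured),
        "labeled": len(labeled),
        "pending": len(predictions) - len(matured),
        "accuracy": correct / len(labeled) if labeled else None,
    }

# Illustrative data: three old predictions with outcomes, one recent one without.
predictions = {
    "a": (date(2024, 1, 10), 0),
    "b": (date(2024, 1, 20), 1),
    "c": (date(2024, 2, 5), 0),
    "d": (date(2024, 6, 15), 0),
}
outcomes = {"a": 0, "b": 0, "c": 0}
report = matured_accuracy(predictions, outcomes, date(2024, 7, 1), maturity_days=90)
print(report)
```

Reporting the pending count next to the accuracy figure keeps the blind spot explicit: the metric says nothing yet about the most recent predictions.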

Drift Detection Methods

Automatically detecting distribution drift requires statistical tools.

Population Stability Index (PSI) compares the distribution of a variable in training versus deployment data. As a common rule of thumb, values below 0.1 suggest minimal drift, values between 0.1 and 0.25 suggest moderate drift worth watching, and values above 0.25 indicate significant drift requiring investigation. PSI is widely used in financial modeling and credit risk.

The two-sample Kolmogorov-Smirnov (KS) test is a non-parametric test that computes the maximum difference between two empirical cumulative distribution functions. It can detect shifts in a continuous one-dimensional distribution without assuming a specific shape. Applied to successive windows of each incoming feature, KS tests can flag drift in near real time.

The Page-Hinkley test and CUSUM (Cumulative Sum) are sequential change-detection algorithms designed for streaming data. Rather than comparing two batches, they continuously accumulate signal and raise an alarm when the cumulative signal exceeds a threshold. These methods detect drift with minimal latency and are appropriate for systems processing high-throughput data streams.

When drift is detected, the response depends on its severity. Minor drift may trigger a flag and increased scrutiny without immediate action. Significant drift should trigger model revalidation on recent data. Severe drift should trigger model rollback (reverting to a previous version) or model suspension pending investigation.
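To make the PSI thresholds concrete, here is a minimal sketch of the usual PSI formula, the sum over bins of (actual − expected) × ln(actual / expected). The equal-width binning, the `eps` floor for empty bins, and the band label "moderate" for values between the two quoted thresholds are assumptions made for illustration; quantile-based bins are also common.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample.

    Bin edges are equal-width over the training range (an assumption).
    eps keeps empty bins from producing log(0) or division by zero.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def psi_band(psi_value):
    """Map a PSI value to bands around the 0.1 and 0.25 thresholds."""
    if psi_value < 0.1:
        return "minimal"
    if psi_value <= 0.25:
        return "moderate"  # label for the middle band is an assumption
    return "significant"

train = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
live_shifted = [x + 3 for x in train]    # same shape, shifted upward by 3

print(psi_band(psi(train, train)))
print(psi_band(psi(train, live_shifted)))
```

The band returned here could feed the severity-based response described above, though how PSI bands map to flagging, revalidation, or rollback is a policy decision for each system.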

Match each monitoring signal type to what it is designed to detect.

Terms

Prediction distribution shift
Population Stability Index
Ground-truth performance monitoring
User override rate
CUSUM sequential test

Definitions

Sudden change in model output frequencies suggesting input or pipeline problems
Proxy signal for model quality when ground truth is delayed or unavailable
Accuracy and fairness metrics computed on recent data once outcomes are known
Accumulates streaming signal to detect change-points in real-time data
Quantifies how much the distribution of an input feature has shifted since training

Feedback Loops and Performativity

A subtle and dangerous property of deployed AI systems is that their predictions change the world, and the changed world becomes the next round of training data. This creates feedback loops, some benign and some catastrophic.

A recommendation system that amplifies popular content makes that content more popular, which causes the system to recommend it even more. Over time, recommendation diversity collapses as the system optimizes for engagement on an ever-narrower set of items.

A predictive policing system predicts high crime in certain areas. Police are deployed there, leading to more arrests. Those arrests appear as crime data, validating the prediction and increasing future deployment. The system is self-fulfilling: it does not just predict recorded crime rates, it influences them.

A fraud detection system flags accounts as suspicious. Those accounts are suspended. Suspension prevents them from making purchases, which, from the system's perspective, looks like they have stopped fraudulent behavior, validating the flagging. The system cannot learn from its mistakes because its predictions alter the data it would need to evaluate them.

Monitoring for feedback loops requires tracking not just model performance but the causal structure of the system's effect on the world. This is among the hardest monitoring problems in practice.
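For the recommendation example, one simple symptom to track is the diversity of what the system actually shows. The sketch below computes the Shannon entropy of recommended items per time window; the item names and weekly windows are made up for illustration. A falling entropy is only a hint that a loop may be narrowing recommendations, not evidence of the causal mechanism.

```python
import math
from collections import Counter

def recommendation_entropy(recommended_items):
    """Shannon entropy (in bits) of the items shown in one time window."""
    counts = Counter(recommended_items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical logs: four items shown equally often, then one item dominating.
week_1 = ["a", "b", "c", "d"] * 25
week_8 = ["a"] * 85 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5

entropy_start = recommendation_entropy(week_1)
entropy_later = recommendation_entropy(week_8)
print(entropy_start, entropy_later)
```

Separating a genuine shift in user taste from a self-reinforcing loop usually needs more than this, for example a small randomized holdout that receives recommendations not driven by the model.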

Silent Failure Is the Most Dangerous Failure Mode

The most dangerous way a deployed AI system can fail is silently: it continues to produce confident outputs while its accuracy has degraded, with no signal that anything is wrong. Silent failure happens when no monitoring is in place, when monitoring metrics do not capture the relevant failure mode, or when drift is gradual enough that no single step triggers an alarm. Building systems that fail loudly (expressing uncertainty, flagging anomalous inputs, declining to predict on out-of-distribution cases) is a key design goal for AI that is safe to deploy.
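Gradual drift that never trips a per-step alarm is the case sequential methods such as CUSUM are meant for. The sketch below is a minimal one-sided CUSUM for upward shifts in a monitored metric; the `target`, `slack`, and `threshold` values are hypothetical tuning parameters chosen for the example, not recommended settings.

```python
def cusum_alarm(values, target, slack, threshold):
    """One-sided CUSUM: index of the first alarm for an upward shift, else None.

    target: expected in-control mean of the monitored metric.
    slack: allowance subtracted each step so ordinary noise does not accumulate.
    threshold: cumulative excess that triggers the alarm.
    """
    cumulative = 0.0
    for i, x in enumerate(values):
        # Accumulate only the excess above target + slack; never go below zero.
        cumulative = max(0.0, cumulative + (x - target - slack))
        if cumulative > threshold:
            return i
    return None

# Daily error rate creeping up by 0.2 percentage points per day from 5%.
error_rates = [0.05 + 0.002 * day for day in range(30)]

# Largest day-over-day change: far too small for a per-step jump alarm.
max_step = max(b - a for a, b in zip(error_rates, error_rates[1:]))

alarm_day = cusum_alarm(error_rates, target=0.05, slack=0.005, threshold=0.05)
stable_alarm = cusum_alarm([0.05] * 30, target=0.05, slack=0.005, threshold=0.05)
print(max_step, alarm_day, stable_alarm)
```

With these illustrative numbers the detector alarms on day 10, even though no single day's change exceeds a fraction of a percentage point, while a flat series never alarms.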

A credit scoring model was validated at 88% accuracy before deployment. Six months later, analysts notice it is approving far more applications than expected without a corresponding increase in business volume. The most likely explanation is:

A predictive policing model predicts high crime in neighborhoods where police subsequently increase patrols, leading to more arrests, which are fed back as training data. This is an example of:

Design a Monitoring Plan

You are responsible for monitoring a deployed AI system after launch. Choose one: a content recommendation engine, a medical triage assistant, a loan approval model, or a student performance prediction tool used to allocate tutoring resources.

  1. List four specific signals you would monitor, and for each, describe what normal looks like and what an anomaly would look like.
  2. For each signal, specify how frequently you would check it (real-time, daily, weekly, monthly) and why that cadence is appropriate.
  3. Identify one feedback loop the system could create and describe how you would detect it.
  4. Write an incident response procedure: what would you do if you detected significant drift? At what threshold would you pause the system? Who needs to be notified?
  5. Identify the single most critical gap in your monitoring plan: what failure mode might you still miss?