AI Foundations

⏱ About 20 min · 20 XP

Deployment and Monitoring

Training a model feels like finishing a project. It is not. A model sitting on a developer's laptop, evaluated only on historical test data, does nothing for anyone. Deployment is the act of putting a model into a system where it receives real-world inputs and produces outputs that affect real decisions. And deployment is not the end either — it is the beginning of a new set of responsibilities. The world changes. Users change. The data the model encounters after deployment will never be identical to the data it was trained on. Understanding deployment and monitoring is what separates a working ML system from an ML experiment that was once accurate.

Deployment Architectures

Models are deployed in many forms, depending on latency requirements, scale, and infrastructure.

A REST API is the most common pattern for general-purpose deployment. The trained model is wrapped in a web server; client applications send HTTP requests with input features and receive predictions in the response. This decouples the model from the applications that use it and allows the model to be updated independently. For lightweight models, latency is often in the tens of milliseconds, which is fast enough for most applications.

Batch scoring processes large volumes of inputs at scheduled intervals rather than one at a time. A bank might score all customer accounts overnight to generate risk flags for the next business day. Batch scoring can use more computation per prediction because time is not critical, and it is simpler to manage than a live API.

On-device deployment embeds the model directly on a user's phone, sensor, or edge hardware. The model runs locally and does not need a network connection. This is used for applications requiring extremely low latency (voice assistants, AR effects) or where user privacy prevents sending data to a server. The trade-off: on-device models must be smaller and simpler, because they run on constrained hardware.

Model serving infrastructure (frameworks like TensorFlow Serving and TorchServe, or managed platforms like AWS SageMaker) handles load balancing, versioning, and rollout of new model versions. Deploying a model in production at scale is a software engineering problem as much as an ML problem.
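To make the REST API pattern concrete, here is a minimal sketch using only Python's standard library. It is an illustration under stated assumptions, not a production setup: the "model" is a hypothetical hand-written linear score, and the feature names (`amount`, `num_prior_flags`), the `/predict` path, and the port are all invented for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear score.
# A real service would load a serialized model here instead.
WEIGHTS = {"amount": 0.002, "num_prior_flags": 0.3}
BIAS = -1.0


def predict(features):
    """Return a toy risk score clipped to the range [0, 1]."""
    score = BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)
    return min(1.0, max(0.0, score))


def handle_predict(body):
    """Turn a raw JSON request body into (HTTP status, response payload)."""
    try:
        features = json.loads(body)
        return 200, {"risk_score": predict(features)}
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing feature, or a non-numeric value.
        return 400, {"error": "expected a JSON object with keys: " + ", ".join(WEIGHTS)}


class PredictHandler(BaseHTTPRequestHandler):
    """Web-server wrapper: one POST request in, one prediction out."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server, after which a client can POST a JSON body such as `{"amount": 500, "num_prior_flags": 1}` to `/predict`. Keeping the request parsing in `handle_predict`, separate from the HTTP handler, is one way to make the prediction logic testable without starting a server, and it is the part that would stay the same if the model behind it were swapped out.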

The Model Is Not the System

A deployed ML system includes the model, the data pipeline feeding it inputs, the pre- and post-processing code, the serving infrastructure, and the downstream decisions that act on its outputs. Failures can occur in any layer. Evaluating only the model in isolation misses the other points of failure that matter in production.

Data Drift and Model Decay

A model trained on data from one period is a snapshot of the world at that time. The world does not hold still. Data drift is the phenomenon where the statistical distribution of real-world inputs shifts away from the training distribution. When drift is large enough, model performance degrades, sometimes gradually, sometimes suddenly.

There are two flavors of drift to understand. Feature drift (or covariate shift) occurs when the distribution of inputs changes, even if the relationship between inputs and outputs stays the same. A model trained on pre-pandemic spending patterns will encounter very different spending patterns post-pandemic. Concept drift is more severe: the underlying relationship between inputs and outputs changes. A model predicting whether a news headline is clickbait must adapt as the definition of clickbait itself evolves with reader expectations.

Real example: during the early months of COVID-19 in 2020, credit card fraud detection models trained on pre-pandemic behavior failed at elevated rates. The sudden, dramatic change in where people shopped, when, and for what fell outside the models' training distributions. Banks that detected this through monitoring reacted within weeks; banks without monitoring discovered the problem only when fraud losses spiked.

Model decay is the gradual degradation of performance over time due to drift. Even a slowly drifting world eventually makes an unmonitored model unreliable.
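The difference between the two kinds of drift can be seen in a small simulation. The sketch below uses entirely synthetic data and a hypothetical one-threshold "model"; the numbers (thresholds of 50, 65, and 90, input means of 40 and 85) are arbitrary choices for illustration. In the feature drift case the labeling rule is untouched and only the inputs move into a region the model never saw; in the concept drift case the inputs look the same as at training time but the rule itself changes.

```python
import random


def model(x):
    """Frozen 'trained' model: it learned a single threshold from training-era data."""
    return x > 50


def true_label(x, lower=50.0, upper=90.0):
    """Ground truth: positive only between lower and upper.

    Training-era inputs almost never reached `upper`, so the model
    never learned that second boundary.
    """
    return lower < x < upper


def accuracy(input_mean, lower=50.0, n=5000, seed=0):
    """Fraction of synthetic inputs on which the frozen model matches the truth."""
    rng = random.Random(seed)
    xs = [rng.gauss(input_mean, 10.0) for _ in range(n)]
    hits = sum(model(x) == true_label(x, lower=lower) for x in xs)
    return hits / n


# Training-like conditions: inputs centered at 40, rule unchanged.
baseline = accuracy(input_mean=40.0)

# Feature drift: inputs shift upward, the labeling rule stays the same.
feature_drift = accuracy(input_mean=85.0)

# Concept drift: same inputs as training, but the positive threshold moved to 65.
concept_drift = accuracy(input_mean=40.0, lower=65.0)

print(f"baseline accuracy:      {baseline:.3f}")
print(f"under feature drift:    {feature_drift:.3f}")
print(f"under concept drift:    {concept_drift:.3f}")
```

In both drifted scenarios the model itself is byte-for-byte identical to the one that scored well at baseline, which is why evaluating it once on historical test data says little about how it will behave later.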

Silent Failures

In most deployed systems, nobody will automatically tell you when a model starts performing poorly. Users may not know they are getting bad predictions. Downstream processes may not flag errors. Without active monitoring, model decay goes unnoticed until it causes a visible, often expensive, failure. Monitoring is not optional — it is the mechanism by which you learn that the pipeline needs to loop.

Monitoring in Practice

Effective ML monitoring tracks multiple signals simultaneously.

Input distribution monitoring watches the statistical properties of incoming features (means, variances, and distributions of categorical values) and alerts when they deviate significantly from training-time baselines. This can detect feature drift before it visibly degrades performance.

Output distribution monitoring tracks what the model is predicting: what fraction of inputs are classified as each class, or the distribution of predicted probabilities. A sudden shift in prediction distribution (e.g., the fraud model flags twice as many transactions as usual) is a strong signal that either the model or the inputs have changed.

Prediction quality monitoring is the most direct measure: compare model predictions to actual outcomes as those outcomes become available. For a loan default model, the outcome may not be known until up to 12 months after the prediction. Comparing predictions to outcomes on a rolling basis gives you an ongoing, though delayed, view of whether precision and recall are holding up.

Once monitoring detects a problem, the response depends on severity. Minor drift may require only recalibration. Significant concept drift requires collecting fresh labeled data and retraining. In extreme cases, the framing and features of the model must be reconsidered from the beginning, which amounts to a full restart of the pipeline loop.
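The three monitoring signals described above can each be reduced to a small check. The following is a minimal sketch using only the standard library; the function names, the 2-standard-deviation threshold, and the 2x ratio are assumptions chosen for illustration, not recommended defaults. Production systems often use more robust distribution tests than a simple mean comparison.

```python
from statistics import mean, stdev


def mean_shift_alert(baseline_values, recent_values, max_sigmas=2.0):
    """Input monitoring: alert when the recent mean of a feature moves more than
    `max_sigmas` training-baseline standard deviations from the baseline mean."""
    shift = abs(mean(recent_values) - mean(baseline_values)) / stdev(baseline_values)
    return shift > max_sigmas


def positive_rate_alert(baseline_rate, recent_predictions, max_ratio=2.0):
    """Output monitoring: alert when the share of positive predictions is at least
    `max_ratio` times the baseline rate, or has fallen to 1/max_ratio of it or less."""
    recent_rate = sum(recent_predictions) / len(recent_predictions)
    return recent_rate >= max_ratio * baseline_rate or recent_rate <= baseline_rate / max_ratio


def precision_recall(predictions, outcomes):
    """Prediction quality monitoring: precision and recall over one window of
    predictions whose true outcomes (0 or 1) have since become known."""
    pairs = list(zip(predictions, outcomes))
    tp = sum(1 for p, o in pairs if p and o)
    fp = sum(1 for p, o in pairs if p and not o)
    fn = sum(1 for p, o in pairs if not p and o)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative usage with made-up numbers.
training_amounts = [10, 12, 11, 9, 10, 11, 12, 9]
this_week_amounts = [15, 16, 14, 15]
print("input alert:", mean_shift_alert(training_amounts, this_week_amounts))

this_week_flags = [1] * 12 + [0] * 88
print("output alert:", positive_rate_alert(0.05, this_week_flags))

print("precision, recall:", precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Running `precision_recall` over successive windows (for example, each month's predictions once their outcomes are known) gives the rolling comparison described above. The input and output checks need no labels at all, which is why they can fire long before outcome data arrives.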

Match each deployment or monitoring concept to its description.

Terms

REST API deployment
Batch scoring
Feature drift
Concept drift
Early stopping

Definitions

Processing large volumes of inputs at scheduled intervals, not in real time
Model wrapped in a web server that responds to individual prediction requests
The distribution of input features shifts away from the training distribution
Halting model training when validation performance begins to worsen
The relationship between inputs and outputs itself changes over time


A recommendation model trained in January begins predicting poorly by March. Investigation reveals that users' content preferences shifted significantly after a major news event. What type of drift is this?

Why is on-device deployment preferable for a real-time voice assistant, even though it constrains model size?

Design a Monitoring Plan

You are responsible for a deployed model that predicts whether rental property listing prices are within 10% of fair market value, a tool used by city housing inspectors. Design a monitoring plan by answering each of the following:

  1. Input monitoring: Name three specific input features you would monitor, and describe the alert condition for each (e.g., 'alert if the mean of feature X shifts by more than 2 standard deviations from training baseline').
  2. Output monitoring: What output distribution shift would concern you? Write a specific alert condition.
  3. Outcome monitoring: How would you measure real prediction quality over time, given that 'true fair market value' is hard to know? Propose a proxy approach.
  4. Response plan: Describe in 3-4 sentences what you would do if your monitoring system alerted at 6 months post-deployment.
  5. Drift risk: What real-world events would most likely cause concept drift for this model? Name two.