Machine Learning & Deep Learning

Deploying and Monitoring Models

Training a model is only the beginning. A model that lives only on a researcher's laptop helps no one. Getting it into production — where it processes real user requests under real latency constraints, serves diverse inputs it was not trained on, and continues to behave correctly as the world changes — is an engineering discipline in its own right. The ML lifecycle does not end at a good validation loss; it loops continuously from deployment back to retraining.

From Trained Model to Production System

After training, a model exists as a collection of weight files and the code that defines its architecture. Deploying it means wrapping it in a service that accepts requests, runs inference, and returns results reliably, quickly, and at scale.

Serialization: the model is saved in a portable format. PyTorch models are commonly exported as TorchScript or as ONNX (Open Neural Network Exchange), a vendor-neutral format that multiple inference runtimes can load. TensorFlow uses the SavedModel format. The goal is a self-contained artifact that does not depend on the training codebase.

Inference optimization: training and inference have different requirements. Training processes large batches and must compute gradients; inference typically processes one request at a time (or small dynamic batches) under strict latency targets. Optimizations include:

  • Quantization: reducing weight precision from FP32 to INT8 (8-bit integers). Each weight shrinks from 4 bytes to 1, so the quantized weights are about 4x smaller, and inference is often 2-4x faster on hardware with INT8 support, usually with modest accuracy loss. Post-training quantization applies to an already-trained model without retraining.
  • Pruning: identifying weights close to zero and setting them exactly to zero, creating a sparse model. Sparse matrix operations can be accelerated on hardware and runtimes that support sparsity.
  • Kernel fusion: combining multiple operations (e.g., matrix multiply + bias add + activation) into a single GPU kernel, which avoids writing intermediate results to memory and reading them back.

Serving infrastructure: the inference service must handle concurrent requests. Common patterns include REST APIs (the model is a microservice receiving JSON requests), gRPC (a binary protocol with lower overhead), and streaming APIs (for generative models that produce tokens one at a time). Model servers such as NVIDIA Triton and TorchServe batch incoming requests dynamically, grouping several concurrent requests into one batch to improve GPU utilization.

Scaling and failover: production services use load balancers to distribute requests across multiple model replicas. Autoscaling adds replicas during traffic spikes and removes them during quiet periods. If one replica crashes, the load balancer routes traffic to the healthy replicas. Kubernetes is the most widely used orchestration system for managing these containerized services.
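To make the quantization step above concrete, here is a minimal sketch of symmetric per-tensor post-training quantization in plain Python. It is an illustration only: the function names and the example weights are hypothetical, and real toolchains typically add calibration data, per-channel scales, and zero points that are not shown here.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.45, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, which is where the roughly 4x size reduction comes from. The small weight 0.003 rounds to zero, a reminder that the accuracy cost depends on how the weight values are distributed.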

Inference Is Not Just Forward Propagation

In production, the inference path includes: input validation and preprocessing, batching, the model forward pass, output postprocessing, result logging, and latency measurement — all with error handling and fallback paths. The model weights are often the smallest part of the total system engineering.
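The stages listed above can be sketched as a single request handler. This is a simplified illustration under stated assumptions: `validate`, `handle_request`, `toy_model`, the three-feature input shape, and the fallback response are all hypothetical, and batching is left out to keep the example short.

```python
import logging
import time
from typing import Callable, Dict, List

logger = logging.getLogger("inference")

# Hypothetical safe default returned when anything in the path fails.
FALLBACK = {"label": "unknown", "score": 0.0, "fallback": True}


def validate(request: Dict) -> List[float]:
    """Input validation and preprocessing: reject malformed requests early."""
    features = request.get("features")
    if not isinstance(features, list) or len(features) != 3:
        raise ValueError("expected a list of 3 numeric features")
    for x in features:
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise ValueError("features must be numeric")
    return [float(x) for x in features]


def handle_request(request: Dict, model: Callable[[List[float]], float]) -> Dict:
    """Validate, run the forward pass, postprocess, log, and time the request."""
    start = time.perf_counter()
    try:
        features = validate(request)
        score = model(features)  # the forward pass
        # Postprocessing: turn the raw score into a response payload.
        result = {
            "label": "positive" if score >= 0.5 else "negative",
            "score": score,
            "fallback": False,
        }
    except Exception as exc:
        logger.warning("inference failed, using fallback: %s", exc)
        result = dict(FALLBACK)
    result["latency_ms"] = (time.perf_counter() - start) * 1000.0
    logger.info("request=%s result=%s", request, result)
    return result


def toy_model(features: List[float]) -> float:
    """Stand-in for a real model: the mean of the features, clipped to [0, 1]."""
    return min(1.0, max(0.0, sum(features) / len(features)))


ok = handle_request({"features": [0.9, 0.8, 0.7]}, toy_model)
bad = handle_request({"features": "not a list"}, toy_model)
print(ok["label"], bad["fallback"])
```

Only one line of the handler calls the model; the rest is validation, error handling, logging, and measurement.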

Data drift and model decay: A model trained at time T reflects the statistical properties of data at time T. The real world changes. Users change their language patterns. Product lines change. Seasonal effects shift. Adversaries adapt. When the input distribution shifts, this is called data drift; when the relationship between inputs and labels shifts, so that the same inputs now warrant different outputs, it is called concept drift. Either form degrades model accuracy over time.

Drift detection requires monitoring. The key observables are:

  • Input distribution monitoring: track statistical properties of incoming features, such as mean, variance, percentage of missing values, and the fraction of categorical values that were seen in training. Statistical tests (the Kolmogorov-Smirnov test for continuous features, the chi-squared test for categorical ones) can flag when the current distribution differs significantly from the training distribution. Tools such as Evidently and WhyLabs automate this.
  • Prediction distribution monitoring: track the distribution of model outputs. If a model that previously predicted 'positive' 30% of the time now predicts 'positive' 70% of the time, something has changed: either the input distribution has shifted or the model is behaving unexpectedly.
  • Outcome monitoring (ground truth feedback): when labels become available after prediction (a fraud label is confirmed days after the transaction, a medical diagnosis is confirmed after biopsy), you can compute real-world precision, recall, and F1 on production data. This is the most accurate measure of deployed performance, but it requires waiting for ground truth and connecting prediction logs to outcome systems.

Retraining and the ML loop: when drift is detected or performance degrades past a threshold, the model must be retrained on fresh data. Modern ML systems automate this loop: data ingestion feeds a feature store; triggers (scheduled retraining, drift alerts, performance SLA violations) initiate a new training run; the new model is evaluated in a shadow deployment (it receives a copy of production traffic alongside the production model, but its predictions are not returned to users); if it passes validation, a canary deployment gradually routes an increasing percentage of traffic to it; if metrics stay healthy, it becomes the new production model. This pipeline is called MLOps (Machine Learning Operations), analogous to DevOps for software.
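As an illustration of input distribution monitoring, the sketch below implements a two-sample Kolmogorov-Smirnov check in plain Python. It is a minimal version under stated assumptions: the function names, the synthetic data, and the 0.05 significance level are hypothetical choices, and the decision rule uses the standard large-sample approximation for the critical value rather than an exact p-value. In practice a statistics library or a monitoring tool would normally do this.

```python
import math
import random
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest vertical gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance both pointers past every value <= x so ties are handled.
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def drift_detected(reference: Sequence[float], current: Sequence[float],
                   alpha: float = 0.05) -> bool:
    """Flag drift when the KS statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Synthetic data for illustration: the "production" sample has a shifted mean.
random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [random.gauss(0.5, 1.0) for _ in range(2000)]

print(drift_detected(reference, shifted))
```

With large samples, a test like this flags even tiny, harmless shifts as statistically significant, so it is usually worth pairing it with a minimum effect size before raising an alert.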

Prompt Challenge

Write a monitoring alert specification for a deployed credit-risk model

Your prompt should…

  • Identify a specific observable metric that should be tracked and name its normal operating range
  • Specify the threshold or condition that should trigger an alert with a concrete number
  • Explain what action a human operator should take when the alert fires

The Pipeline as a Loop

A deployed model is not a finished product. It is a snapshot of knowledge at a point in time, embedded in a system that must continuously renew that knowledge. The stages of the loop are:

  1. Data collection: production traffic is logged, subject to privacy constraints and data retention policies.
  2. Labeling: new examples are labeled (automated where possible, human-annotated where necessary).
  3. Feature engineering: raw data is transformed into the features the model expects, using the same preprocessing code as training (enforced by a feature store to prevent train-serve skew).
  4. Training: the model is retrained on fresh data, often including a mix of old and new examples.
  5. Evaluation: the new model is benchmarked against the current production model on held-out recent data.
  6. Deployment: shadow testing, canary rollout, full promotion.
  7. Monitoring: the deployed model is monitored, and the loop returns to step 1.

Train-serve skew is a particularly insidious failure mode. It occurs when the preprocessing applied at inference time differs from what was applied during training. If training normalizes feature X by subtracting the mean computed from training data, but inference subtracts a different constant (because of a code mismatch or a stale value), the model receives inputs that do not match the distribution it was trained on, and accuracy degrades silently. Feature stores that serve both training and inference from the same code path prevent this.
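The train-serve skew described above can be shown with a few lines of Python. This is a toy sketch: `Standardizer`, the training values, and the stale mean of 25.0 are hypothetical, and a real feature store does far more than this. The point is only that the fitted parameters travel with the model as a saved artifact instead of being re-typed in serving code.

```python
import json
from dataclasses import asdict, dataclass
from statistics import fmean, pstdev
from typing import Sequence


@dataclass(frozen=True)
class Standardizer:
    """Normalization parameters fitted once and reused everywhere."""
    mean: float
    std: float

    @classmethod
    def fit(cls, values: Sequence[float]) -> "Standardizer":
        # Fall back to 1.0 so a constant feature does not divide by zero.
        return cls(mean=fmean(values), std=pstdev(values) or 1.0)

    def transform(self, x: float) -> float:
        return (x - self.mean) / self.std

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "Standardizer":
        return cls(**json.loads(payload))


# Training time: fit on training data and save the parameters as an artifact.
training_values = [20.0, 30.0, 40.0, 50.0]
fitted = Standardizer.fit(training_values)
artifact = fitted.to_json()

# Serving time, done right: load the same artifact.
serving = Standardizer.from_json(artifact)

# Serving time, done wrong: a hard-coded mean that has gone stale.
stale = Standardizer(mean=25.0, std=fitted.std)

x = 35.0
consistent = serving.transform(x)
skewed = stale.transform(x)
print(consistent, skewed)
```

Neither path raises an error. The skewed value is simply a different number, which is why this failure mode goes unnoticed without monitoring.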

Silent Failures Are the Worst Failures

A crashed server raises an alert immediately. A model that silently becomes less accurate over three months raises no alert — unless you built monitoring. The absence of errors is not evidence of correct behavior. Production ML systems must be monitored continuously; the model's performance is a system health metric, not a one-time benchmark result.
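One way to treat model performance as a health metric is to join delayed ground-truth labels to logged predictions and alert when precision or recall falls below a floor. The sketch below assumes predictions and labels are keyed by a shared request ID. The function names, the example data, and the thresholds are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def join_labels(predictions: Dict[str, bool],
                labels: Dict[str, bool]) -> List[Tuple[bool, bool]]:
    """Pair each available label with the prediction logged for the same ID."""
    return [(predictions[k], labels[k]) for k in labels if k in predictions]


def precision_recall(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute precision and recall from (predicted, actual) pairs."""
    tp = fp = fn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def check_health(predictions: Dict[str, bool], labels: Dict[str, bool],
                 min_precision: float, min_recall: float) -> List[str]:
    """Return one alert message per metric that is below its floor."""
    precision, recall = precision_recall(join_labels(predictions, labels))
    alerts = []
    if precision < min_precision:
        alerts.append(f"precision {precision:.2f} below {min_precision:.2f}")
    if recall < min_recall:
        alerts.append(f"recall {recall:.2f} below {min_recall:.2f}")
    return alerts


# Hypothetical logs: request "e" has no label yet, so it is ignored.
predictions = {"a": True, "b": True, "c": False, "d": False, "e": True}
labels = {"a": True, "b": False, "c": True, "d": False}

alerts = check_health(predictions, labels, min_precision=0.8, min_recall=0.4)
print(alerts)
```

A production version would also need a minimum sample size before alerting, since metrics computed from a handful of labels are noisy.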

A production model's input feature 'user_age' has a mean of 34 in training data. After six months in deployment, the monitored mean is 41. What is the most likely explanation and recommended action?

What is train-serve skew, and why is it dangerous?

Design a Monitoring Plan

  1. Imagine you have deployed a model that predicts whether a new social media post violates community guidelines. It processes 10 million posts per day.
  2. Identify three input features you would monitor for drift (e.g., average post length, fraction of posts in non-English languages, fraction containing URLs).
  3. For each, specify what a significant drift would look like (a concrete threshold or percent change).
  4. Ground-truth labels (human moderator decisions) are available for 0.1% of posts within 48 hours. How would you use these to monitor real-world precision and recall?
  5. A news event causes an unusual spike in posts about a new political topic. How should your monitoring system distinguish this from problematic drift?
  6. Write a one-paragraph retraining trigger policy: under what conditions would you initiate a retraining run?