Inference, Context, and Compute
Training a large language model is the most expensive step, but once trained, the model must be run millions or billions of times to serve users. This is called inference: the process of running a trained model on new input to produce output. Understanding inference is essential for understanding the economics, limitations, and design trade-offs of deployed AI systems. Inference is not simply the reverse of training. Training adjusts parameters; inference holds parameters fixed and propagates input through the model to produce an output. But in language models, inference has a specific structure, autoregressive generation, that creates important constraints and opportunities.
Autoregressive Generation
A language model does not generate an entire response in one forward pass. It generates one token at a time, appending each generated token to the context and then running another forward pass to generate the next token. This is called autoregressive generation, and it is an important architectural consequence of next-token prediction pretraining. Here is the process for generating a response to the prompt 'Explain photosynthesis in one sentence.'
- Step 1: The model runs a forward pass on the full prompt. Its output is a probability distribution over the vocabulary: maybe 'Photosynthesis' has probability 0.45, 'Plants' has probability 0.22, and so on. The model samples or selects the highest-probability token, say 'Photosynthesis.'
- Step 2: 'Photosynthesis' is appended to the context. The model runs another forward pass on 'Explain photosynthesis in one sentence. Photosynthesis' and produces a new distribution. The next token is selected, perhaps 'is.'
- Step 3: This repeats until the model generates an end-of-sequence token or reaches a length limit.
The key consequence: generating a response of length L tokens requires L sequential forward passes through the entire model. These passes cannot be fully parallelized because each depends on the previous token. This is fundamentally different from training, where many token predictions can be parallelized within a forward pass. Inference latency, the time from query to complete response, scales with response length and model size.
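The steps above can be sketched as a short loop. This is a minimal illustration, not a real model: `TOY_TABLE`, `forward_pass`, and `generate` are hypothetical names, the probabilities beyond the two quoted in the text are made up, and the toy "model" looks only at the last token, whereas a real language model conditions on the whole context. The point is the control flow: one forward pass per generated token, each depending on the previous one.

```python
EOS = "<eos>"

# Hypothetical next-token table standing in for a trained model. A real model
# conditions on the whole context; this toy looks only at the last token.
TOY_TABLE = {
    "sentence.": {"Photosynthesis": 0.45, "Plants": 0.22, "The": 0.33},
    "Photosynthesis": {"is": 0.7, "converts": 0.3},
    "is": {"how": 0.6, "the": 0.4},
    "how": {"plants": 0.9, "cells": 0.1},
    "plants": {"make": 0.8, EOS: 0.2},
    "make": {"sugar": 0.75, "food": 0.25},
    "sugar": {EOS: 0.9, "from": 0.1},
}


def forward_pass(context):
    """Stand-in for one full forward pass: returns P(next token | context)."""
    return TOY_TABLE.get(context[-1], {EOS: 1.0})


def generate(prompt_tokens, max_new_tokens=20):
    """Greedy autoregressive decoding; returns (generated tokens, forward passes)."""
    context = list(prompt_tokens)
    generated = []
    passes = 0
    while len(generated) < max_new_tokens:
        distribution = forward_pass(context)  # one sequential pass per step
        passes += 1
        next_token = max(distribution, key=distribution.get)  # greedy selection
        if next_token == EOS:
            break
        context.append(next_token)  # the new token becomes part of the input
        generated.append(next_token)
    return generated, passes


tokens, passes = generate("Explain photosynthesis in one sentence.".split())
print(" ".join(tokens), "|", passes, "forward passes")
```

With this table the loop emits six visible tokens and uses seven forward passes, the last one producing the end-of-sequence token. Replacing the greedy `max` with random sampling from the distribution would give the "samples" variant mentioned in Step 1.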
During training, many next-token prediction targets in a sequence can be computed in a single forward pass, enabling massive parallelism. During inference, each token must be generated sequentially because each new token depends on all previous tokens. This asymmetry means inference is inherently slower per token than training and cannot be parallelized the same way.
The context window is the maximum amount of text the model can process in one inference call: the prompt plus all tokens generated so far. Every token currently in context participates in the self-attention computation for each new token being generated. If the context holds a 68,000-token prompt and the model has generated 60,000 tokens of a response, the next token's generation requires computing attention over all 128,000 tokens currently in context. This is why long contexts are computationally expensive: the attention work for each new token grows with the number of tokens already in context, so the total attention cost of processing a sequence scales quadratically with its length. A response generated in a 128,000-token context is not just more expensive than one in a 4,000-token context because of the extra tokens generated. It is also more expensive per token, because each step of generation must attend over a longer context.
KV caching is the primary technique for managing this cost. During a forward pass, the model computes intermediate representations called keys (K) and values (V) for each token in the context. These are needed again in every subsequent forward pass on the same context. Rather than recomputing K and V from scratch for all previous tokens at every step, the model stores them in a cache and only computes them fresh for the new token being appended. KV caching dramatically reduces the computation per generated token but requires significant memory: depending on the architecture, a 128,000-token context for a large model may require tens to hundreds of gigabytes of KV cache.
Batch size is another inference variable. Rather than serving one user at a time, inference systems typically process requests from multiple users simultaneously (batching). Larger batches improve hardware utilization but can increase latency for each individual user. Inference optimization is an entire engineering discipline balancing throughput, latency, cost, and model accuracy.
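A rough back-of-the-envelope sketch can make the KV cache trade-off concrete. The function names and the model configuration below (80 layers, 128-dimensional heads, 2 bytes per stored value, 8 or 64 key/value heads) are illustrative assumptions, not the specification of any particular model. The first two functions count how many token positions need their keys and values computed while generating a response, with and without a cache; the third estimates the memory the cache occupies for one sequence.

```python
def kv_computations_without_cache(prompt_len, new_tokens):
    """Token positions whose K and V are computed if nothing is cached.

    Step i (0-based) re-processes the prompt plus the i tokens generated so far.
    """
    return sum(prompt_len + i for i in range(new_tokens))


def kv_computations_with_cache(prompt_len, new_tokens):
    """Token positions whose K and V are computed when earlier ones are cached.

    The prompt is processed once; each later step handles only the newest token.
    """
    return prompt_len + (new_tokens - 1)


def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate cache size for one sequence.

    The leading 2 accounts for storing one key and one value vector
    per token, per layer, per key/value head.
    """
    return 2 * context_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical configurations, chosen only for illustration.
few_kv_heads = kv_cache_bytes(128_000, n_layers=80, n_kv_heads=8, head_dim=128)
many_kv_heads = kv_cache_bytes(128_000, n_layers=80, n_kv_heads=64, head_dim=128)
print(f"{few_kv_heads / 1e9:.1f} GB vs {many_kv_heads / 1e9:.1f} GB")

print(kv_computations_without_cache(1_000, 100), "vs",
      kv_computations_with_cache(1_000, 100))
```

Under these assumptions the cache for a 128,000-token context comes to about 42 GB with 8 key/value heads and about 336 GB with 64, which is consistent with the "tens to hundreds of gigabytes" range above. The cache removes the recomputation of K and V, but each new token still attends over every cached position, so per-token attention work continues to grow with context length.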
Test-Time Compute and Reasoning Models
A significant research direction in 2024-2026 has been scaling inference-time compute rather than training-time compute. The traditional paradigm is: train a bigger model (more training compute) to get a better model. The emerging alternative: use more compute at inference time on a given model to get better answers.
The most prominent implementation of this idea is chain-of-thought reasoning, where the model is prompted or trained to reason step by step before producing a final answer. More compute at inference time means more tokens generated in the reasoning chain, which on many tasks (mathematics, formal logic, coding) produces substantially better final answers. OpenAI's o1 and o3 models, released in 2024-2025, were explicitly designed around this principle: they generate extended internal reasoning traces before producing their final answer, using considerably more inference compute per query than a standard model, in exchange for significantly better performance on complex reasoning tasks. Anthropic's extended thinking in Claude 3.7 Sonnet follows the same paradigm.
This creates a new design variable for deployers: how much inference compute should be spent per query? For simple questions, a short response from a small model is optimal. For complex mathematical proofs or multi-step planning tasks, spending 10x or 100x more inference compute on a larger reasoning model may be worth the cost. The economics of AI deployment are increasingly shaped by this trade-off between per-query compute cost and answer quality.
Training compute is spent once and produces a model, so its cost can be amortized across millions of users. Inference compute is spent per query, so its total cost grows with usage. Scaling training compute requires massive upfront investment but reduces per-query cost if the model is widely used. Scaling inference compute is pay-per-use: higher quality answers on demand at higher per-query cost, with no additional upfront training investment.
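The trade-off between upfront and per-query cost can be illustrated with simple arithmetic. This is a sketch under made-up numbers: the function names and the cost figures (in arbitrary compute units) are hypothetical and chosen only so the break-even point is easy to see.

```python
def total_cost(upfront_training_cost, inference_cost_per_query, n_queries):
    """Total compute spent: a one-time training cost plus a per-query cost."""
    return upfront_training_cost + inference_cost_per_query * n_queries


def break_even_queries(extra_training_cost, extra_inference_cost_per_query):
    """Query volume at which extra training spend equals extra inference spend."""
    return extra_training_cost / extra_inference_cost_per_query


# Hypothetical option A: more training compute, cheap queries.
# Hypothetical option B: less training compute, more inference compute per query.
a_training, a_per_query = 10_000_000, 1
b_training, b_per_query = 1_000_000, 10

volume = break_even_queries(a_training - b_training, b_per_query - a_per_query)
print(volume)
print(total_cost(a_training, a_per_query, volume),
      total_cost(b_training, b_per_query, volume))
```

With these numbers the two options cost the same at one million queries. Below that volume the inference-heavy option is cheaper overall; above it, the training-heavy option wins because its upfront cost is spread over more queries.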
A language model with a 128,000-token context window is used to process a 100,000-token legal document and generate a 2,000-token summary. A second document is 10,000 tokens and requires the same 2,000-token summary. Which task is more computationally expensive per generated token, and why?
A company is building an AI-powered tutoring system that must answer students' questions. Some questions are simple factual lookups; others are complex multi-step math problems. The company is deciding between Model A (small, fast, cheap) and Model B (large reasoning model, slow, expensive). What is the optimal deployment strategy?
Model an Inference Pipeline
- Design and analyze an inference pipeline for a specific application.
- Scenario: You are building a system that processes customer support tickets for a software company. Tickets arrive at a rate of 1,000 per hour. About 60% are simple questions answerable with a short factual response. About 30% require moderate reasoning about the software's behavior. About 10% are complex multi-step debugging problems.
- Step 1: Sketch a three-tier routing architecture. Define what each tier handles, which model size fits each tier, and what the typical context length and response length would be.
- Step 2: For each tier, estimate the relative compute cost per query. Use 1x as the baseline for the simple tier. If the complex tier uses a model 10 times larger and generates 5 times more tokens in a 5 times larger context, estimate its relative cost.
- Step 3: Calculate the weighted average compute cost per ticket for your three-tier system. Compare it to the cost of routing all tickets to the complex model.
- Step 4: What accuracy requirement would you need at the routing layer for the three-tier system to reliably outperform a single-model system? What happens if the router makes mistakes?
- Step 5: Discuss one case where a simple question might actually benefit from complex reasoning (for example, a question that looks simple but has a subtle edge case). How would you design the system to handle this?
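One possible way to set up the arithmetic for Steps 2 and 3 is sketched below. It is a starting point under stated assumptions, not the answer to the exercise. The cost model is a simplification: cost per generated token is assumed proportional to model size, and a hypothetical parameter `attention_share` (between 0 and 1) controls how much of that per-token cost also grows with context length. The moderate tier's scales (3x model, 2x tokens, 2x context) are invented for illustration; the 60/30/10 ticket mix and the complex tier's 10x/5x/5x scales come from the scenario.

```python
def relative_cost(model_scale, token_scale, context_scale, attention_share=0.0):
    """Cost of one query relative to the 1x simple tier.

    attention_share is the assumed fraction of per-token cost that scales
    with context length: 0.0 ignores context, 1.0 scales fully with it.
    """
    per_token = model_scale * ((1 - attention_share) + attention_share * context_scale)
    return per_token * token_scale


def weighted_cost(ticket_shares, tier_costs):
    """Average cost per ticket given the share of tickets sent to each tier."""
    return sum(ticket_shares[tier] * tier_costs[tier] for tier in ticket_shares)


ticket_shares = {"simple": 0.6, "moderate": 0.3, "complex": 0.1}  # from the scenario

tier_costs = {
    "simple": relative_cost(1, 1, 1),
    "moderate": relative_cost(3, 2, 2),  # hypothetical scales
    "complex": relative_cost(10, 5, 5),  # scales given in Step 2
}

tiered = weighted_cost(ticket_shares, tier_costs)
single_model = tier_costs["complex"]  # every ticket goes to the complex model
print(f"tiered: {tiered:.1f}x, single model: {single_model:.1f}x")

# Upper bound for the complex tier if per-token cost scaled fully with context.
print(relative_cost(10, 5, 5, attention_share=1.0))
```

With `attention_share` at 0 the complex tier comes out at 50x and the three-tier average at about 7.4x per ticket; with it at 1 the complex tier rises to 250x. The real figure depends on the model and context lengths involved, and the sketch assumes a perfect router, so misrouted tickets (Step 4) would need to be added as a separate term.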