AI Agents & Automation

The Context Window

Every transformer model has a context window — a fixed maximum number of tokens it can process in a single forward pass. Tokens are the basic units text is split into before a model reads it: roughly speaking, one token is about four characters of English, so 1,000 tokens is approximately 750 words. The context window sets an absolute ceiling on everything the model can see and reason about at once. It cannot attend to anything outside it.
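The rule of thumb above can be turned into a quick estimator. This is a minimal sketch: the two constants are the approximations quoted in the text, not properties of any particular tokenizer, and the function names are made up for illustration.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text, not an exact figure
WORDS_PER_TOKEN = 0.75   # equivalently, about 750 words per 1,000 tokens


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate from the character count, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_words_from_tokens(tokens: int) -> int:
    """Rough word estimate for a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


print(estimate_words_from_tokens(1_000))    # 750
print(estimate_words_from_tokens(128_000))  # 96000
print(estimate_tokens_from_chars("The context window is finite."))  # 8
```

Estimates like these are fine for back-of-the-envelope planning. For anything near a limit, count with the tokenizer of the model you are actually calling.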

Context window sizes have grown dramatically. GPT-2 had a 1,024-token limit. GPT-3 doubled this to 2,048, and GPT-3.5 reached 4,096. GPT-4 Turbo reached 128,000 tokens, which is roughly 96,000 words by the rule of thumb above, or about the length of a full novel. Claude 3 supports up to 200,000 tokens. Some research models push into the millions. Despite this growth, the window remains finite, and understanding what goes inside it is critical for agent design.

Tokens Are Not Words

Tokenization splits text at subword boundaries, not word boundaries. 'unhappiness' might become ['un', 'happiness'] or ['unhappi', 'ness'] depending on the vocabulary. Code, JSON, and non-English text tokenize differently from English prose, often less efficiently. A JSON payload that looks like 200 words might consume 600 tokens. Always measure in tokens, not words, when reasoning about context limits.
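To see how the vocabulary alone changes the split, here is a toy tokenizer that always takes the longest vocabulary entry matching at the current position. It is an illustrative sketch only: production tokenizers (for example byte-pair encoding) learn their vocabularies from data and use different matching rules, and both vocabularies below are invented.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` by repeatedly taking the longest vocabulary entry that
    matches at the current position, falling back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


vocab_a = {"un", "happiness"}        # hypothetical vocabulary A
vocab_b = {"un", "unhappi", "ness"}  # hypothetical vocabulary B

print(greedy_tokenize("unhappiness", vocab_a))  # ['un', 'happiness']
print(greedy_tokenize("unhappiness", vocab_b))  # ['unhappi', 'ness']
```

The same string costs two tokens under either vocabulary here, but text the vocabulary covers poorly falls back to many small pieces, which is one reason code and JSON can be token-hungry.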

What Lives Inside the Context Window

The context window is not a private notebook. It is the entire visible universe of a single model call. In an agent setting, that window typically contains several distinct layers of information, all competing for the same finite space.

The system prompt occupies the first layer: instructions, persona definition, rules, and any background knowledge the developer injected at agent startup. This might run from a few hundred to several thousand tokens depending on how elaborate the agent's instructions are.

Conversation history follows: everything the user said and the model replied, all the way back to the start of the session, unless the agent has truncated or summarized it. This grows without bound as the session continues.

Tool call records come next: the structured output of every tool the agent invoked, such as web search results, database rows, file contents, and API responses. These can be very large. A single web page retrieved by a browsing tool might consume 10,000 tokens.

Finally, the current user message and any dynamic context the agent injects (retrieved documents, fresh facts) fill the remaining space before the model generates its response.
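The layering described above can be sketched as a small data structure that joins the layers in order and reports how much space each one takes. Everything here is illustrative: `Layer`, `build_context`, and `rough_tokens` are hypothetical names, and the token counts come from the four-characters-per-token approximation, not a real tokenizer.

```python
import math
from dataclasses import dataclass


def rough_tokens(text: str) -> int:
    # Assumption: about 4 characters per token. Use a real tokenizer in practice.
    return math.ceil(len(text) / 4)


@dataclass
class Layer:
    name: str
    text: str


def build_context(layers: list[Layer]) -> tuple[str, dict[str, int]]:
    """Join the layers in order and estimate how many tokens each one uses."""
    prompt = "\n\n".join(layer.text for layer in layers)
    usage = {layer.name: rough_tokens(layer.text) for layer in layers}
    return prompt, usage


layers = [
    Layer("system_prompt", "You are a research assistant. Always cite sources."),
    Layer("conversation_history",
          "User: What did you find in step 2?\n"
          "Agent: I found three candidate papers."),
    Layer("tool_output", "Search result: 'NASA confirms record Arctic ice loss in 2024...'"),
    Layer("dynamic_context", "Three retrieved paragraphs would be injected here."),
    Layer("current_user_message", "User: Now summarize everything and suggest next steps."),
]

prompt, usage = build_context(layers)
for name, tokens in usage.items():
    print(f"{name:22s} ~{tokens} tokens")
print("total:", sum(usage.values()))
```

Tracking usage per layer, not just the total, makes it obvious which layer is crowding out the others as a session grows.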

Match each piece of content to the layer of the context window it belongs to.

Terms

You are a research assistant. Always cite sources.
User: What did you find in step 2? Agent: I found three candidate papers.
Search result: 'NASA confirms record Arctic ice loss in 2024...' (3,400 tokens)
Three retrieved paragraphs injected by the retrieval module
User: Now summarize everything and suggest next steps.

Definitions

System prompt
Dynamic context from retrieval
Conversation history
Tool call output
Current user message

Hard Limits and Soft Degradation

When a prompt exceeds the context window, the API typically rejects the request with an error and the call fails entirely. This is the hard limit. But there is a softer failure mode that appears well before the limit is reached: attention degradation. In principle, attention lets the model reach any position in the window, but in practice models perform worse when critical information is buried in the middle of a very long context. Researchers call this the 'lost in the middle' problem. Instructions given at the beginning or end of a long prompt tend to be followed better than identical instructions buried thousands of tokens deep. For agents that inject large tool outputs into the middle of the context, this is a real reliability hazard: the model may appear to 'ignore' a retrieved document even though it fits within the window.
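Because the hard limit fails the whole call, an agent can check the arithmetic locally before sending anything. The sketch below assumes you already have a token count per layer. `ContextOverflowError` and `remaining_tokens` are hypothetical names, and the 8,000-token window is chosen only to keep the numbers easy to follow.

```python
class ContextOverflowError(ValueError):
    """Hypothetical local error, raised before any request is sent."""


def remaining_tokens(layer_tokens: dict[str, int], context_window: int) -> int:
    """Return the unused space, or raise if the prompt would not fit."""
    used = sum(layer_tokens.values())
    if used > context_window:
        raise ContextOverflowError(
            f"prompt needs {used} tokens but the window holds {context_window} "
            f"({used - context_window} over)"
        )
    return context_window - used


window = 8_000  # hypothetical small window
fits = {"system_prompt": 500, "history": 3_000, "tool_output": 4_000}
too_big = {"system_prompt": 500, "history": 3_000, "tool_output": 6_000}

print(remaining_tokens(fits, window))  # 500
try:
    remaining_tokens(too_big, window)
except ContextOverflowError as err:
    print("rejected:", err)
```

A check like this only guards against the hard limit. A prompt that passes it can still suffer from the softer degradation described above.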

Lost in the Middle

Studies of long-context models show that they use information best when it sits at the very start or very end of the context window. Critical instructions or retrieved facts placed in the middle of a 100,000-token prompt may be used unreliably or effectively overlooked. Place the most important content at the edges, not the center.
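One way to act on this advice is to fix the prompt layout in code, so that bulky material always lands in the middle and the critical pieces always land at the edges. This is a sketch of one possible layout, not a guaranteed fix. The function name is hypothetical, and repeating the instructions at the end is an assumption about what helps that you should test with your own model.

```python
def arrange_prompt(instructions: str, bulk_documents: list[str], question: str) -> str:
    """Place instructions first, bulky documents in the middle, and the
    question plus a short reminder of the instructions last."""
    parts = [
        instructions,                 # start edge: critical instructions
        *bulk_documents,              # middle: large tool outputs, retrieved pages
        question,                     # end edge: what the model must answer now
        "Reminder: " + instructions,  # assumption: restating at the end helps
    ]
    return "\n\n".join(parts)


prompt = arrange_prompt(
    instructions="Always cite sources.",
    bulk_documents=["<long search result 1>", "<long search result 2>"],
    question="Summarize everything and suggest next steps.",
)
print(prompt)
```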

Why Context Size Is Not a Free Lunch

Larger context windows are strictly better from a capacity standpoint, but they come with costs. In naive implementations, transformer attention has quadratic complexity with respect to sequence length: doubling the context requires four times the attention compute. Modern techniques like FlashAttention, sliding window attention, and sparse attention reduce the practical cost, but long-context processing is still significantly more expensive per token than short-context processing. For agents this creates a direct economic tradeoff: putting everything in the context is simple and guarantees nothing is left out, but it is expensive. Selectively retrieving only the most relevant information saves cost and often improves reliability by reducing noise. This tradeoff between capacity, cost, and relevance drives the architectures you will study in the rest of this module.
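The quadratic claim is easy to verify with arithmetic. In naive full self-attention every token is compared with every token, so a sequence of n tokens produces n × n query-key pairs. The sketch below counts only those pairs. It ignores the rest of the model and any of the optimizations mentioned above.

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of query-key pairs in naive full self-attention: n * n."""
    return seq_len * seq_len


base = attention_pair_count(1_000)
for n in (1_000, 2_000, 4_000):
    ratio = attention_pair_count(n) / base
    print(f"{n} tokens -> {ratio:.0f}x the attention work of 1,000 tokens")
# 1000 tokens -> 1x the attention work of 1,000 tokens
# 2000 tokens -> 4x the attention work of 1,000 tokens
# 4000 tokens -> 16x the attention work of 1,000 tokens
```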

A context window is measured in _____, not words. In a single model call, the model can only attend to information _____ that window. When the prompt exceeds the limit, the API returns an _____, and the call fails entirely. Even before that limit, placing critical content _____ a very long context can cause it to be effectively ignored.

An agent has a 128,000-token context window. Its system prompt uses 2,000 tokens, the conversation history uses 40,000 tokens, and it wants to inject a 90,000-token document. What happens?

Why is placing a critical instruction in the middle of a 100,000-token prompt riskier than placing it at the start?

Context Window Budget

You are designing an agent with a 128,000-token context window. Your agent must handle: (a) a system prompt with detailed instructions, (b) a conversation history that grows over time, (c) tool outputs from web searches and file reads, and (d) the current user message.

  1. Assign a token budget to each of these four categories. Your budgets must sum to no more than 128,000. Write your reasoning for each allocation.
  2. Estimate how many turns of conversation history your budget supports if the average exchange is 500 tokens. How many web pages (averaging 8,000 tokens each) can you include in a single call?
  3. What happens to your budget as the conversation runs for 50 turns? Where does it break?
  4. Propose two strategies to prevent the budget from overflowing at turn 50. You will study both strategies in depth in later lessons; for now, describe them in plain English.
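If you want to check your own arithmetic for this exercise, a small helper like the one below may be useful. It is only a sketch. The allocation shown is a placeholder, not a recommended answer, and the class and function names are hypothetical. Replace the numbers with your own budgets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    system_prompt: int
    history: int
    tool_outputs: int
    user_message: int

    def total(self) -> int:
        return self.system_prompt + self.history + self.tool_outputs + self.user_message


def units_supported(budget_tokens: int, tokens_per_unit: int) -> int:
    """How many whole exchanges (or pages) fit in a budget."""
    return budget_tokens // tokens_per_unit


def first_overflow_turn(history_budget: int, tokens_per_exchange: int,
                        max_turns: int) -> Optional[int]:
    """First turn at which untrimmed history exceeds its budget, if any."""
    for turn in range(1, max_turns + 1):
        if turn * tokens_per_exchange > history_budget:
            return turn
    return None


CONTEXT_WINDOW = 128_000
# Placeholder allocation for illustration only; substitute your own numbers.
budget = Budget(system_prompt=4_000, history=20_000, tool_outputs=100_000, user_message=4_000)

assert budget.total() <= CONTEXT_WINDOW, "budgets exceed the context window"
print("turns of history:", units_supported(budget.history, 500))
print("web pages per call:", units_supported(budget.tool_outputs, 8_000))
print("history overflows at turn:", first_overflow_turn(budget.history, 500, 50))
```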