Memory Design Challenge
This lesson is a hands-on design challenge. You will apply everything from this module — context windows, short- and long-term memory, retrieval-augmented generation, state management, summarization, and failure handling — to design a complete, production-quality memory system for a realistic AI agent. There is no single right answer. There are well-reasoned answers and poorly-reasoned answers, and the goal is to practice the reasoning.
A strong submission demonstrates: (1) accurate classification of information by lifespan and query type, (2) deliberate storage strategy choices with tradeoffs acknowledged, (3) explicit write and read triggers for every external store, (4) realistic failure handling, and (5) a coherent architecture where all the pieces fit together and nothing is missing.
The Agent: MedPrep AI
MedPrep AI is an AI tutoring agent that helps pre-medical students prepare for the MCAT, a standardized exam required for US medical school admission. The MCAT tests biology, chemistry, physics, psychology, and critical analysis. Students use MedPrep AI over months, across hundreds of sessions. The agent's capabilities:
- Content tutoring: a student asks a question about any MCAT topic, and the agent explains it clearly and gives examples.
- Practice question generation: the agent generates MCAT-style multiple-choice questions, the student answers, and the agent gives feedback.
- Progress tracking: the agent tracks which topics the student has studied, their accuracy per topic, and their improvement over time.
- Adaptive difficulty: the agent adjusts question difficulty based on the student's recent performance, with harder questions for strong topics and easier ones for weak topics.
- Study planning: the agent recommends what to study next, based on what is weakest and how much time remains before the exam date.
- Full MCAT content corpus: the agent has access to a 4,000-page MCAT content review library.
Your Challenge
Design MedPrep AI's Memory System
- Work through each part of this challenge in order. Each part builds on the previous one. Aim for a complete, coherent architecture by the end.
- PART A — Information Inventory (15 minutes)
- List every distinct category of information MedPrep AI must track. For each category, specify:
- - What it is (a brief description)
- - Its lifespan: ephemeral (current step only), session (current session only), or persistent (survives across sessions)
- - Its approximate size in tokens if it were placed in the context window
- - Whether it is read-heavy (retrieved often), write-heavy (updated often), or balanced
- You should identify at least 8 distinct categories. Examples to get you started: the student's exam date, the current question being discussed, the student's lifetime accuracy on biochemistry questions. Find at least 5 more.
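One way to keep the Part A inventory consistent is to record each category in a small structured form. The sketch below is only an illustration: the class and field names are hypothetical, the token counts are rough guesses, and the lifespan and access labels for the three starter examples are one defensible classification rather than the expected answer.

```python
from dataclasses import dataclass
from enum import Enum


class Lifespan(Enum):
    EPHEMERAL = "ephemeral"    # current step only
    SESSION = "session"        # current session only
    PERSISTENT = "persistent"  # survives across sessions


class AccessPattern(Enum):
    READ_HEAVY = "read-heavy"
    WRITE_HEAVY = "write-heavy"
    BALANCED = "balanced"


@dataclass(frozen=True)
class InventoryEntry:
    name: str
    description: str
    lifespan: Lifespan
    approx_tokens: int  # rough estimate (assumption, not a measurement)
    access: AccessPattern


# The three starter examples from the challenge text, classified one possible way.
inventory = [
    InventoryEntry("exam_date", "Date the student sits the MCAT",
                   Lifespan.PERSISTENT, 10, AccessPattern.READ_HEAVY),
    InventoryEntry("current_question", "Question being discussed right now",
                   Lifespan.EPHEMERAL, 300, AccessPattern.BALANCED),
    InventoryEntry("biochem_lifetime_accuracy",
                   "Lifetime accuracy on biochemistry questions",
                   Lifespan.PERSISTENT, 20, AccessPattern.BALANCED),
]


def persistent_entries(entries):
    """Return the names of categories that must survive across sessions."""
    return [e.name for e in entries if e.lifespan is Lifespan.PERSISTENT]


print(persistent_entries(inventory))
```

Filtering by lifespan like this makes it easy to see which categories will need an external store in Part B.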
- PART B — Storage Strategy Assignment (10 minutes)
- For each category from Part A, assign a storage strategy: context window, relational database, vector database, key-value store, or hierarchical summary. Justify each choice in one sentence. Where two strategies are defensible, choose one and explain why.
- PART C — Write and Read Triggers (15 minutes)
- For every category stored outside the context window, write:
- - Write trigger: what event causes this to be saved, and in what format?
- - Read trigger: at what point in the agent loop is this retrieved and injected into the prompt?
- Be precise. 'When relevant' is not a read trigger. 'When the user asks a topic question and the similarity score exceeds 0.75' is a read trigger.
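A precise read trigger can usually be written as a predicate that returns true or false. The sketch below encodes the example trigger from the text; the function name and the idea of a precomputed `is_topic_question` flag are assumptions made for illustration.

```python
def should_retrieve_content(is_topic_question: bool,
                            top_similarity: float,
                            threshold: float = 0.75) -> bool:
    """Read trigger: retrieve only for topic questions whose best
    similarity score strictly exceeds the threshold."""
    return is_topic_question and top_similarity > threshold


print(should_retrieve_content(True, 0.82))   # topic question, strong match
print(should_retrieve_content(True, 0.60))   # topic question, weak match
print(should_retrieve_content(False, 0.90))  # not a topic question
```

If a trigger cannot be expressed this way, it is probably still too vague.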
- PART D — RAG Design (10 minutes)
- MedPrep AI has a 4,000-page MCAT content corpus. Design the RAG pipeline:
- - What is your chunk size and why?
- - What metadata do you store alongside each chunk? (Hint: subject area, page number, topic tag)
- - How do you handle a query whose answer spans multiple chunks from different subjects?
- - What happens when a student asks a question that is not in the MCAT corpus at all?
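The metadata and fallback questions above can be made concrete with a small selection step that runs after similarity search. This is a minimal sketch under stated assumptions: the `Chunk` fields follow the hint (subject area, page number, topic tag), the 0.75 cutoff is borrowed from the Part C example, the per-subject cap of 2 is arbitrary, and the sample chunks, pages, and scores are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    subject: str    # subject area, e.g. "biochemistry"
    page: int       # page in the content library
    topic_tag: str  # finer-grained topic label


def select_chunks(scored, min_score=0.75, per_subject=2):
    """Keep the best-scoring chunks per subject so a cross-subject
    question gets coverage from each subject. An empty result means
    nothing in the corpus matched well enough (out-of-corpus query)."""
    kept = defaultdict(list)
    for chunk, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if score >= min_score and len(kept[chunk.subject]) < per_subject:
            kept[chunk.subject].append(chunk)
    return [c for chunks in kept.values() for c in chunks]


# Hypothetical search results as (chunk, similarity score) pairs.
scored = [
    (Chunk("Reaction rate laws", "chemistry", 87, "kinetics"), 0.83),
    (Chunk("Michaelis-Menten kinetics", "biochemistry", 212, "enzymes"), 0.91),
    (Chunk("Thin lens equation", "physics", 140, "optics"), 0.40),
]

selected = select_chunks(scored)
print([c.subject for c in selected])

# No chunk clears the cutoff: the agent should say the topic is outside
# its library rather than answer from weak matches.
print(select_chunks([(Chunk("Thin lens equation", "physics", 140, "optics"), 0.40)]))
```

Capping chunks per subject is only one way to handle cross-subject queries; running one filtered search per subject is another.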
- PART E — Context Window Budget (10 minutes)
- Your model has a 128,000-token context window. Assign a token budget to each component that will appear in the prompt during a typical tutoring session step:
- - System prompt (agent instructions and persona)
- - Student profile summary (preferences, exam date, current study plan)
- - Recent conversation history (last N turns)
- - Retrieved MCAT content chunks
- - Current student question
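- Budgets must sum to no more than 128,000 tokens. In most deployments the model's response is generated within the same window, so leave part of it unallocated for output. State how many recent turns and how many retrieved chunks your budget supports.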
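The budget arithmetic is easy to check mechanically. Every number below except the 128,000-token window is an assumption chosen for illustration, including the average tokens per turn and per chunk; substitute your own figures.

```python
CONTEXT_WINDOW = 128_000  # from the challenge text

# Hypothetical allocation per prompt component, in tokens.
budget = {
    "system_prompt": 2_000,
    "student_profile_summary": 1_500,
    "recent_conversation_history": 12_000,
    "retrieved_chunks": 8_000,
    "current_question": 500,
}

TOKENS_PER_TURN = 400   # assumed average size of one conversation turn
TOKENS_PER_CHUNK = 800  # assumed size of one retrieved chunk

total = sum(budget.values())
unallocated = CONTEXT_WINDOW - total
turns_supported = budget["recent_conversation_history"] // TOKENS_PER_TURN
chunks_supported = budget["retrieved_chunks"] // TOKENS_PER_CHUNK

if total > CONTEXT_WINDOW:
    raise ValueError(f"Budget exceeds the window by {total - CONTEXT_WINDOW} tokens")

print(f"allocated={total}, unallocated={unallocated}")
print(f"turns={turns_supported}, chunks={chunks_supported}")
```

With these figures the prompt uses 24,000 tokens, which supports 30 recent turns and 10 retrieved chunks and leaves 104,000 tokens unallocated. A budget does not have to fill the window; smaller prompts cost less per step.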
- PART F — Failure Cases (10 minutes)
- Describe how your memory system handles each of these failures:
- 1. The vector database is unavailable when a student asks a biochemistry question
- 2. The student has 2,000 sessions of history — far more than any context window can hold
- 3. A generated summary of the student's progress contains an error, recording biochemistry accuracy as 45% when it is actually 78%
- 4. The student changes their exam date mid-session — your progress tracking and study plan now have the wrong deadline
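For failure 3, one common defense is to treat summaries as a cache and the raw attempt records as the source of truth. The sketch below assumes attempts are stored as a list of booleans and that a 2-point tolerance is acceptable; both are illustrative choices, and the function names are hypothetical.

```python
def accuracy_from_attempts(attempts):
    """Recompute accuracy from raw records (True = answered correctly)."""
    if not attempts:
        return None
    return sum(attempts) / len(attempts)


def reconcile(summary_accuracy, attempts, tolerance=0.02):
    """Return the value to trust: the recomputed accuracy if the summary
    has drifted beyond the tolerance, otherwise the summary itself."""
    actual = accuracy_from_attempts(attempts)
    if actual is None:
        return summary_accuracy  # nothing to check against
    if abs(actual - summary_accuracy) > tolerance:
        return actual  # summary is wrong; overwrite it with this value
    return summary_accuracy


# The scenario from the text: the summary says 45%, the records say 78%.
biochem_attempts = [True] * 78 + [False] * 22
print(reconcile(0.45, biochem_attempts))
```

Running a check like this on a schedule, or whenever a summary is read for study planning, limits how long a bad summary can influence the agent.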
- PART G — Architecture Diagram (10 minutes)
- Draw a diagram of your complete memory system. Show:
- - The agent loop (plan, act, observe, update)
- - All storage components (with labels)
- - Arrows showing when information flows from the agent to storage (write)
- - Arrows showing when information flows from storage to the prompt (read)
- Your diagram should be clean enough that another engineer could implement it from the diagram alone.
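It can help to sanity-check the diagram against a skeleton of one agent step, marking where each read and write arrow would attach. The sketch below uses an in-memory dictionary as a stand-in for real stores; `InMemoryStore`, `run_step`, the key names, and the `answer_fn` callback are all hypothetical.

```python
class InMemoryStore:
    """Stand-in for an external store (database, key-value store, etc.)."""

    def __init__(self):
        self._data = {}

    def read(self, key, default=None):
        return self._data.get(key, default)

    def write(self, key, value):
        self._data[key] = value


def run_step(store, student_id, question, answer_fn):
    # Plan: READ arrow, storage -> prompt.
    profile = store.read(f"profile:{student_id}", default={})
    prompt = f"Profile: {profile}\nQuestion: {question}"

    # Act: call the model (here, any function from prompt to reply).
    reply = answer_fn(prompt)

    # Observe and update: WRITE arrow, agent -> storage.
    history = store.read(f"history:{student_id}", default=[])
    store.write(f"history:{student_id}", history + [(question, reply)])
    return reply


store = InMemoryStore()
store.write("profile:s1", {"exam_date": "2025-06-01"})
reply = run_step(store, "s1", "What is an enzyme?", lambda p: "stub reply")
print(reply, len(store.read("history:s1")))
```

Each `store.read` and `store.write` call in a skeleton like this should correspond to exactly one arrow in the diagram.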
- PART H — Reflection (5 minutes)
- Answer in 2-3 sentences each:
- 1. What is the single greatest memory risk for MedPrep AI over a multi-month usage period, and why?
- 2. If you had to cut your design to the simplest possible version that still works, what would you keep and what would you drop?
- 3. What would change in your design if the model's context window were unlimited?
Evaluation Criteria
Strong designs share several characteristics. They are complete — no category of information from the agent's requirements is left without an assigned store, write trigger, and read trigger. They are specific — triggers are precise events, not vague conditions. They are honest about tradeoffs — every storage choice involves a tradeoff and strong designs name it. They handle failure gracefully — every external dependency has a fallback. And they are coherent — the budget math adds up, the stores do not duplicate each other unnecessarily, and the diagram matches the written design.
Production memory systems are rarely designed once and left unchanged. They are iterated on as agent behavior is observed in the real world, failure modes are discovered, and token costs are measured. Your design here is a first version: good engineering means being honest about its current weaknesses and knowing what you would improve next.
A student's lifetime topic accuracy data is a persistent record updated after every practice session. Which storage strategy is most appropriate for this data, and why?
When designing the read trigger for retrieved MCAT content chunks, which specification is most precise and therefore most useful?