Capstone Project: Build & Test a Model
You have completed all five modules of AI Foundations Tier 2. You know how machines learn from data, how neural networks process information layer by layer, how generative AI creates new content from patterns, and why ethics must be built into every system from the start. Now it is time to put all of that together. In this capstone you will do what real AI researchers do: choose a problem, collect training examples, train a model, and then test it — honestly — on data it has never seen before. Half this project is building. The other half is finding out whether what you built actually works.
Your Mission
Your mission has three phases: Build, Test, and Write Up. In Phase 1 you will train a classifier using the Teachable Machine lab. In Phase 2 you will evaluate it using a held-out test set — examples your model has never seen during training. In Phase 3 you will write a short research-style report that honestly describes what your model got right, what it got wrong, and what you would change. The most important word in this capstone is HONEST. Anyone can report a number that sounds good. A real engineer reports the number that is true, then explains it.
A model is only as trustworthy as its evaluation. If you test a model on the same data you used to train it, you are not measuring learning — you are measuring memorization. The only honest measure of a model's ability is its performance on examples it has never encountered before.
Phase 1: Build
Train Your Classifier
- Step 1: Choose a classification task with at least two distinct categories. Good examples: identify hand signs for letters A, B, and C; sort drawings of cats vs. dogs; recognize thumbs-up vs. thumbs-down gestures. Pick something where you can collect 30 or more examples per class.
- Step 2: Before you collect a single example, write down exactly what makes each class different. These distinguishing characteristics are your features. Knowing your features in advance helps you collect better training data.
- Step 3: Open the Teachable Machine lab at /institute/lab/teachable-machine. Create a project with your chosen classes. Collect at least 30 examples per class — more variety is better than more copies of the same image. Vary the background, lighting, and angle.
- Step 4: Set aside 20% of your examples as your TEST SET before you train anything. Take them from every class so each class is represented, and keep them in a separate folder or label them clearly. Do not use them during training under any circumstances.
- Step 5: Train your model on the remaining 80% of examples. Once training is complete, open the Neural Net Visualizer lab at /institute/lab/neural-net-visualizer and explore how a network with similar inputs and outputs distributes activation across its layers. Note which layer seems to do the most work for separating your classes.
- Step 6: Record your training configuration — number of classes, number of training examples per class, number of epochs if shown, and any settings you adjusted.
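The 80/20 split in Step 4 can be sketched in a few lines of Python. This is only an illustration, assuming your examples are saved as files; the file names and the `split_holdout` helper are hypothetical and are not part of the Teachable Machine lab.

```python
import random

def split_holdout(examples, test_fraction=0.2, seed=0):
    """Shuffle one class's examples and split them into (train, test) lists."""
    shuffled = list(examples)
    # A fixed seed makes the split repeatable, so the same examples stay held out.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical file names for one class with 30 collected examples.
thumbs_up = [f"thumbs_up_{i:02d}.png" for i in range(30)]
train_set, test_set = split_holdout(thumbs_up)

print(len(train_set), len(test_set))  # 24 6
```

Splitting each class separately keeps every class represented in the test set, and shuffling first avoids holding out only the last few examples you happened to collect.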
Phase 2: Test
Training accuracy, the score your model gets on its own training data, is almost always high and almost always misleading. The training data is familiar: the model has already adjusted its weights to handle exactly those examples. Reporting training accuracy as your final result is like studying the answer key and then taking the same test; the score says little about what you actually learned. The number that matters is test accuracy: how well your model performs on the held-out examples it has never seen before. Performing well on new examples is called generalization, the ability to apply learned patterns to new situations. Credible machine-learning papers and benchmarks report results on held-out test data for exactly this reason. Your test set must stay completely separate from your training set. If even one test example leaks into training, the result is contaminated, and you must retrain without that example before testing again.
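One way to check for leaks is to compare the contents of your training and test files. The sketch below assumes you have each file's raw bytes available; the file names, the byte strings, and the `find_leaks` helper are made up for illustration. It only catches exact copies, so a near-duplicate photo (same pose, slightly different crop) still needs a human eye.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that identifies a file by its contents."""
    return hashlib.sha256(data).hexdigest()

def find_leaks(train_files, test_files):
    """Return names of test files whose bytes also appear in the training set."""
    train_prints = {fingerprint(data) for data in train_files.values()}
    return sorted(
        name for name, data in test_files.items()
        if fingerprint(data) in train_prints
    )

# Hypothetical stand-ins for image files: name -> raw bytes.
train_files = {"cat_01.png": b"pixels-1", "cat_02.png": b"pixels-2"}
test_files = {"cat_09.png": b"pixels-9", "copy_of_cat_02.png": b"pixels-2"}

print(find_leaks(train_files, test_files))  # ['copy_of_cat_02.png']
```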
Evaluate Your Model Honestly
- Step 1: Take out your held-out test set — the 20% of examples you set aside before training.
- Step 2: Run each test example through your trained model one at a time. For each example, record: the correct class label, the class your model predicted, and whether that prediction was correct (yes or no).
- Step 3: Count the total number of test examples. Count how many the model predicted correctly. Divide correct predictions by total examples and multiply by 100. This is your test accuracy percentage.
- Step 4: Build a confusion matrix — a simple grid. Rows represent the actual class; columns represent the predicted class. Fill in how many examples landed in each cell. A perfect model has all counts on the diagonal and zeros everywhere else.
- Step 5: Look at your off-diagonal cells. Which class did your model most often confuse with which other class? Write a one-sentence hypothesis explaining why that confusion happened — think about the features that those two classes share.
- Step 6: Record your final test accuracy and your confusion matrix. These two pieces of information are the core of your write-up.
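The arithmetic in Steps 3 to 5 can be sketched in Python as follows. The labels below are invented for illustration (15 test examples across three hand-sign classes), and the function names are hypothetical; substitute the records you wrote down in Step 2.

```python
from collections import Counter

def accuracy_percent(actual, predicted):
    """Correct predictions divided by total examples, times 100."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100 * correct / len(actual)

def confusion_matrix(actual, predicted, classes):
    """Rows are the actual class, columns are the predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def most_confused(matrix, classes):
    """Return (actual, predicted, count) for the largest off-diagonal cell."""
    best = None
    for i, actual_cls in enumerate(classes):
        for j, predicted_cls in enumerate(classes):
            if i != j and (best is None or matrix[i][j] > best[2]):
                best = (actual_cls, predicted_cls, matrix[i][j])
    return best

# Made-up test records: five examples per class.
classes = ["A", "B", "C"]
actual = ["A"] * 5 + ["B"] * 5 + ["C"] * 5
predicted = ["A", "A", "A", "A", "B",
             "B", "B", "B", "A", "A",
             "C", "C", "C", "C", "C"]

matrix = confusion_matrix(actual, predicted, classes)
print(accuracy_percent(actual, predicted))  # 80.0
print(matrix)                               # [[4, 1, 0], [2, 3, 0], [0, 0, 5]]
print(most_confused(matrix, classes))       # ('B', 'A', 2)
```

In this made-up result, the model most often mistakes an actual B for an A, so B and A are the pair to examine when writing the Step 5 hypothesis.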
If your model scores very high on training data but much lower on your test set, it has probably overfit. Overfitting means the model learned the specific quirks of your training examples (the particular background color, the exact lighting, the precise angle you always used) rather than the underlying pattern. An overfit model has memorized its examples instead of learning from them. Common fixes are more diverse training data, fewer training epochs, or a simpler model architecture.
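A quick way to spot this pattern is to compare the two scores directly. In the sketch below, the 10-point threshold is an arbitrary assumption chosen for illustration, not a standard cutoff, and the function names are hypothetical.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """Percentage points lost when moving from training data to test data."""
    return train_accuracy - test_accuracy

def looks_overfit(train_accuracy, test_accuracy, max_gap=10.0):
    """Flag a model whose gap exceeds max_gap (an assumed threshold)."""
    return generalization_gap(train_accuracy, test_accuracy) > max_gap

print(generalization_gap(98, 61))  # 37
print(looks_overfit(98, 61))       # True
print(looks_overfit(92, 89))       # False
```

A large gap is a warning sign rather than proof. With a small test set, a few unlucky examples can move the test score by several points.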
Phase 3: Write Up Your Findings
A write-up is not a victory lap. It is an honest account of what you built, what happened when you tested it, and what you learned, including the parts that did not go as planned. In professional AI research, a paper that presents failures alongside successes is more valuable than one that only shows the best results, because failures teach other researchers what not to try. Your write-up should be structured into four sections:
- Section 1 (What I Built): describe your task, your classes, how many training examples you collected, and what the model's architecture looks like at a high level.
- Section 2 (How I Tested It): describe your test set: how many examples, how you kept it separate, and how you computed your accuracy.
- Section 3 (What the Numbers Say): report your test accuracy and your confusion matrix. Identify the most common error type and your best hypothesis for why it happens.
- Section 4 (What I Would Do Differently): propose at least two concrete improvements, meaning specific changes to your data, your training process, or your class definitions that you predict would raise test accuracy.
Write Your Research Report
- Step 1: Open a blank document. Write a title that includes your task name and the date.
- Step 2: Write Section 1 (What I Built) in three to five sentences. Be specific: name your classes, state how many training examples you collected for each, and describe one feature that distinguishes each class.
- Step 3: Write Section 2 (How I Tested It) in two to three sentences. State exactly how many examples were in your test set and confirm that none of them were used during training.
- Step 4: Write Section 3 (What the Numbers Say). Include your test accuracy as a percentage and paste or draw your confusion matrix. Describe your most common error and your hypothesis in one paragraph.
- Step 5: Write Section 4 (What I Would Do Differently). Propose two concrete improvements. For each improvement, explain the mechanism — why do you predict this change would reduce the error you identified?
- Step 6: Read your write-up out loud. Anywhere you used a vague word like 'good' or 'bad,' replace it with a specific observation. Vague language hides unclear thinking.
You train a model that scores 98% on your training data but only 61% on your test data. What does this most likely mean?
A classmate says: 'I got 95% accuracy, so my model is great.' What is the most important follow-up question you should ask?
Prompt Challenge
Write a prompt you could give a generative AI assistant to help you brainstorm ways to improve a model that confuses two specific classes.
Your prompt should…
- Name the two classes that are being confused
- Describe what features the two classes share that might cause confusion
- Ask for at least three concrete data-collection strategies to help the model tell them apart
You have done what professional machine-learning engineers do every day: defined a problem, collected data, trained a model, evaluated it honestly, and reflected on how to improve it. The habit of honest evaluation — measuring what is true, not what is flattering — is what separates good engineers from great ones. Every AI system in the world would be safer and fairer if every person who built it held themselves to that standard.
Congratulations on completing AI Foundations. You have built real intuition for how machine learning works, why data quality determines model quality, and why ethics is not an add-on but a foundation. In the High School tier, you will go further — writing code, training models programmatically, and diving into the mathematics that makes all of this possible. See you there.