How a Network Learns
So far you have seen what a neural network is made of (neurons and weights) and how it makes predictions (the forward pass). But networks do not arrive pre-loaded with useful weights. They start with random numbers and have to discover the right configuration on their own — by looking at examples with known answers, comparing their predictions to those answers, and adjusting every weight in the direction of improvement. This adjustment process is called training, and the algorithm that drives the weight adjustments is called backpropagation. Understanding it, at least conceptually, is one of the most important things you can learn about how modern AI works.
The Training Loop: Three Phases
Training a neural network is a cycle that repeats thousands or millions of times. Each cycle has three phases:
Phase 1: Forward pass. Feed a training example (an input with a known correct answer) into the network and get its current prediction. If you are training an image classifier, the input might be a photo of a dog, and the correct answer is 'dog.'
Phase 2: Measure the error. Compare the network's prediction to the correct answer. This comparison is done by a loss function, a formula that produces a single number representing how wrong the prediction was. If the network said '73% dog, 15% cat, 12% bird' and the correct answer is 'dog,' the loss is small. If it said '12% dog, 88% cat,' the loss is large. The goal of training is to make the loss as small as possible.
Phase 3: Backpropagation. Here is where the magic happens. Starting from the output layer and working backward through the network, the algorithm computes how much each weight contributed to the error. Weights that made the error bigger get adjusted in the direction that reduces their contribution; weights that barely mattered get adjusted less.
Then the cycle repeats with the next training example.
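To make the Phase 2 comparison concrete, here is a minimal sketch that scores the two dog predictions from the text. It assumes a cross-entropy loss (the negative log of the probability given to the correct answer), which is a common choice for classifiers; the lesson does not name a specific loss function, and the function and variable names are only illustrative.

```python
import math

def cross_entropy(predicted_probs, correct_label):
    """Assumed loss: negative log of the probability assigned to the correct answer."""
    return -math.log(predicted_probs[correct_label])

# The two predictions described in the text; the correct answer is 'dog'.
mostly_right = {"dog": 0.73, "cat": 0.15, "bird": 0.12}
mostly_wrong = {"dog": 0.12, "cat": 0.88, "bird": 0.0}

small_loss = cross_entropy(mostly_right, "dog")
large_loss = cross_entropy(mostly_wrong, "dog")

print(f"73% dog -> loss {small_loss:.2f}")  # about 0.31
print(f"12% dog -> loss {large_loss:.2f}")  # about 2.12
```

The more probability the network puts on the correct answer, the closer the loss gets to zero.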
Backpropagation is an algorithm that computes, for every weight in the network, how much the prediction error would change if that weight were adjusted slightly. It uses the chain rule from calculus to compute these sensitivities efficiently, even for networks with millions of weights. You do not need to know the calculus to understand the idea: it is the process of tracing errors backward through the network to find which weights to blame, so that each one can be nudged to do better.
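For readers who do want to see the chain rule at work, the sketch below traces the error backward through a deliberately tiny network: one input, one hidden value, one output, and two weights. The squared-error loss, the lack of an activation function, and all names and numbers are assumptions chosen to keep the arithmetic easy to follow by hand.

```python
def forward(w1, w2, x):
    """Forward pass: input -> hidden value -> prediction."""
    hidden = w1 * x
    prediction = w2 * hidden
    return hidden, prediction

def loss_fn(prediction, target):
    """Assumed loss: squared error."""
    return (prediction - target) ** 2

def backward(w1, w2, x, target):
    """Backward pass: apply the chain rule from the output back to each weight."""
    hidden, prediction = forward(w1, w2, x)
    d_loss_d_pred = 2 * (prediction - target)   # how the loss reacts to the prediction
    d_loss_d_w2 = d_loss_d_pred * hidden        # blame assigned to the output weight
    d_loss_d_hidden = d_loss_d_pred * w2        # error passed back to the hidden value
    d_loss_d_w1 = d_loss_d_hidden * x           # blame assigned to the first weight
    return d_loss_d_w1, d_loss_d_w2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Sanity check: nudge w1 slightly and measure the change in loss directly.
eps = 1e-6
loss_up = loss_fn(forward(w1 + eps, w2, x)[1], target)
loss_down = loss_fn(forward(w1 - eps, w2, x)[1], target)
numeric_grad_w1 = (loss_up - loss_down) / (2 * eps)

print(grad_w1, grad_w2)   # 15.0 -5.0
print(numeric_grad_w1)    # very close to 15.0
```

The backward pass reuses `d_loss_d_pred` for both weights. That reuse is what keeps the method practical at scale: each intermediate result is computed once and shared by every weight upstream of it.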
The size of the 'nudge' applied to each weight is controlled by a number called the learning rate. If the learning rate is too large, weights jump around wildly and the network never settles on good values, like trying to tune a guitar string by twisting the peg too fast. If the learning rate is too small, training takes forever, like moving a millimeter at a time toward a destination miles away. Choosing a good learning rate is one of the key decisions in training a neural network, and researchers have developed many techniques to adjust it automatically during training.
The procedure that puts these pieces together is called gradient descent. The loss function defines what 'wrong' means, backpropagation computes the gradient of the loss with respect to every weight, and the learning rate sets how far the weights move on each step. The 'gradient' is a mathematical direction: the direction in which the loss increases most sharply. Gradient descent moves the weights in the opposite direction: downhill, toward lower loss. Repeat this enough times, with enough training examples, and the weights typically settle into values that produce good predictions on the training data and, if training has gone well, on new data too.
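The effect of the learning rate is easy to see on a toy problem. The sketch below runs gradient descent on a made-up one-weight loss, (w - 3)^2, whose best value is w = 3, using three learning rates. The loss, the starting point, and the specific rates are illustrative assumptions, not recommended settings.

```python
def gradient(w):
    """Gradient of the toy loss (w - 3)^2 with respect to w."""
    return 2 * (w - 3.0)

def run_gradient_descent(learning_rate, steps=20, w=0.0):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

results = {lr: run_gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}

for lr, w in results.items():
    print(f"learning rate {lr}: w = {w:.3f}, distance from best = {abs(w - 3.0):.3f}")
```

After 20 steps, the rate of 0.1 lands close to 3 and the rate of 0.001 has barely left the starting point. The rate of 1.1 overshoots further on every step and ends up far from the target.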
What the Network Is Actually Learning
It is worth being precise about what 'learning' means here. The network is not learning facts in the way you memorize a definition. It is adjusting numbers until its outputs match the training data. The goal is not to memorize the training data, because that would be useless. The goal is to find weights that generalize: that work correctly on new examples the network has never seen before. When a network generalizes well, it has found weights that capture something real about the structure of the problem, not just the specific examples it trained on. When it fails to generalize, we say it has overfit: it memorized the training examples (including their noise and quirks) rather than learning the underlying pattern. Overfitting is one of the most common failures in machine learning, and detecting it requires separate validation and test data that the network never trains on.
A network that perfectly memorizes its training data but fails on new examples is useless. The whole point of training is generalization — learning something true about the world from examples, not just echoing those examples back. This is why you always need a test set: data the model never trained on, to check whether it actually learned or just memorized.
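The difference between memorizing and generalizing shows up as soon as you score a model on held-out data. The sketch below is not a neural network: it compares a lookup table that memorizes its training examples against a one-weight line fit, using a small made-up dataset that roughly follows y = 2x. All numbers and names are illustrative assumptions.

```python
# Made-up data that roughly follows y = 2x, with a little noise.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5, 6], [10.1, 11.9]   # held out: never used for fitting

# Model A: memorize every training example; guess the average for anything new.
memory = dict(zip(train_x, train_y))
average_y = sum(train_y) / len(train_y)

def memorizer(x):
    return memory.get(x, average_y)

# Model B: fit a single weight w so that y is approximately w * x (least squares).
w = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)

def line(x):
    return w * x

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

memorizer_train = mean_squared_error(memorizer, train_x, train_y)
memorizer_test = mean_squared_error(memorizer, test_x, test_y)
line_train = mean_squared_error(line, train_x, train_y)
line_test = mean_squared_error(line, test_x, test_y)

print(f"memorizer: train error {memorizer_train:.3f}, test error {memorizer_test:.3f}")
print(f"line fit:  train error {line_train:.3f}, test error {line_test:.3f}")
```

The memorizer scores perfectly on the training data and badly on the test data. The line fit is slightly imperfect on both, but its test error stays low because it captured the underlying pattern.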
What is the role of the loss function during training?
Why do we need a separate test set that the network never trains on?
Human Backpropagation
- Play the role of a neural network being trained. A partner (or the instructions below) is the trainer.
- The task: you predict whether a sentence is formal or casual. After each prediction, you get feedback.
- Sentence 1: 'I would be grateful for your assistance.' You predict: ___. Correct answer: Formal.
- Sentence 2: 'Hey, wanna grab pizza?' You predict: ___. Correct answer: Casual.
- Sentence 3: 'The committee has reviewed the proposal.' You predict: ___. Correct answer: Formal.
- After each answer, the trainer tells you if you were right or wrong. If wrong, you must say: 'I think I weighted the [word/feature] too high or too low. I will adjust my rule.'
- After all three rounds, write down one 'rule' (a feature of the sentence) you adjusted during the exercise.
- This is conceptually what backpropagation does: measure the error, trace it back to the features that caused it, and adjust their influence (weights) for next time.
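The same exercise can be sketched in code. The version below keeps one weight per word and, whenever a prediction is wrong, shifts the weights of the words in that sentence toward the correct answer. This is a simple perceptron-style update rather than full backpropagation, and the word-level features, the zero starting weights, and the tie-breaking rule (a score of zero counts as formal) are all assumptions made for illustration.

```python
from collections import defaultdict

def words_in(sentence):
    """Lowercase the sentence, drop punctuation, and split into words."""
    cleaned = "".join(ch for ch in sentence.lower() if ch.isalpha() or ch.isspace())
    return cleaned.split()

weights = defaultdict(float)   # one weight per word, all starting at 0.0

def predict(sentence):
    score = sum(weights[word] for word in words_in(sentence))
    return "formal" if score >= 0 else "casual"

def train_on(sentence, correct_answer, learning_rate=1.0):
    """Predict, compare to the answer, and adjust the words to blame if wrong."""
    guess = predict(sentence)
    if guess != correct_answer:
        direction = 1.0 if correct_answer == "formal" else -1.0
        for word in words_in(sentence):
            weights[word] += learning_rate * direction
    return guess

examples = [
    ("I would be grateful for your assistance.", "formal"),
    ("Hey, wanna grab pizza?", "casual"),
    ("The committee has reviewed the proposal.", "formal"),
]

for sentence, answer in examples:
    guess = train_on(sentence, answer)
    print(f"guessed {guess}, correct answer {answer}")

print(predict("Wanna grab lunch?"))   # casual
```

Only the second sentence triggers an adjustment, because the untrained model guesses 'formal' for everything. After that one correction, the words 'wanna' and 'grab' carry negative weight, which is enough to classify a new casual sentence correctly.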