How a Network Learns
You have watched data flow forward through a network and arrive at a prediction. But predictions start out wrong, often badly wrong: a freshly initialized network with random weights is effectively guessing. Learning is the process of taking those wrong answers and using them to make the weights a little better, over and over, millions of times. The algorithm at the heart of this process is called backpropagation, and it is one of the most important ideas in all of modern AI.
Measuring the Mistake: The Loss Function
Before the network can improve, it needs to measure exactly how wrong it is. This measurement is called the loss (sometimes called the error or cost). The loss function compares the network's prediction to the correct answer and returns a single number: zero means perfect, higher means worse.
A common loss function for regression (predicting a number) is mean squared error: square the difference between predicted and actual, then average across all training examples. Squaring makes every error non-negative and punishes large mistakes more harshly than small ones.
For classification (predicting a category), a common loss is cross-entropy: it measures how surprised the network is by the correct answer. If the network was 95% confident in the right answer, cross-entropy is very low. If it was only 5% confident in the right answer (meaning 95% of its confidence went to wrong answers), cross-entropy is very high. The goal of training is to minimize the loss: to find the set of weights where the loss is as small as possible.
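To make the two losses concrete, here is a minimal sketch in plain Python. The function names and the sample numbers are illustrative choices, not part of any library, and the cross-entropy shown is the single-example form: the negative natural log of the probability the network gave to the correct answer.

```python
import math

def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(prob_of_correct_class):
    """Loss for one example: -log(probability assigned to the right answer)."""
    return -math.log(prob_of_correct_class)

# Regression: three made-up predictions compared with their targets.
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

# Classification: the same correct answer, predicted with different confidence.
confident_right = cross_entropy(0.95)  # about 0.05 (low loss)
confident_wrong = cross_entropy(0.05)  # about 3.0 (high loss)

print(f"MSE: {mse:.4f}")
print(f"Cross-entropy at 95% confidence: {confident_right:.4f}")
print(f"Cross-entropy at 5% confidence:  {confident_wrong:.4f}")
```

Note how the 5% case costs roughly sixty times more than the 95% case: cross-entropy grows sharply as confidence in the right answer drops toward zero.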
Backpropagation is the algorithm that computes how much each weight in the network contributed to the current loss. It works backward from the output layer to the input layer using the chain rule from calculus, assigning each weight a gradient: a number whose sign says which direction to nudge the weight to reduce the loss, and whose size says how strongly the loss responds to that nudge.
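As a rough illustration of the chain rule at work, the sketch below hand-derives the gradients for a deliberately tiny network (one input, one tanh hidden unit, one output, squared-error loss) and checks them against a numerical estimate. The network shape, the tanh activation, and all the numbers are assumptions made for this example only.

```python
import math

def forward(w1, w2, x):
    """Tiny network: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    output = w2 * hidden
    return hidden, output

def loss_fn(w1, w2, x, target):
    _, output = forward(w1, w2, x)
    return (output - target) ** 2

def backprop(w1, w2, x, target):
    """Apply the chain rule from the loss back toward the input."""
    hidden, output = forward(w1, w2, x)
    dloss_doutput = 2 * (output - target)            # derivative of the squared error
    dloss_dw2 = dloss_doutput * hidden               # output = w2 * hidden
    dloss_dhidden = dloss_doutput * w2               # pass the signal back one layer
    dloss_dw1 = dloss_dhidden * (1 - hidden ** 2) * x  # derivative of tanh, then of w1 * x
    return dloss_dw1, dloss_dw2

def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Estimate each gradient by nudging one weight up and down a tiny amount."""
    g1 = (loss_fn(w1 + eps, w2, x, target) - loss_fn(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss_fn(w1, w2 + eps, x, target) - loss_fn(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2

analytic = backprop(0.5, -0.3, 1.5, 1.0)
numeric = numerical_gradient(0.5, -0.3, 1.5, 1.0)
print("chain rule:", analytic)
print("nudge test:", numeric)
```

The two results should agree to several decimal places. The nudge test is the literal meaning of a gradient; backpropagation gets the same answer for every weight in a single backward sweep instead of re-running the network once per weight.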
Here is the conceptual picture of backpropagation in four steps:
- Step 1: Forward pass. Run the training example through the network and get a prediction.
- Step 2: Compute the loss. Compare the prediction to the correct answer using the loss function. Say the prediction was 0.3 but the correct answer was 1.0. The loss is large.
- Step 3: Backward pass. Starting from the loss, work backward through every layer. For each weight, compute its gradient: how much does the loss change if we nudge this weight up by a tiny amount? A positive gradient means nudging the weight up makes the loss worse, so nudge it down instead. A negative gradient means nudging it up reduces the loss, so do that. The larger the gradient's magnitude, the more the loss depends on that weight.
- Step 4: Update the weights. Nudge every weight by a tiny amount in the direction that reduces loss. The size of the nudge is controlled by the learning rate, a small number like 0.001. Too large a nudge and the network overshoots and diverges. Too small and learning takes forever.
Repeat millions of times across thousands of training examples. The weights gradually settle into values that make the loss small, and at that point the network has learned.
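The four steps can be sketched as a loop over a "network" with a single weight. This is a toy under stated assumptions: the model is just prediction = weight * x, the starting weight of 0.3 is chosen so the first prediction matches the 0.3-versus-1.0 example above, and the learning rates (0.1 and 1.5) are picked to make the behavior visible in a few steps rather than to reflect realistic settings.

```python
def train(learning_rate, steps, x=1.0, target=1.0, weight=0.3):
    """Run the four-step loop on a one-weight model and record the loss each step."""
    losses = []
    for _ in range(steps):
        prediction = weight * x                    # Step 1: forward pass
        loss = (prediction - target) ** 2          # Step 2: compute the loss
        gradient = 2 * (prediction - target) * x   # Step 3: backward pass (chain rule)
        weight -= learning_rate * gradient         # Step 4: nudge against the gradient
        losses.append(loss)
    return weight, losses

# A modest learning rate: the loss shrinks every step.
good_weight, good_losses = train(learning_rate=0.1, steps=50)

# A learning rate that is too large: every update overshoots and the loss grows.
bad_weight, bad_losses = train(learning_rate=1.5, steps=10)

print(f"lr=0.1: weight={good_weight:.5f}, first loss={good_losses[0]:.2f}, last loss={good_losses[-1]:.2e}")
print(f"lr=1.5: weight={bad_weight:.1f}, first loss={bad_losses[0]:.2f}, last loss={bad_losses[-1]:.2e}")
```

With the smaller rate the weight approaches 1.0, the value that makes the prediction correct. With the larger rate each update jumps past that value by more than the previous miss, which is the overshoot-and-diverge failure described in Step 4.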
Gradient Descent: Walking Downhill
The process of nudging weights toward lower loss has a name: gradient descent. Imagine the loss as a landscape of hills and valleys. Every set of weights corresponds to one point in that landscape. The gradient points in the steepest uphill direction, so gradient descent steps the opposite way, adjusting weights to reduce the loss one small step at a time. The challenge is that this landscape has one dimension per weight (millions or billions in a large network) and is filled with valleys, ridges, and flat regions. Modern deep learning uses variations like mini-batch gradient descent (update weights after each small batch of examples, not the entire dataset) and Adam (an optimizer that adapts the learning rate per weight). But the core idea never changes: measure the loss, figure out which way is downhill for each weight, take a small step.
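Mini-batch gradient descent can be sketched in a few lines of plain Python. The example below is an illustration under assumptions, not a production optimizer: it fits a straight line y = w * x + b to noise-free synthetic data generated from y = 2x + 1, and the batch size, learning rate, epoch count, and seed are arbitrary choices.

```python
import random

def minibatch_gradient_descent(xs, ys, learning_rate=0.1, batch_size=4, epochs=100, seed=0):
    """Fit y = w * x + b by updating the weights after each small batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i]  # d(error^2)/dw
                grad_b += 2 * error          # d(error^2)/db
            # Step downhill using the average gradient over this batch only.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b

# Synthetic data from the line y = 2x + 1, with x spread over [-1, 1).
xs = [i / 16 - 1 for i in range(32)]
ys = [2 * x + 1 for x in xs]

w, b = minibatch_gradient_descent(xs, ys)
print(f"learned w={w:.4f}, b={b:.4f}")
```

Each update sees only four examples, so any single step is a noisy estimate of the true downhill direction, yet the weights still settle close to w = 2 and b = 1. Optimizers such as Adam build on this same loop by adjusting the effective step size for each weight.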
If a network trains too long on the same examples, it can memorize them perfectly without learning general patterns, like a student who memorizes practice test answers verbatim rather than understanding the subject. On new examples it has never seen, it fails. This is called overfitting. The standard safeguard is to hold out validation data the network never trains on and to check its loss there: if the training loss keeps falling while the validation loss rises, the network is overfitting.
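The memorizing student can be caricatured in code. The sketch below is a deliberately extreme toy, not how real networks overfit in detail: the "memorizer" is a hypothetical lookup table that returns stored answers for examples it has seen and 0.0 for anything else, and it is compared with a rule that captures the true pattern y = 2x + 1. The split helper and its 25% validation fraction are assumptions for the example.

```python
import random

def split_train_validation(examples, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def mean_squared_error(model, examples):
    return sum((model(x) - y) ** 2 for x, y in examples) / len(examples)

examples = [(x, 2 * x + 1) for x in range(20)]
train, validation = split_train_validation(examples)

memorized = dict(train)  # every training answer stored verbatim

def memorizer(x):
    """Perfect on seen inputs, clueless (returns 0.0) on anything new."""
    return memorized.get(x, 0.0)

def general_rule(x):
    """Captures the underlying pattern instead of individual answers."""
    return 2 * x + 1

print("memorizer    train loss:", mean_squared_error(memorizer, train))
print("memorizer    validation loss:", mean_squared_error(memorizer, validation))
print("general rule validation loss:", mean_squared_error(general_rule, validation))
```

The memorizer scores a perfect zero on the training set and still fails on the held-out examples, which is why training loss alone cannot tell you whether a network has learned anything general.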
What does the loss function produce?
Why is the learning rate kept small (like 0.001) rather than large (like 10)?
The Guessing Game Gradient
- Step 1: Pick a secret number from 1 to 100. Do not reveal it.
- Step 2: A partner guesses a number. You say 'too high' (the guess overshot, so nudge it down), 'too low' (the guess undershot, so nudge it up), or 'exactly right' (loss = 0).
- Step 3: Your partner adjusts their guess by exactly 10 each time — the learning rate is 10.
- Step 4: Count how many guesses it takes to get within 2 of the secret number.
- Step 5: Replay with a learning rate of 2. Compare the number of steps.
- Step 6: Discuss: what would happen with a learning rate of 50? Why might the guess keep jumping past the answer?
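If you want to check your answers after playing, the game can be simulated. This sketch assumes details the activity leaves open: the partner's first guess is 50, the guess always moves by exactly the step size toward the secret, and the simulation gives up after 100 guesses. The function name play and the secret numbers are made up for the example.

```python
def play(secret, step, start=50, tolerance=2, max_guesses=100):
    """Count guesses needed to land within `tolerance` of the secret.

    Returns None if the guess never gets close enough within max_guesses.
    """
    guess = start
    for count in range(1, max_guesses + 1):
        if abs(guess - secret) <= tolerance:
            return count
        # 'too low' means nudge up, 'too high' means nudge down, always by `step`.
        guess += step if guess < secret else -step
    return None

print(play(secret=81, step=10))  # 4: guesses 50, 60, 70, 80
print(play(secret=81, step=2))   # 16: small steps get there, but slowly
print(play(secret=73, step=10))  # None: the guess bounces between 70 and 80
print(play(secret=73, step=2))   # 12
print(play(secret=73, step=50))  # None: the guess bounces between 50 and 100
```

A large step is faster when it happens to land near the answer, but it can also jump back and forth across the answer forever. A small step is slower but, with a tolerance of 2, always arrives. The same trade-off governs the learning rate in gradient descent.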