Backpropagation
Gradient descent needs the gradient dL/dw for every weight w in the network. In a deep network with millions of weights, computing these gradients by brute force — perturbing each weight individually and measuring the change in loss — would be impossibly slow. Backpropagation is the algorithm that computes all gradients simultaneously and efficiently, in one backward pass through the network. It rests entirely on the chain rule of calculus. This lesson derives it carefully, with a worked numerical example.
The Chain Rule, Reviewed
The chain rule states: if y depends on u, and u depends on x, then: dy/dx = (dy/du) * (du/dx) For a longer chain — y depends on u, u depends on v, v depends on x: dy/dx = (dy/du) * (du/dv) * (dv/dx) This is the engine of backpropagation. The loss L depends on the output of the last layer, which depends on the previous layer, which depends on the one before — a long chain of functions. Backpropagation applies the chain rule to decompose dL/dw^(l) into a product of local derivatives at each layer, computed efficiently.
Define delta^(l) as the gradient of the loss with respect to the pre-activation z^(l) at layer l. It is called the error signal or local gradient. Backpropagation works by computing delta^(L) at the output layer first, then propagating it backward: delta^(l) is computed from delta^(l+1) times the weights W^(l+1) times the local derivative of the activation function. Once you have delta^(l), the gradient of L with respect to W^(l) is simply the outer product of delta^(l) and h^(l-1).
Full derivation for a two-layer network. Setup: input x, hidden layer h = sigma(z^(1)), output y_hat = sigma(z^(2)), loss L = (y - y_hat)^2. Step 1: Forward pass (already know how to do this). z^(1) = W^(1) x + b^(1); h = sigma(z^(1)) z^(2) = W^(2) h + b^(2); y_hat = sigma(z^(2)) L = (y - y_hat)^2 Step 2: Output layer error signal (delta^(2)). dL/dy_hat = -2(y - y_hat) dy_hat/dz^(2) = sigma'(z^(2)) delta^(2) = dL/dz^(2) = (dL/dy_hat)(dy_hat/dz^(2)) = -2(y - y_hat) * sigma'(z^(2)) Step 3: Gradient with respect to W^(2). dL/dW^(2) = delta^(2) * h^T [outer product] Step 4: Backpropagate to layer 1. delta^(1) = (W^(2))^T delta^(2) * sigma'(z^(1)) [element-wise multiply by local activation derivative] Step 5: Gradient with respect to W^(1). dL/dW^(1) = delta^(1) * x^T Numerical example (continuing from Lesson 4 values, scalar output for simplicity): y = 1 (true label), y_hat = 0.670 (network output), sigma = sigmoid. Step 2: dL/dy_hat = -2(1 - 0.670) = -2(0.330) = -0.660 sigma'(z^(3)) = y_hat*(1-y_hat) = 0.670*0.330 = 0.221 delta^(3) = -0.660 * 0.221 = -0.146 Step 3: dL/dW^(3) = delta^(3) * h^(2)^T = -0.146 * [0.6, 0.14] = [-0.0876, -0.0204] With eta = 0.1, the update for W^(3) = [1.2, -0.8] becomes: W^(3)_new = [1.2 - 0.1*(-0.0876), -0.8 - 0.1*(-0.0204)] = [1.2 + 0.00876, -0.8 + 0.00204] = [1.20876, -0.79796] The weights shifted slightly toward values that increase the output — correctly, since we want the output closer to 1.
Why Backpropagation Is Efficient
Without backpropagation, computing dL/dw_i for each weight w_i would require re-running the entire forward pass twice (once with w_i, once with w_i + epsilon) — two forward passes per weight. With a million weights, that is two million forward passes per gradient step. Backpropagation computes all gradients in one backward pass, which costs roughly the same as one forward pass. This is a million-fold speedup in a million-parameter network. The reason is reuse. When backpropagating, each intermediate delta^(l) is computed once and reused for all weights in that layer. The chain rule products are accumulated rather than recomputed from scratch for each weight. Backpropagation was introduced in the context of neural networks by Rumelhart, Hinton, and Williams in 1986, in a landmark Nature paper. It made training deep networks computationally feasible and is still the algorithm running inside every modern deep learning framework.
Match each backpropagation step to what it computes.
Terms
Definitions
Drag terms onto their definitions, or click a term then click a definition to match.
Backpropagation multiplies many terms together via the chain rule. If the activation derivative sigma'(z) is small (as it is in saturation regions of sigmoid or tanh), then delta^(l) shrinks exponentially as l decreases. By the time the gradient reaches early layers, it may be so small that those layers barely update. This is the vanishing gradient problem from the activation function perspective — and it explains again why ReLU, with its gradient of 1 for positive activations, dramatically helps deep networks train.
During backpropagation, what does multiplying delta^(l+1) by (W^(l+1))^T achieve?
A deep network has 50 layers all using sigmoid activations. The derivative of sigmoid is at most 0.25. What happens to the gradient at layer 1, starting from a gradient of 1 at layer 50?
Backpropagation by Hand
- Step 1: Build the simplest possible network: one weight w, one bias b = 0, no hidden layer. The network computes y_hat = w * x.
- Step 2: Use x = 3, y = 6 (true value), and start with w = 1. Loss L = (y - y_hat)^2.
- Step 3: Compute y_hat = 1 * 3 = 3. Compute L = (6-3)^2 = 9.
- Step 4: Compute dL/dy_hat = -2(6-3) = -6.
- Step 5: Compute dy_hat/dw = x = 3 (since y_hat = w*x, the derivative with respect to w is x).
- Step 6: By chain rule: dL/dw = (dL/dy_hat)*(dy_hat/dw) = (-6)*(3) = -18.
- Step 7: Update: w_new = w - eta * (dL/dw) = 1 - 0.1*(-18) = 1 + 1.8 = 2.8.
- Step 8: Compute y_hat_new = 2.8 * 3 = 8.4. Loss_new = (6-8.4)^2 = 5.76. Did loss decrease?
- Step 9: Continue for two more steps. What value does w approach?