Gradient Descent and the Learning Rate
You now have a loss function: a number measuring how wrong the network is. The goal of training is to minimize this number by adjusting the weights. But a network may have millions of weights, so you cannot try every possible combination. Gradient descent is the algorithm that makes minimization tractable: it uses the local slope of the loss surface to find a direction that reduces the loss, then takes a small step in that direction. Repeated over many steps, this typically drives the weights toward values that produce accurate predictions.
The Gradient as a Direction
The gradient of a function f(theta) with respect to its parameters theta is a vector of partial derivatives, one entry per parameter. When f is the loss L, the partial derivative dL/dw is the rate at which the loss changes when you increase w slightly while holding every other parameter fixed.
Critical fact: the gradient points in the direction of steepest ascent (increase) of the loss. The opposite direction, the negative gradient, is the direction of steepest descent.
A simple one-dimensional example: suppose the loss as a function of a single weight w is
L(w) = (w - 3)^2
This is minimized at w = 3, where the loss is 0. Suppose we start at w = 7. Then
dL/dw = 2(w - 3) = 2(7 - 3) = 8
The gradient is positive, so the loss rises as w increases. To decrease the loss, we move left, that is, we decrease w:
w_new = w - eta * (dL/dw) = 7 - eta * 8
where eta is the learning rate, a positive scalar controlling step size. With eta = 0.1:
w_new = 7 - 0.1 * 8 = 7 - 0.8 = 6.2
The loss at w = 6.2 is (6.2 - 3)^2 = 10.24, down from (7 - 3)^2 = 16 at w = 7. Next step: dL/dw at w = 6.2 is 2(6.2 - 3) = 6.4, so w_new = 6.2 - 0.1 * 6.4 = 5.56. Each further step moves w closer to 3.
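The worked example can be reproduced in a few lines of Python. This is a minimal sketch, not a training loop for a real network; the function names `loss`, `grad`, and `gradient_descent` are chosen here for illustration.

```python
def loss(w):
    """Quadratic loss from the example: L(w) = (w - 3)^2."""
    return (w - 3) ** 2


def grad(w):
    """Derivative of the loss: dL/dw = 2(w - 3)."""
    return 2 * (w - 3)


def gradient_descent(w, eta, steps):
    """Return every weight visited, starting with the initial w."""
    history = [w]
    for _ in range(steps):
        w = w - eta * grad(w)  # step against the gradient
        history.append(w)
    return history


history = gradient_descent(w=7.0, eta=0.1, steps=25)
for step, w in enumerate(history[:3]):
    print(f"step {step}: w = {w:.4f}, loss = {loss(w):.4f}")
print(f"after 25 steps: w = {history[-1]:.4f}")
```

The first three printed lines match the hand calculation (w = 7, then 6.2, then 5.56), and after 25 steps w is within a few hundredths of 3.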
At each training step, compute the gradient of the loss with respect to every parameter, then update every parameter by subtracting a small multiple of its gradient:
theta_new = theta_old - eta * (dL/dtheta)
This is the gradient descent update. The learning rate eta is a hyperparameter: it is set by the practitioner before training, not learned. Every weight and bias in the network is updated this way simultaneously.
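To make "every parameter is updated simultaneously" concrete, here is a small sketch that applies the update rule to a list of parameters. The two-parameter loss L(a, b) = (a - 1)^2 + (b + 2)^2 and the names `gradient_step` and `grads_of` are assumptions made for illustration; a real network would get its gradients from backpropagation.

```python
def gradient_step(params, grads, eta):
    """Apply theta_new = theta_old - eta * grad to every parameter at once."""
    return [p - eta * g for p, g in zip(params, grads)]


def grads_of(params):
    """Gradient of the illustrative loss L(a, b) = (a - 1)^2 + (b + 2)^2."""
    a, b = params
    return [2 * (a - 1), 2 * (b + 2)]


params = [4.0, 0.0]
for _ in range(50):
    # All gradients are computed from the old parameters before any update.
    params = gradient_step(params, grads_of(params), eta=0.1)

print(params)  # close to the minimizer [1.0, -2.0]
```

Computing all gradients first and only then applying the step is what "simultaneously" means here: no parameter's update sees another parameter's new value within the same step.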
The learning rate eta is one of the most important hyperparameters in training. Using the example above (L(w) = (w - 3)^2, starting at w = 7, where the gradient is 8):
- Too large (e.g., eta = 2.0): w_new = 7 - 2.0 * 8 = 7 - 16 = -9. Now the loss is (-9 - 3)^2 = 144, worse than the starting loss of 16. A very large learning rate makes each update overshoot the minimum, so w bounces between ever larger values and the loss can diverge (grow without bound).
- Too small (e.g., eta = 0.0001): w_new = 7 - 0.0001 * 8 = 6.9992. Progress is negligible: even this one-dimensional example would need tens of thousands of steps to get close to w = 3. On large datasets, that means impractically long training times.
- Just right: eta = 0.1 in our example makes reasonable progress each step.
The right value depends on the problem. In practice, values like 0.01, 0.001, or 0.0001 are common starting points for deep networks, and learning rate schedulers often reduce eta automatically during training as the model approaches convergence.
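The three regimes can be compared side by side with a short script. This sketch reuses the example loss L(w) = (w - 3)^2 and the starting point w = 7; the helper name `run` is hypothetical.

```python
def run(eta, steps, w=7.0):
    """Run gradient descent on L(w) = (w - 3)^2 and return the final weight."""
    for _ in range(steps):
        w = w - eta * 2 * (w - 3)  # dL/dw = 2(w - 3)
    return w


for eta in (2.0, 0.0001, 0.1):
    w = run(eta, steps=10)
    print(f"eta = {eta}: w = {w:.4f}, loss = {(w - 3) ** 2:.4f}")
```

After ten steps, eta = 2.0 has thrown w hundreds of thousands of units away from the minimum, eta = 0.0001 has barely moved from 7, and eta = 0.1 has brought w to within about half a unit of 3.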
Stochastic and Mini-Batch Gradient Descent
Full-batch gradient descent computes the gradient over the entire training set before taking one step. With millions of examples, every single update requires a full pass over the data, which is usually too slow to be practical.
Stochastic gradient descent (SGD) instead uses one training example at a time, computing a noisy estimate of the gradient. Each step is cheap but noisy: the estimated gradient varies a lot from example to example.
Mini-batch gradient descent is the practical compromise: compute the gradient on a small batch (typically 32 to 512 examples), then update. The batch estimate is less noisy than single-example SGD, and each step is far cheaper than a full-batch step. When practitioners say 'SGD,' they almost always mean mini-batch SGD.
The noise in mini-batch gradients is not purely a drawback. It is widely thought to act as a form of regularization, helping the model escape sharp minima and settle in flatter regions that tend to generalize better.
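The following sketch shows mini-batch SGD on a tiny linear regression problem, using only the Python standard library. The synthetic data (y = 2x + 1 plus small noise), the learning rate, the epoch count, and the function names are all assumptions chosen for illustration, not values from the text.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
xs = [random.uniform(-1, 1) for _ in range(1000)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in xs]


def batch_gradient(w, b, batch):
    """Gradient of the mean squared error over one batch, w.r.t. w and b."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def train(data, batch_size, eta=0.1, epochs=20):
    """Fit y = w * x + b with mini-batch gradient descent."""
    w, b = 0.0, 0.0
    data = list(data)  # copy so shuffling does not reorder the caller's list
    for _ in range(epochs):
        random.shuffle(data)  # new batch composition every epoch
        for i in range(0, len(data), batch_size):
            gw, gb = batch_gradient(w, b, data[i:i + batch_size])
            w -= eta * gw
            b -= eta * gb
    return w, b


w, b = train(data, batch_size=32)
print(f"w = {w:.2f}, b = {b:.2f}")  # should land near w = 2, b = 1
```

In this sketch, `batch_size=1` gives single-example SGD and `batch_size=len(data)` gives full-batch gradient descent, so the same loop covers all three variants.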
In high dimensions, the loss surface is complex. Pure local minima (where every direction goes up) are actually rare — most problematic flat regions are saddle points, where some directions go up and some go down. Gradient descent can slow down near saddle points, but the stochasticity of mini-batch training often helps escape them. True local minima that are significantly worse than the global minimum appear to be uncommon for large overparameterized networks.
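A toy illustration of the saddle-point behavior, under strong simplifying assumptions: the function f(x, y) = x^2 - y^2 has a saddle at (0, 0), going up along x and down along y. It is not a real loss (it has no minimum, so y keeps growing once it escapes), and the Gaussian noise added to the gradient is only a stand-in for mini-batch noise. The names `grad` and `descend` are hypothetical.

```python
import random


def grad(x, y):
    """Gradient of f(x, y) = x^2 - y^2, which has a saddle point at (0, 0)."""
    return 2 * x, -2 * y


def descend(x, y, eta=0.1, steps=50, noise=0.0, seed=0):
    """Gradient descent with optional Gaussian noise added to each gradient."""
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Noise stands in for the randomness of mini-batch gradient estimates.
        gx += rng.gauss(0, noise)
        gy += rng.gauss(0, noise)
        x, y = x - eta * gx, y - eta * gy
    return x, y


print(descend(1.0, 0.0))              # exact gradients
print(descend(1.0, 0.0, noise=0.01))  # noisy gradients
```

Started exactly on the line y = 0, the noise-free run slides along x into the saddle and stays there, because the y-gradient is exactly zero. With a little noise, y is nudged off zero and the downhill direction then amplifies it step after step, carrying the iterate away from the saddle.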
A network has a parameter w = 5. The gradient dL/dw = -3. With learning rate eta = 0.1, what is w after one gradient descent step, and in which direction did w move?
Why do practitioners use mini-batches instead of computing the gradient over the full training set?
Learning Rate Exploration
- Step 1: Consider the loss function L(w) = (w - 2)^2 with starting weight w_0 = 8.
- Step 2: Using learning rate eta = 0.5, compute w after 4 gradient descent steps. Record the loss at each step. Does it converge?
- Step 3: Repeat with eta = 1.0. What happens?
- Step 4: Repeat with eta = 0.1. How many more steps would you need to reach w ≈ 2?
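- Step 5: Sketch a graph of loss versus step for each learning rate. Label which is 'converging in one step,' which is 'oscillating,' and which is 'converging steadily.' None of these three rates diverges on this loss; to see divergence, try a rate above 1.0, such as eta = 1.1.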