Why Go Deep
Imagine trying to recognize a human face using only a checklist of 100 pixel brightness values. You would struggle immediately, because a face is not just 100 numbers — it is eyes, a nose, a jawline, expressions, lighting conditions. The insight behind deep learning is that instead of hand-crafting features, you build a system that learns features automatically, layer by layer, from raw data. This lesson asks the foundational question: why does stacking many layers make that possible?
The Limits of Shallow Models
A shallow model, one with a single layer of adjustable parameters, computes a linear combination of its raw inputs and then applies one nonlinearity. Formally, if the input is a vector x of dimension d, a single-layer model computes f(x) = sigma(Wx + b), where W is a weight matrix, b is a bias vector, and sigma is an activation function. The problem is expressive power. A single linear transformation followed by one nonlinearity can only carve the input space into relatively simple regions. Adding one hidden layer between input and output helps in principle: the classic mathematical result (the Universal Approximation Theorem) says that a network with a single hidden layer can approximate any continuous function on a closed, bounded input domain to arbitrary accuracy. The catch is that it may need an exponentially large number of neurons to do so. In practice, width alone does not scale: the parameter count becomes unmanageable, and a model that large often generalizes poorly from training data to new examples.
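The limit of a single layer can be checked on a toy problem. The sketch below uses XOR, which is an illustrative choice and not an example from this lesson, and the grid of candidate weights and the hand-picked two-layer weights are assumptions as well. It searches a coarse grid of weights for a single unit f(x) = sigma(w . x + b) that classifies XOR correctly, then compares that with a small two-layer network.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def single_layer(x, w, b):
    """One output unit: f(x) = sigma(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# XOR truth table (illustrative target, not taken from the lesson).
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def fits_xor(predict):
    """True if thresholding predict(x) at 0.5 matches XOR on all four inputs."""
    return all((predict(x) > 0.5) == bool(y) for x, y in XOR.items())

# Coarse grid over (w1, w2, b): values from -10.0 to 10.0 in steps of 0.5.
grid = [v / 2 for v in range(-20, 21)]
single_layer_solutions = sum(
    fits_xor(lambda x, w=(w1, w2), b=b: single_layer(x, w, b))
    for w1, w2, b in product(grid, repeat=3)
)

def two_layer(x):
    """Hand-picked weights: hidden units approximate OR and AND."""
    h_or = single_layer(x, (20, 20), -10)
    h_and = single_layer(x, (20, 20), -30)
    # Output fires when OR is on and AND is off, which is XOR.
    return single_layer((h_or, h_and), (20, -20), -10)

print("single-layer settings that fit XOR:", single_layer_solutions)
print("two-layer network fits XOR:", fits_xor(two_layer))
```

No setting on the grid works, because a single unit can only split the plane with one straight line and XOR is not linearly separable. The grid search only illustrates that fact; it does not prove it.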
Representation learning is the process by which a model discovers, automatically, the internal features most useful for a task. Deep networks excel at this because each layer can build on the features learned by the layer below, progressively constructing more abstract and useful representations from raw input.
Consider image recognition step by step. Roughly speaking, the first layer of a deep convolutional network learns to detect edges, places where pixel brightness changes sharply. The second layer combines edges into corners and curves. The third combines those into object parts: eyes, wheels, handles. The fourth recognizes whole objects. No human programmed these features; they emerged from training because each level of abstraction is useful for the task.
This compositionality is the key. As a rough estimate, if you need to represent N distinct features at each of L layers, a deep network needs on the order of N * L units. A single-layer network trying to represent the same functions directly may need on the order of N^L units, an exponential blow-up. Depth is the more efficient way to spend computation.
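The scaling argument above can be made concrete with a few lines of arithmetic. This is only a sketch of the rough estimates in the text (N * L for a deep network, N^L for a flat one); the function names and the sample values of N and L are hypothetical, and real networks do not follow these formulas exactly.

```python
def deep_size(n_features, n_layers):
    """Rough size estimate for a deep network: N features at each of L layers."""
    return n_features * n_layers

def shallow_size(n_features, n_layers):
    """Rough size estimate for a flat network covering the same combinations."""
    return n_features ** n_layers

n_features = 10  # illustrative value for N
for n_layers in (2, 4, 8):
    print(
        f"L={n_layers}: deep ~ {deep_size(n_features, n_layers)}, "
        f"shallow ~ {shallow_size(n_features, n_layers)}"
    )
```

With N = 10, the deep estimate grows from 20 to 80 as L goes from 2 to 8, while the shallow estimate grows from 100 to 100,000,000.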
What Each Layer Learns
We can think of each layer as a learned transformation of its input. Let h^(l) denote the output (hidden state) of layer l. Then:
h^(1) = sigma(W^(1) x + b^(1)) [first hidden layer]
h^(2) = sigma(W^(2) h^(1) + b^(2)) [second hidden layer]
...
y_hat = sigma(W^(L) h^(L-1) + b^(L)) [output layer]
Each successive h^(l) is a new representation of the data: the same underlying example, but expressed in a space that makes the final prediction easier. The genius of backpropagation (covered in Lesson 7) is that all of these transformations are learned jointly, end-to-end, by optimizing a single loss function on the final output.
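The layer-by-layer recurrence above can be sketched in plain Python. The weights, layer sizes, and input below are made-up values chosen only to show the shapes involved, and sigma is assumed to be the logistic sigmoid; a trained network would learn W and b instead of having them written by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, h):
    """Compute sigma(W h + b) for one layer; W is a list of rows."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, h)) + bias)
        for row, bias in zip(W, b)
    ]

def forward(x, params):
    """Return [x, h^(1), ..., h^(L-1), y_hat], one entry per representation."""
    states = [x]
    for W, b in params:
        states.append(layer(W, b, states[-1]))
    return states

# Hypothetical parameters for a 3 -> 2 -> 2 -> 1 network.
params = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # W^(1), b^(1)
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.2]),            # W^(2), b^(2)
    ([[0.7, -0.3]], [0.05]),                             # W^(3), b^(3)
]

x = [1.0, 0.0, -1.0]
states = forward(x, params)
y_hat = states[-1]

for index, state in enumerate(states):
    print(f"representation {index}: {state}")
```

Each entry in `states` is the same example re-expressed by one more layer, which is the sense in which every h^(l) is a new representation of the data.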
Each layer is a learned lens that reframes the data. The raw input is transformed into progressively higher-level descriptions until the final layer can make a confident prediction. This is why 'deep' learning is not just marketing: the depth is load-bearing.
Flashcards
- Why does a single-layer network face a practical problem even though the Universal Approximation Theorem says it can represent any function?
- In a deep image-recognition network, what does it mean to say the second layer 'builds on' the first?
Representation Ladder
- Step 1: Choose a concept that has clear levels of abstraction — for example, written language (letters → words → sentences → paragraphs → essays) or music (notes → chords → phrases → sections → compositions).
- Step 2: Draw a diagram with one level per stage of abstraction (four or five levels), where each level is labeled with what the features at that level represent.
- Step 3: For each transition between levels, write one sentence describing what 'combining' the lower level produces at the higher level.
- Step 4: Discuss: if you had to detect the highest-level concept (essay quality, musical style) directly from the lowest level (individual letters, raw audio samples), what would be lost?