Sequence Models
Language, music, stock prices, protein sequences, and sensor readings all share a property that images do not: order matters. The sentence 'The dog bit the man' means something entirely different from 'The man bit the dog' — the same words, rearranged. A neural network that treats each word independently, ignoring position, cannot capture this. Sequence models are architectures designed to process ordered data and learn from the relationships between elements across time.
Recurrent Neural Networks
A recurrent neural network (RNN) processes a sequence one element at a time. At each step t, the network receives the current input x_t and its own previous output — called the hidden state h_{t-1} — and produces a new hidden state h_t. Formally: h_t = f(W_h · h_{t-1} + W_x · x_t + b) where W_h and W_x are learned weight matrices, b is a bias vector, and f is a nonlinear activation (typically tanh). The hidden state acts as a compressed memory of everything the network has seen so far in the sequence. This design is powerful in principle: a single set of weights is shared across every timestep (analogous to weight sharing in CNNs), and the hidden state can, in theory, carry information from the first word of a sentence to the last. After processing all T timesteps, a final layer typically maps h_T to a prediction — for example, classifying a sentence as positive or negative sentiment. For tasks requiring an output at every timestep — such as tagging each word in a sentence with its part of speech — the network produces y_t at each step alongside h_t. For tasks requiring a variable-length output — such as machine translation — an encoder RNN reads the entire source sentence and produces a context vector, then a decoder RNN uses that vector to generate the target sentence word by word. This encoder-decoder pattern was the dominant architecture for sequence-to-sequence tasks from roughly 2014 to 2017.
In an RNN, the hidden state h_t is the entire network's memory of the sequence so far. It must compress everything relevant about x_1, x_2, …, x_t into a fixed-size vector. This is powerful but creates a bottleneck: the longer the sequence, the harder it is to retain early information.
Two practical problems limit vanilla RNNs on long sequences. Vanishing gradients: During training, gradients are propagated backward through time using the chain rule. If the sequence is long, the gradient signal must pass through many repeated matrix multiplications. If the dominant eigenvalue of W_h is less than 1, the gradient shrinks exponentially with sequence length — effectively becoming zero before it reaches the early timesteps. The network cannot learn long-range dependencies because the training signal vanishes. Exploding gradients: The symmetric problem — if the dominant eigenvalue exceeds 1, gradients grow exponentially, causing numerical instability. Practitioners apply gradient clipping (capping the gradient norm at a threshold) as a workaround, but this does not solve vanishing gradients. LSTMs (Long Short-Term Memory networks, Hochreiter & Schmidhuber 1997) address the vanishing gradient problem through a gating mechanism. An LSTM maintains two state vectors: the hidden state h_t (same role as before) and a cell state c_t (a second, slower-changing memory lane). Three learned gates — input gate, forget gate, and output gate — control how much new information enters the cell, how much old information is forgotten, and how much of the cell state flows into the hidden state. Because the cell state can flow across many timesteps with minimal multiplicative distortion, gradients propagate more reliably through long sequences. GRUs (Gated Recurrent Units) offer a simplified two-gate variant that often performs comparably with fewer parameters. Despite these improvements, RNNs are fundamentally sequential: step t cannot be computed until step t-1 is finished. This prevents parallelism across timesteps, making training slow on modern GPU hardware designed for parallel computation. This limitation directly motivated the Transformer architecture.
Flashcards — click each card to reveal the answer
Limits That Motivated What Came Next
By 2017 LSTMs were the state of the art for language tasks, but researchers had identified two persistent weaknesses. The fixed-size bottleneck: In an encoder-decoder model, the entire source sentence must fit into the context vector — a vector of perhaps 512 numbers. Translating a 50-word sentence forces 50 words of meaning into this fixed container. Long sentences degrade in quality because some information is inevitably compressed away. Attention mechanisms (covered in the next lesson) were introduced specifically to break this bottleneck. Sequential computation: Because each step depends on the previous step's hidden state, you cannot compute step 5 until step 4 is done. For a sentence of length 512, that is 512 sequential operations. Modern GPUs execute thousands of operations simultaneously, so feeding them a chain of sequential dependencies underutilizes the hardware. The Transformer replaces sequential recurrence with parallel attention, fully leveraging GPU parallelism. Understanding these limitations matters: when you see a paper claim 'we replaced RNNs with Transformers,' you now know the two concrete problems — the bottleneck and the sequential constraint — that motivated the replacement.
For tasks with genuinely short sequences, strict resource constraints (edge devices), or online streaming requirements (processing one token at a time in real time), RNNs and LSTMs remain practical choices. A Transformer that processes a whole sequence at once cannot easily handle a stream of sensor readings arriving one-at-a-time without modification.
Why does the vanishing gradient problem specifically hurt an RNN's ability to learn long-range dependencies?
An LSTM processes a 200-word document. What structural feature allows it to carry relevant information from word 1 all the way to word 200 more reliably than a vanilla RNN?
Trace the Hidden State by Hand
- Step 1. Write a five-word sentence: 'The quick brown fox jumps.'
- Step 2. Model each word as a single number: assign The=1, quick=2, brown=3, fox=4, jumps=5.
- Step 3. Use a simple scalar RNN: h_t = 0.5 * h_{t-1} + 0.5 * x_t, starting with h_0 = 0.
- Step 4. Compute h_1 through h_5 by hand.
- Step 5. Notice how much influence the first word (The=1) has on h_5 compared with the last word (jumps=5).
- Step 6. Now repeat with a weight of 0.9 instead of 0.5 on the previous state. How does the influence of early words change?
- Step 7. Discuss: what does this tell you about how the choice of recurrence weight affects long-range memory?