Architectures and Why They Differ
Everything you have studied so far — neurons, weights, biases, activations, loss, gradient descent, and backpropagation — applies to every neural network regardless of what it does. But the specific way you connect neurons into layers, and which layers you choose, varies enormously across tasks. A network that classifies images should not look the same as one that translates English to French. The structure of the data — how information is arranged spatially or temporally — should inform the structure of the network. This lesson surveys three major architectural families, explains the design principles behind each, and shows why choosing the right architecture for a task is as important as choosing the right training procedure.
Fully Connected Networks and Their Limits
The architecture you have been studying, in which every neuron in one layer is connected to every neuron in the next, is called a fully connected or dense network. It makes no assumptions about the structure of the input. For images, this becomes a problem. A 224×224 color image has 224 × 224 × 3 = 150,528 input values. A first layer with 1,000 neurons would have 150,528 × 1,000 = 150,528,000 weights (about 150.5 million) in that layer alone. Worse, a fully connected network treats every pixel as an unrelated input, with no built-in notion of which pixels are neighbors, even though pixels that are spatially close together are highly related. A cat's ear is a local pattern; knowing that pixel (100, 100) is dark matters more in the context of the pixels around it than in relation to a pixel at (10, 200). A fully connected network can learn spatial structure only by brute force: given enough examples, it may eventually discover that nearby pixels are related. But this is wildly inefficient compared to building that prior knowledge into the architecture.
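The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative, and the memory estimate assumes each weight is stored as a 32-bit float, which the lesson does not specify.

```python
# Parameter count for the first fully connected layer in the example above.
height, width, channels = 224, 224, 3
hidden_neurons = 1_000

input_values = height * width * channels        # one value per pixel per color channel
dense_weights = input_values * hidden_neurons   # one weight per (input, neuron) pair
dense_biases = hidden_neurons                   # one bias per neuron

# Assumption: weights stored as 32-bit floats (4 bytes each).
weight_megabytes = dense_weights * 4 / 1e6

print(f"input values:  {input_values:,}")       # 150,528
print(f"dense weights: {dense_weights:,}")      # 150,528,000
print(f"dense biases:  {dense_biases:,}")       # 1,000
print(f"weight memory: {weight_megabytes:.0f} MB")  # 602 MB
```

Every one of those weights has to be learned from data, which is why the size of the first layer alone is a practical obstacle.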
An architectural prior is an assumption about data structure that is built into the network's design before any training begins. CNNs encode the prior that patterns are spatially local and can appear anywhere in the image. Recurrent networks encode the prior that data is sequential and that each element depends on what came before. Transformers encode the prior that any position may be relevant to any other. Choosing an architecture means choosing which priors to embed. The right priors can dramatically reduce the amount of data needed to learn.
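To make the CNN prior concrete, here is a minimal sketch in plain Python of one small filter reused at every position of an image. The 6×6 image, the hand-picked vertical-edge filter, and the function names are all illustrative assumptions; in a real CNN the filter weights are learned, and libraries implement this operation far more efficiently.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map.

    The same kernel weights are reused at every position: this is weight sharing.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum over the local patch whose top-left corner is (r, c).
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def image_with_edge(edge_col, size=6):
    """Hypothetical test image: 0 to the left of `edge_col`, 1 from it onward."""
    return [[1 if c >= edge_col else 0 for c in range(size)] for _ in range(size)]


# Hand-picked filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

map_a = slide_filter(image_with_edge(2), vertical_edge)
map_b = slide_filter(image_with_edge(3), vertical_edge)

print(map_a[0])  # [3, 3, 0, 0]
print(map_b[0])  # [0, 3, 3, 0]
```

Moving the edge one column to the right moves the response one column to the right, with no change to the nine filter weights. A fully connected layer would need separate weights to recognize the edge at each location.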
Convolutional Neural Networks (CNNs): For Spatial Data
A convolutional layer applies a small grid of weights, called a filter or kernel, repeatedly across the input image. If the filter is 3×3, it looks at a 3×3 patch of pixels, computes a weighted sum, and produces one output value. It then slides one step and repeats, covering the entire image. This produces a 2D output called a feature map.
The key insight is weight sharing: the same filter weights are used at every position in the image. A filter that detects vertical edges learns those weights once, then applies the same detector everywhere. This reduces the parameter count dramatically (a 3×3 filter has only 9 weights per input channel) and means the same feature is detected wherever it appears in the image. Strictly speaking, the feature map shifts along with the input; pooling and deeper layers are what make the final prediction largely insensitive to position, the property usually called translation invariance.
A CNN stacks multiple convolutional layers. Early layers detect low-level features (edges, colors, textures). Deeper layers combine these into high-level features (shapes, object parts). After several convolutional layers, a final fully connected layer produces the classification output. This architecture is why image recognition, once extremely difficult for computers, became practical after 2012, when the AlexNet CNN dramatically outperformed all prior methods on ImageNet.
Recurrent Neural Networks (RNNs): For Sequential Data
Text, audio, time-series sensor readings: these inputs have an inherent order. What came before matters for interpreting what comes now. A fully connected network has no built-in notion of order; a recurrent network carries information from one position to the next. At each step in a sequence, an RNN takes the current input and a hidden state vector, which encodes information from all previous steps, and produces a new hidden state. This is a loop in the computation graph: the network processes the sequence one element at a time, carrying a summary of its history forward.
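The hidden-state loop just described can be sketched in plain Python. This is a minimal illustration that assumes a tanh activation and small hand-picked weights; the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh`, and `b` are hypothetical, and a real RNN would learn its weights by backpropagation.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One recurrent step: combine the current input with the previous hidden state."""
    h_new = []
    for k in range(len(h_prev)):
        total = b[k]
        total += sum(W_xh[k][i] * x[i] for i in range(len(x)))            # input contribution
        total += sum(W_hh[k][j] * h_prev[j] for j in range(len(h_prev)))  # history contribution
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(sequence, W_xh, W_hh, b):
    """Process a sequence one element at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)  # start with an empty summary of the past
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b)  # the same weights are reused at every step
        states.append(h)
    return states


# Hypothetical weights: 1 input feature, 2 hidden units.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]

# Same values, different order: the final hidden state depends on when the 1.0 arrived.
early = run_rnn([[1.0], [0.0], [0.0]], W_xh, W_hh, b)
late = run_rnn([[0.0], [0.0], [1.0]], W_xh, W_hh, b)

print("final state, signal early:", early[-1])
print("final state, signal late: ", late[-1])
```

The two sequences contain the same values in a different order, yet end in different hidden states, which is exactly the order sensitivity a recurrent network is built to capture.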
RNNs struggle with long-range dependencies — information from many steps back tends to vanish (the vanishing gradient problem again, here across time steps rather than layers). Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) were designed to address this by adding explicit gating mechanisms that control what information is retained or discarded.
Transformers: Attention as Architecture
As of the mid-2020s, the transformer architecture dominates natural language processing and is rapidly displacing CNNs and RNNs in many other domains. Understanding why requires understanding the attention mechanism. The core problem with RNNs is their sequential nature: to process position 100 in a sequence, you must first process positions 1 through 99. This prevents parallelization across positions and makes long-range dependencies hard to learn.
Transformers drop step-by-step recurrence entirely (word order is instead supplied as extra positional information added to the input). They use self-attention: every position in the sequence looks at every other position directly and computes a weighted sum of their representations, where the weights (attention scores) reflect relevance. For a sentence like 'The trophy did not fit in the suitcase because it was too big,' the word 'it' needs to be linked to 'trophy', which is several positions away. A trained transformer can compute attention scores between 'it' and every other word and assign a high score to 'trophy,' capturing that relationship in one step, regardless of distance.
Because all positions are processed in parallel, transformers are faster to train than RNNs on modern hardware. They have also scaled remarkably well: larger transformers with more parameters trained on more data have tended to keep improving. The large language models you may interact with, including the one generating these lessons, are transformers.
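The attention computation can be sketched in a few lines of plain Python. This is a deliberately simplified version: it uses each position's vector directly as its query, key, and value, whereas real transformers apply learned projections, use multiple attention heads, and add positional information. The four 2-dimensional vectors are made-up stand-ins for word representations, not real embeddings.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Simplified self-attention: queries, keys, and values are all X itself."""
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        # Score this position against every position, near or far, in one step.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        # New representation: weighted sum of all positions' vectors.
        output = [sum(w * value[i] for w, value in zip(weights, X)) for i in range(d)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors: position 3 resembles position 0, three steps away.
X = [[2.0, 0.0],
     [0.0, 2.0],
     [0.0, 1.5],
     [1.8, 0.2]]

outputs, weights = self_attention(X)
most_attended = weights[3].index(max(weights[3]))

print("attention weights for position 3:", [round(w, 3) for w in weights[3]])
print("position 3 attends most to position", most_attended)  # 0
```

Each pass through the outer loop is independent of the others, which is why all positions can be computed in parallel on suitable hardware.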
The transformer architecture was proposed in 2017. By 2023 it had displaced RNNs in language and was making inroads in vision (Vision Transformers, ViT), audio, biology, and more. Architecture research moves quickly, and what is dominant today may not be in five years. The underlying principles — priors, efficiency, gradient flow, representational capacity — remain stable even as specific architectures evolve. Study the principles; watch the architectures.
Prompt Challenge
Write a prompt asking an AI assistant to explain to a high-school student why a convolutional neural network is a better choice than a fully connected network for image classification. Your prompt should result in a clear, technically accurate explanation.
Your prompt should…
- Ask about convolutional networks for image classification
- Mention weight sharing or filters as part of the explanation
- Tell the assistant the audience is high school students
Why does weight sharing in a convolutional layer reduce the parameter count compared to a fully connected layer?
What fundamental advantage does the transformer's self-attention mechanism have over a recurrent network for processing sequences?
Choose an Architecture and Justify It
- For each of the three scenarios below, select which architecture family (fully connected, CNN, RNN/transformer) is most appropriate, and write two to three sentences justifying your choice based on the structure of the data.
- Scenario 1: Predicting whether a patient will develop a disease based on a fixed set of 30 blood test values. There is no spatial or temporal structure.
- Scenario 2: Classifying satellite images of land into categories (forest, urban, farmland, water). The images are 256×256 pixels.
- Scenario 3: Transcribing spoken audio to text. The input is a time series of sound measurements at 16,000 samples per second.
- For each scenario, also name one specific challenge you would anticipate (data size, length of sequence, class imbalance, etc.) and how the architecture choice helps or does not fully address it.