Training at Scale
The gap between a textbook neural network and a modern large language model is not primarily one of algorithmic novelty. The core ideas — layers, gradients, backpropagation — have been understood since the late 1980s. The gap is one of scale: more data, more parameters, and more compute, organized carefully so that training remains numerically stable, financially feasible, and completed in weeks rather than decades. This lesson is about the engineering and mathematics that make scale possible.
Data at Scale
Modern large models are trained on datasets of staggering size. GPT-3 was trained on roughly 300 billion tokens (a token is approximately 0.75 English words). LLaMA 2 used two trillion tokens. These datasets are assembled from web crawls (Common Crawl contains petabytes of text scraped from the public internet), books, code repositories, and curated sources.
Raw web data is noisy. Preprocessing pipelines apply four main steps: language identification (keeping English or other target languages), deduplication (removing exact and near-duplicate documents that would push the model to memorize rather than generalize), quality filtering (removing spam, gibberish, and adult content), and tokenization (splitting text into subword units using algorithms like Byte-Pair Encoding, which balances vocabulary size against coverage of rare words).
For images, datasets such as LAION-5B contain more than five billion image-text pairs, also scraped from the web. Medical AI datasets require careful curation, IRB approval, and de-identification. The quality and composition of the training data shape model behavior at least as much as architectural choices, a fact sometimes called the 'data-centric AI' perspective.
Data pipelines must be able to stream data faster than the GPUs can consume it. A single A100 GPU has a peak of roughly 312 trillion 16-bit floating-point operations per second on its tensor cores, so a slow disk or network becomes the bottleneck almost immediately. Production training stacks use high-bandwidth storage (NVMe SSDs, parallel file systems), prefetching, and careful overlapping of data loading with compute.
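To make the deduplication step concrete, here is a minimal sketch that drops exact duplicates after light normalization. The function names `normalize` and `deduplicate` and the sample documents are hypothetical, chosen only for illustration. Catching true near-duplicates at web scale generally needs fuzzy techniques such as MinHash, which this sketch does not attempt.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(documents):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


# Hypothetical sample documents (not from any real crawl).
docs = [
    "The cat sat on the mat.",
    "the  cat sat on the   mat.",  # same text, different case and spacing
    "A completely different page.",
]
unique_docs = deduplicate(docs)
print(len(unique_docs))  # 2
```

Hashing keeps memory use proportional to the number of unique documents rather than to their total length, which is one reason hash-based filters are a common first pass before more expensive fuzzy matching.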
A model trained on one trillion tokens of carefully filtered text often outperforms a model trained on two trillion tokens of raw, unfiltered data. More data is not always better — the distribution, cleanliness, and diversity of the data matter as much as the volume.
Hardware: GPUs and TPUs are the engines of deep learning. A GPU (Graphics Processing Unit) contains thousands of simple processing cores optimized for the parallel floating-point arithmetic that matrix multiplication requires. NVIDIA's A100 and H100 GPUs have become the standard training accelerators; Google's TPUs (Tensor Processing Units) are custom ASICs designed specifically for neural network matrix operations.
Precision matters for performance. Full 32-bit floating point (FP32) provides high numerical precision but uses more memory and is slower. Mixed-precision training uses 16-bit floats (FP16 or BF16) for most operations while keeping accumulations and a master copy of the weights in FP32. With FP16, loss scaling (multiplying the loss by a large constant before the backward pass) keeps small gradients from underflowing to zero. Compared with pure FP32, this roughly doubles throughput and roughly halves activation memory, with negligible accuracy loss.
Batch size determines how many training examples contribute to each weight update. Large batches parallelize well on GPUs, but their gradient estimates carry very little noise, which can harm generalization. Gradient accumulation produces an effectively large batch by summing gradients across several forward-backward passes before updating the weights, which is useful when device memory fits only a small batch.
Distributed training: a single GPU cannot hold a 70-billion-parameter model's weights, gradients, and optimizer state (the Adam optimizer stores two additional values per parameter). Three parallelism strategies address this.
Data parallelism: each GPU holds a complete model copy and processes a different slice of the batch. Gradients are averaged across GPUs after each step. This scales to hundreds of GPUs, provided the model fits on one GPU.
Model parallelism (tensor parallelism): the weight matrices of each layer are split across GPUs. Each GPU computes part of each matrix multiplication, then the GPUs exchange partial results. This requires fast, low-latency GPU interconnects (NVLink, InfiniBand).
Pipeline parallelism: different layers of the model live on different GPUs. A micro-batch flows through GPU 1 (layers 1-10), then GPU 2 (layers 11-20), and so on in a pipeline. GPUs can process different micro-batches simultaneously, but care is needed to minimize 'pipeline bubbles' (idle time while waiting for upstream results).
In practice, large model training combines all three, a technique called 3D parallelism.
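The claim that gradient accumulation reproduces a large batch can be checked on a toy problem. The sketch below assumes a one-parameter model `w * x` with squared-error loss; the data values, `micro_batch_size`, and the helper `grad_sum` are all made up for illustration. The same arithmetic underlies data parallelism, where each equally sized micro-batch would live on a different GPU and the sums would be combined over the interconnect.

```python
def grad_sum(w, xs, ys):
    """Sum of per-example gradients of (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


# Hypothetical toy data and starting weight.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.8, 8.1]
w = 0.3
micro_batch_size = 2

# Reference: mean gradient over the full batch in a single pass.
full_batch_grad = grad_sum(w, xs, ys) / len(xs)

# Gradient accumulation: several small passes, one combined gradient at the end.
accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    micro_x = xs[start:start + micro_batch_size]
    micro_y = ys[start:start + micro_batch_size]
    accumulated += grad_sum(w, micro_x, micro_y)
accumulated_grad = accumulated / len(xs)

print(full_batch_grad, accumulated_grad)  # equal up to floating-point rounding
```

Only the gradient sum is carried between passes, so peak memory is set by the micro-batch, not by the effective batch.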
Match each distributed training strategy to what it divides across GPUs.
Stability and the Learning Rate Schedule
Training large models is numerically fragile. Loss spikes (sudden large increases in training loss) occur and must be handled without discarding days of computation. Practitioners monitor training loss curves closely, use gradient clipping to limit the damage, and roll back to an earlier checkpoint when a spike does not recover.
The learning rate (how large a step the optimizer takes at each update) is critical. Too high and the weights overshoot the optimum and diverge; too low and training is glacially slow. Modern practice uses a learning rate warmup: for the first few thousand steps, the learning rate increases linearly from near zero to a target value. This prevents early instability while the model weights are still random. After warmup, a cosine decay schedule gradually reduces the learning rate over the rest of training, allowing fine-grained convergence near the end.
The AdamW optimizer dominates large-model training. AdamW is Adam (Adaptive Moment Estimation) with decoupled weight decay. Adam maintains per-parameter running averages of the gradient (first moment) and the squared gradient (second moment), and divides each update by the square root of the second moment. This adapts the effective learning rate to each parameter individually: parameters with consistently large gradients take proportionally smaller steps, preventing them from dominating. Weight decay shrinks every weight slightly toward zero at each step, which keeps parameters from growing unboundedly. In AdamW the decay is applied directly to the weights rather than added to the gradient as an L2 penalty, so it is not rescaled by the adaptive normalization.
Checkpointing means periodically saving the model weights, optimizer state, and training position (every few hundred steps in large training runs). If a training job crashes, a common occurrence over multi-week runs, training resumes from the latest checkpoint rather than starting over.
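The warmup-then-cosine schedule described above can be written as a small function of the step number. This is a sketch under assumptions: the function name `learning_rate`, the peak value of 3e-4, the 2,000 warmup steps, and the 100,000 total steps are illustrative choices, not values taken from any particular training run.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical hyperparameters for illustration.
PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 100_000

for step in (0, 1_999, 51_000, 100_000):
    print(step, learning_rate(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS))
```

The cosine shape spends a long time near the peak rate, then falls quickly through the middle of training, and flattens out again near the end, which is where the fine-grained convergence happens.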
A training loss that decreases smoothly indicates healthy training. A loss that spikes, plateaus early, or oscillates usually indicates a learning rate problem, data issue, or numerical instability — not a fundamental problem with the architecture. Practitioners treat the loss curve as the vital signs of the training run.
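The AdamW update described earlier in this section can also be sketched for a single scalar parameter. The function name `adamw_step` and the default hyperparameters are assumptions for illustration; real training code would use a library optimizer operating on whole tensors.

```python
import math


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * weight_decay * w                 # decoupled decay, applied to the weight
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) # normalized gradient step
    return w, m, v


# Two parameters whose gradients differ by a factor of 10,000 (made-up values).
w_small, _, _ = adamw_step(1.0, 0.01, 0.0, 0.0, t=1, weight_decay=0.0)
w_big, _, _ = adamw_step(1.0, 100.0, 0.0, 0.0, t=1, weight_decay=0.0)
print(w_small, w_big)  # both move by about lr = 1e-3

# With a zero gradient, only the decoupled decay changes the weight.
w_decayed, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
print(w_decayed)
```

The first two calls show the normalization at work: despite very different gradient magnitudes, both parameters take a step of nearly the same size. The last call shows that the decay acts on the weight itself, independent of the gradient statistics.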
Why does mixed-precision training use 32-bit floats for gradient accumulation even when forward passes use 16-bit?
A training run on 512 GPUs using data parallelism must communicate gradients after every step. Why is the speed of the GPU interconnect (NVLink, InfiniBand) critical here?
Back-of-Envelope: Model Memory
- Step 1. GPT-3 has 175 billion parameters. Each parameter stored in 16-bit float takes 2 bytes.
- Step 2. Calculate how many gigabytes of memory the parameters alone occupy.
- Step 3. AdamW stores two additional vectors per parameter (first and second moment estimates), each in 32-bit float (4 bytes each). Calculate the additional memory for optimizer state.
- Step 4. Add parameter memory and optimizer state memory. How many A100 GPUs (80 GB each) are needed just to hold these?
- Step 5. In practice, activations and intermediate computations also consume memory. Discuss: why does training a 175-billion-parameter model require not just many GPUs but many GPUs with fast interconnects?
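One way to check the arithmetic in Steps 1 to 4 is a few lines of Python. The variable names are illustrative, and the sketch assumes decimal gigabytes (1 GB = 10^9 bytes) and counts only parameters and optimizer state, ignoring gradients and activations.

```python
import math

GB = 10**9                      # assumption: decimal gigabytes

params = 175_000_000_000        # GPT-3 parameter count (Step 1)
bytes_per_param_fp16 = 2        # 16-bit parameter storage
moments_per_param = 2           # AdamW first and second moments
bytes_per_moment_fp32 = 4       # 32-bit optimizer state
gpu_memory_gb = 80              # one A100

# Step 2: memory for the parameters alone.
param_gb = params * bytes_per_param_fp16 / GB

# Step 3: memory for the two optimizer moment vectors.
optimizer_gb = params * moments_per_param * bytes_per_moment_fp32 / GB

# Step 4: total, and the minimum number of GPUs needed to hold it.
total_gb = param_gb + optimizer_gb
gpus_needed = math.ceil(total_gb / gpu_memory_gb)

print(param_gb)      # 350.0
print(optimizer_gb)  # 1400.0
print(total_gb)      # 1750.0
print(gpus_needed)   # 22
```

The optimizer state is four times larger than the 16-bit weights themselves, which is why sharding optimizer state across GPUs is such a common memory-saving measure.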