AI Foundations

The Forward Pass

You now have all the ingredients for a single neuron: a weighted sum, a bias, and a nonlinear activation. A full neural network is simply many neurons arranged into layers, with each layer's outputs feeding the next layer's inputs. The forward pass is the process of computing the network's output from a given input — data flows forward through every layer, each layer transforming the representation it receives, until the final layer produces a prediction. Understanding the forward pass precisely is essential before you can understand how the network learns, because learning is the process of adjusting parameters so that the forward pass produces better predictions.

Layer by Layer: A Worked Example

Consider a small network:

  - Input layer: 2 inputs, x = [1.0, –0.5]
  - Hidden layer: 2 neurons, each receiving both inputs
  - Output layer: 1 neuron, receiving both hidden-layer outputs

Hidden neuron 1 has weights [0.6, 0.4] and bias 0.1:

  z₁ = 0.6(1.0) + 0.4(–0.5) + 0.1 = 0.60 – 0.20 + 0.10 = 0.50
  a₁ = ReLU(0.50) = 0.50

Hidden neuron 2 has weights [–0.3, 0.8] and bias –0.2:

  z₂ = –0.3(1.0) + 0.8(–0.5) + (–0.2) = –0.30 – 0.40 – 0.20 = –0.90
  a₂ = ReLU(–0.90) = 0.00 (ReLU clips negative values to zero)

Hidden layer output vector: a = [0.50, 0.00]

Output neuron has weights [1.2, –0.7] and bias 0.3:

  z_out = 1.2(0.50) + (–0.7)(0.00) + 0.30 = 0.60 + 0.00 + 0.30 = 0.90
  output = 0.90 (assuming identity activation for a regression output)

The network has transformed the input [1.0, –0.5] into the output 0.90 through two sequential layers of computation. Notice that hidden neuron 2 did not fire: its pre-activation was negative, so ReLU set its activation to 0.00 and it contributed nothing to the output for this particular input. Different inputs will activate different subsets of neurons, which is part of how the network stores varied pattern information.
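The worked example can be checked with a few lines of code. The sketch below uses plain Python and mirrors the hand calculation neuron by neuron; the helper names `relu` and `neuron` are illustrative choices, not part of any library.

```python
def relu(z):
    """ReLU activation: pass positive values through, clip negatives to zero."""
    return max(0.0, z)


def neuron(weights, bias, inputs, activation):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


x = [1.0, -0.5]

# Hidden layer: two neurons, each receiving both inputs.
a1 = neuron([0.6, 0.4], 0.1, x, relu)    # z1 = 0.50, stays 0.50
a2 = neuron([-0.3, 0.8], -0.2, x, relu)  # z2 = -0.90, clipped to 0.00
hidden = [a1, a2]

# Output layer: one neuron with an identity activation.
output = neuron([1.2, -0.7], 0.3, hidden, lambda z: z)

# Rounded for display, since floating-point sums are not exact.
print([round(a, 2) for a in hidden])  # [0.5, 0.0]
print(round(output, 2))               # 0.9
```

Values are rounded before printing because floating-point arithmetic can leave tiny errors in the last digits (for example 0.8999999999999999 instead of 0.9).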

The Forward Pass as Function Composition

Mathematically, the forward pass is function composition. Each layer applies a function to its input (a linear transformation followed by an element-wise nonlinearity), and the output of each layer is the input to the next. The full network computes: output = fₙ(... f₂(f₁(x)) ...). The depth of this composition is what gives the network its representational power.
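To make the composition idea concrete, here is a minimal sketch in plain Python that builds each layer as a function and chains them. The helpers `make_layer` and `compose` are hypothetical names introduced for this illustration; the weights are the ones from the worked example.

```python
from functools import reduce


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def make_layer(weights, biases, activation):
    """Return a function f(x) that maps an input vector to this layer's output."""
    def layer(x):
        return [
            activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]
    return layer


def compose(*layers):
    """Chain layers left to right, so compose(f1, f2)(x) equals f2(f1(x))."""
    def network(x):
        return reduce(lambda acc, layer: layer(acc), layers, x)
    return network


f1 = make_layer([[0.6, 0.4], [-0.3, 0.8]], [0.1, -0.2], relu)  # hidden layer
f2 = make_layer([[1.2, -0.7]], [0.3], identity)                # output layer
network = compose(f1, f2)

print([round(v, 2) for v in network([1.0, -0.5])])  # [0.9]
```

Adding depth under this sketch means passing more layer functions to `compose`; nothing else about the forward pass changes.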

The matrix view makes the layer computation concise. For a layer with m neurons each receiving n inputs, collect all weights into an m × n matrix W and all biases into an m-dimensional vector b. The entire layer's pre-activations are:

  z = Wx + b

This is one matrix-vector multiplication followed by a vector addition. However many neurons the layer contains, it is expressed as a single operation that hardware can execute in parallel, rather than as a loop over individual neurons. After computing z, apply the activation function element-wise to every entry:

  a = f(z) (element-wise)

The result a is the layer's output vector, and it becomes x for the next layer. This repeats until the last layer.

Why does this matter? Because modern graphics processing units (GPUs) and specialized AI chips (TPUs, etc.) are designed to perform matrix multiplications extremely fast. Expressing the forward pass as a sequence of matrix operations is not just notation; it is a large part of why training a billion-parameter network is feasible at all. Hardware acceleration of matrix math is the engine under the hood of the AI revolution.
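The matrix form can be sketched directly. The example below is written in plain Python so that it runs without dependencies; `matvec` and `layer_forward` are illustrative helper names. In practice a numerical library would do the same work in one expression (with NumPy arrays, for instance, `W @ x + b`).

```python
def matvec(W, x):
    """Multiply an m x n matrix (a list of m rows) by an n-vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer_forward(W, b, x, f):
    """One layer: z = Wx + b, then a = f(z) applied element-wise."""
    z = [wx_i + b_i for wx_i, b_i in zip(matvec(W, x), b)]
    a = [f(z_i) for z_i in z]
    return z, a


def relu(z):
    return max(0.0, z)


# Weights from the worked example: each row of W holds one neuron's weights.
W1, b1 = [[0.6, 0.4], [-0.3, 0.8]], [0.1, -0.2]  # 2 x 2 hidden layer
W2, b2 = [[1.2, -0.7]], [0.3]                    # 1 x 2 output layer

x = [1.0, -0.5]
z_hidden, a_hidden = layer_forward(W1, b1, x, relu)
z_out, a_out = layer_forward(W2, b2, a_hidden, lambda z: z)

print([round(v, 2) for v in z_hidden])  # [0.5, -0.9]
print([round(v, 2) for v in a_hidden])  # [0.5, 0.0]
print([round(v, 2) for v in a_out])     # [0.9]
```

Note that the layer's shape is carried entirely by W: m rows give m outputs, and each row has n entries to match the n inputs.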

What Each Layer Represents

An important question: what does an intermediate layer's activation vector actually mean? In a network trained to classify images of animals, early layers might detect edges and colors; middle layers detect shapes and textures; later layers detect ears, eyes, and fur patterns. This hierarchy is not programmed — it emerges from training. Each layer learns to build on the previous layer's representation. For inputs far from images (text, numerical sensor data), the same principle applies: earlier layers extract simple patterns, later layers combine them into richer ones. The representation at any intermediate layer is sometimes called an embedding or a latent representation — it is the network's internal encoding of the input. You cannot read these intermediate activations the way you read plain language; they are high-dimensional vectors with no obvious human interpretation. Researchers use techniques like probing (training small classifiers on intermediate activations) and visualization (for image networks) to understand what is encoded at each layer. This interpretability challenge is one of the central open problems in AI research.

The Forward Pass Is Deterministic (Once Trained)

Given the same input and the same weights, a forward pass always produces the same output. In a standard feedforward network there is no randomness in inference, provided that any training-only random components are switched off. Randomness enters during training: in how weights are initialized, in which data examples are shown each step, and in certain regularization techniques. Understanding this distinction matters when you reason about reliability and debug neural network behavior.
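The distinction can be seen in a small experiment. The sketch below reuses the worked example's weights and adds dropout to the hidden layer as one example of a regularization technique that introduces randomness. Dropout is an assumption chosen for this illustration (the lesson does not name a specific technique), and the `forward` function with its `drop_rate` and `rng` parameters is hypothetical.

```python
import random

W1, b1 = [[0.6, 0.4], [-0.3, 0.8]], [0.1, -0.2]
W2, b2 = [[1.2, -0.7]], [0.3]


def affine(W, b, x):
    """Compute Wx + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, drop_rate=0.0, rng=None):
    """Forward pass; if drop_rate > 0, randomly zero hidden units (dropout)."""
    rng = rng or random
    hidden = [max(0.0, z) for z in affine(W1, b1, x)]
    if drop_rate > 0.0:
        keep = 1.0 - drop_rate
        # Kept units are scaled by 1/keep so the expected activation is unchanged.
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return affine(W2, b2, hidden)[0]


x = [1.0, -0.5]

# Inference: no random component, so repeated calls agree exactly.
print(forward(x) == forward(x))  # True

# Training-style pass: the output depends on which units were dropped.
rng = random.Random(0)
samples = [round(forward(x, drop_rate=0.5, rng=rng), 2) for _ in range(5)]
print(samples)  # each entry is 0.3 or 1.5
```

With dropout active, the only unit that matters here is hidden neuron 1: if it is dropped the output is 0.3, and if it is kept (and scaled to 1.0) the output is 1.5.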

Flashcards

In the worked example, hidden neuron 2 produced an activation of 0.00. What caused this?

Why is expressing a layer as the matrix operation z = Wx + b important beyond just notation?

Forward Pass by Hand

  1. Design a three-layer network: 2 inputs, 3 hidden neurons, 1 output.
  2. Make up your own weights and biases (use simple numbers like 0.5, –0.3, 1.0).
  3. Choose any input vector x = [x₁, x₂].
  4. Compute the full forward pass:
     - Step 1: Compute z for each of the 3 hidden neurons. Apply ReLU to each. Record the hidden activation vector a.
     - Step 2: Feed a into the output neuron. Compute its z and apply an identity activation.
  5. Write out every multiplication and addition explicitly, with no calculator shortcuts.
  6. Experiment: change one weight and re-run. How much did the output change? Which weight had the biggest effect, and why?
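Once you have worked the exercise by hand, a short script can check your arithmetic and automate the experiment step. The sketch below is one possible checker, not a reference solution: the weights, biases, and input are made-up values, and `forward` and `nudge` are hypothetical helper names. Substitute your own numbers.

```python
import copy

# Made-up parameters for a 2-input, 3-hidden-neuron, 1-output network.
params = {
    "W1": [[0.5, -0.3], [1.0, 0.5], [-0.5, 1.0]],  # 3 x 2
    "b1": [0.0, 0.1, -0.1],
    "W2": [[1.0, -0.5, 0.3]],                      # 1 x 3
    "b2": [0.2],
}
x = [2.0, 1.0]


def forward(p, x):
    """ReLU hidden layer followed by a single identity output neuron."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(p["W1"], p["b1"])
    ]
    return sum(w * h for w, h in zip(p["W2"][0], hidden)) + p["b2"][0]


def nudge(p, x, name, i, j, delta=0.1):
    """Return how much the output changes when one weight grows by delta."""
    changed = copy.deepcopy(p)  # leave the original parameters untouched
    changed[name][i][j] += delta
    return forward(changed, x) - forward(p, x)


print(round(forward(params, x), 2))  # -0.4

# Try every hidden-layer weight in turn and compare the effects.
for i in range(3):
    for j in range(2):
        effect = nudge(params, x, "W1", i, j)
        print(f"W1[{i}][{j}] +0.1 changes the output by {effect:+.3f}")
```

When you compare the effects, look at two things: the size of the input that the weight multiplies, and whether the neuron it belongs to is active. A weight feeding a neuron that ReLU has clipped to zero can have little or no effect on the output.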