Activation Functions
You now know how a neuron computes a weighted sum z = w · x + b. But if the neuron simply output z directly, a stack of such neurons would still only compute a linear function — and the XOR problem showed us that linear functions are fundamentally limited. The activation function is the component that breaks linearity. It is a deliberate mathematical nonlinearity applied to z before the output is passed to the next layer. Without it, no matter how many layers you stack, the entire network collapses to the equivalent of a single layer. With it, depth becomes genuinely powerful.
The Need for Nonlinearity
Here is the core mathematical argument. Suppose every neuron simply computes f(z) = z (the identity activation). Then a two-layer network with weight matrices W₁ and W₂ computes:
output = W₂ · (W₁ · x) = (W₂W₁) · x
The product W₂W₁ is just another matrix; call it W. So the two-layer network computes W · x, exactly what a single-layer network computes. You gained nothing from adding the second layer. Including the biases does not change the conclusion: W₂ · (W₁ · x + b₁) + b₂ = (W₂W₁) · x + (W₂ · b₁ + b₂), which is still one weight matrix and one bias vector.
This argument extends: n layers of linear transformations compose into one linear transformation. Depth with linear activations is redundant depth. The network cannot learn anything more complex than a single-layer linear network, no matter how many layers you add.
A nonlinear activation breaks this collapse. When f is nonlinear, W₂ · f(W₁ · x) cannot in general be rewritten as a single linear transformation. Alternating linear and nonlinear steps (linear, nonlinear, linear, nonlinear, and so on) creates a genuinely more expressive function class.
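The collapse argument can be checked numerically. The sketch below uses plain Python lists and small example weights that are made up purely for illustration (W1, W2, and x are not from the lesson). It compares a two-layer linear network with its single-matrix equivalent, then shows that inserting ReLU between the layers breaks a property every linear map must have, namely f(x) + f(–x) = 0.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    """Apply ReLU to every element of a vector."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]
neg_x = [-xi for xi in x]

# Two linear layers versus the single collapsed matrix W = W2 W1.
two_layer_linear = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)
print(two_layer_linear, collapsed)  # both print [-3.5, 10.5]

# With ReLU between the layers, f(x) + f(-x) is no longer zero,
# so the network is not a linear map.
def relu_net(v):
    return matvec(W2, relu_vec(matvec(W1, v)))

symmetry_gap = [a + b for a, b in zip(relu_net(x), relu_net(neg_x))]
print(relu_net(x), symmetry_gap)  # [2.5, 7.5] [8.5, 4.5]
```

Any other weights would show the same collapse in the linear case. The ReLU result depends on the particular numbers, but the symmetry gap is nonzero for almost any choice.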
A neural network with even a single hidden layer and a suitable nonlinear activation function (sigmoid and ReLU both qualify) can approximate any continuous function on a closed, bounded region of inputs to arbitrary precision, given enough neurons. This result (proved in various forms by Cybenko, Hornik, and others in the late 1980s) is called the universal approximation theorem. It tells you that nonlinear activations are not just helpful; they are the ingredient that makes the network a general-purpose function approximator. The theorem says such a network exists; it does not say how to find it or how large it needs to be.
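To get a feel for what "given enough neurons" means, here is a minimal sketch for one-dimensional inputs. It is an illustration, not the theorem's proof. It builds a one-hidden-layer ReLU network by hand, with no training, so that the network linearly interpolates a target function at evenly spaced points. The helper names build_relu_net and max_error are hypothetical, and sin on [0, π] is an arbitrary choice of target.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, a, b, n_hidden):
    """Return g(x), a one-hidden-layer ReLU network with n_hidden neurons
    that linearly interpolates f at evenly spaced knots on [a, b]."""
    h = (b - a) / n_hidden
    knots = [a + i * h for i in range(n_hidden + 1)]
    # Slope of the interpolant on each segment between neighboring knots.
    slopes = [(f(knots[i + 1]) - f(knots[i])) / h for i in range(n_hidden)]
    # Hidden neuron i computes relu(x - knots[i]); its output weight is
    # the change in slope that happens at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    bias = f(a)

    def g(x):
        return bias + sum(w * relu(x - k) for w, k in zip(out_weights, knots))

    return g

def max_error(f, g, a, b, samples=1000):
    """Largest |f(x) - g(x)| over an evenly spaced grid on [a, b]."""
    grid = (a + (b - a) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - g(x)) for x in grid)

for n in (4, 16, 64):
    g = build_relu_net(math.sin, 0.0, math.pi, n)
    print(n, "hidden neurons, max error:", round(max_error(math.sin, g, 0.0, math.pi), 5))
```

The error shrinks as hidden neurons are added, which is the behavior the theorem promises. In practice the weights are learned rather than constructed, and nothing here says how many neurons a harder function would need.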
Historically, the sigmoid function σ(z) = 1 / (1 + e^(–z)) was the standard activation. It is smooth, differentiable everywhere, and outputs a value between 0 and 1, which is naturally interpretable as a probability. For z = 0, σ = 0.5. For large positive z, σ approaches 1. For large negative z, σ approaches 0.
Concrete computation:
- σ(2.0) = 1 / (1 + e^(–2)) ≈ 1 / (1 + 0.135) ≈ 0.881
- σ(–1.5) = 1 / (1 + e^(1.5)) ≈ 1 / (1 + 4.48) ≈ 0.182
Sigmoid has a significant practical problem: when |z| is large, the function is nearly flat and its slope approaches zero. Even at its steepest point, z = 0, the slope is only 0.25. During training, the size of each weight update depends on the slope of the activation function, and in a deep network the slopes of successive layers are multiplied together. Near-zero slopes mean near-zero updates, so the network stops learning. This is called the vanishing gradient problem, and it severely hampered training of deep networks throughout the 1990s and early 2000s.
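The numbers above can be reproduced with a few lines of Python. This sketch also prints the sigmoid's slope, using the standard identity σ'(z) = σ(z) · (1 – σ(z)), to show how quickly it flattens. The final line is a simplification: it multiplies the best-case slope over ten layers and ignores the weights, which also enter the real gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -1.5, 0.0, 2.0, 10.0):
    print(f"z = {z:6.1f}   sigmoid = {sigmoid(z):.4f}   slope = {sigmoid_slope(z):.6f}")

# Best case: every layer sits at z = 0, where the slope peaks at 0.25.
# Ten such slope factors multiplied together are already tiny.
best_case_factor = sigmoid_slope(0.0) ** 10
print(f"slope factor through 10 layers: {best_case_factor:.2e}")
```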
ReLU: The Simple Fix That Changed Everything
The Rectified Linear Unit (ReLU) is defined simply as:
ReLU(z) = max(0, z)
If z is positive, output z unchanged. If z is zero or negative, output 0. Concrete examples:
- ReLU(3.7) = 3.7
- ReLU(0.0) = 0.0
- ReLU(–2.1) = 0.0
- ReLU(0.01) = 0.01
ReLU's slope is 1 for all positive z and 0 for all negative z. In the positive region the slope never shrinks, so gradients propagate through positive-activation neurons without diminishing. This largely eliminates the vanishing gradient problem in the positive region, making very deep networks trainable.
ReLU is not perfect. When z is negative, both the output and the slope are zero, so the neuron is completely 'dead' to that input. If a neuron's pre-activation is consistently negative across all training examples, its weights receive zero gradient and it never updates. This is known as the dying ReLU problem. Variants such as Leaky ReLU (which allows a small slope for negative z) and ELU address this, though ReLU remains the default for hidden layers in most architectures.
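A short sketch of ReLU and Leaky ReLU alongside their slopes makes the dying-ReLU issue concrete. Two details here are assumptions rather than part of the lesson: the slope at exactly z = 0 is taken to be 0 (a common convention, since ReLU has no true derivative there), and the Leaky ReLU slope alpha = 0.01 is just a commonly used value.

```python
def relu(z):
    return max(0.0, z)

def relu_slope(z):
    """1 for positive z, 0 otherwise (slope at z = 0 taken as 0 by convention)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs are scaled by alpha instead of zeroed."""
    return z if z > 0 else alpha * z

def leaky_relu_slope(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

for z in (3.7, 0.0, -2.1, 0.01):
    print(f"z = {z:5.2f}   relu = {relu(z):.3f} (slope {relu_slope(z)})   "
          f"leaky = {leaky_relu(z):.3f} (slope {leaky_relu_slope(z)})")
```

For z = –2.1, ReLU's slope is exactly 0, so no gradient flows back through that neuron for this input. Leaky ReLU keeps a small nonzero slope, which gives the neuron a chance to recover.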
The activation function for the output layer is chosen based on what kind of answer you need. For binary classification (yes/no), sigmoid gives a value between 0 and 1 interpretable as a probability. For multi-class classification (which of 10 categories?), softmax normalizes a vector of values into a probability distribution. For regression (predicting a continuous number), the output layer often has no activation at all — the raw linear output is the prediction. Choosing the right output activation is part of designing a network, not an afterthought.
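The three output choices can be sketched side by side. The raw scores below are invented for illustration. Subtracting the maximum score inside softmax is a standard numerical-stability trick that does not change the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the max so exp() never sees a large positive number
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Binary classification: one raw score becomes one probability.
print("sigmoid(0.8) =", round(sigmoid(0.8), 3))

# Multi-class classification: one raw score per class becomes a distribution.
scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for three classes
probs = softmax(scores)
print("softmax =", [round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# Regression: no activation, the raw linear output is the prediction.
raw_output = 3.42  # hypothetical value of w . x + b
print("regression prediction =", raw_output)
```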
Match each activation function to its defining property.
Why does stacking layers with linear activations fail to improve a network's expressiveness?
A neuron computes z = –0.4. If its activation function is ReLU, what is its output?
Trace Activation Functions on a Number Line
- Draw a horizontal number line from –3 to 3. Mark integer values.
- For each of the five z values below, compute and record the output of BOTH sigmoid and ReLU:
- z = –2.5, z = –1.0, z = 0.0, z = 1.0, z = 2.5
- For sigmoid, use the approximations: σ(–2.5) ≈ 0.076, σ(–1.0) ≈ 0.269, σ(0.0) = 0.500, σ(1.0) ≈ 0.731, σ(2.5) ≈ 0.924.
- Plot the five (z, sigmoid output) pairs and connect them — you should see an S-curve.
- Plot the five (z, ReLU output) pairs and connect them — you should see a hockey-stick shape.
- Now estimate the slope of each function between z = 2.0 and z = 2.5. For the sigmoid at z = 2.0, which is not one of your plotted points, use σ(2.0) ≈ 0.881. Which function has a steeper slope in that region? What does this imply for gradient-based learning in deep networks?
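Once you have worked through the exercise by hand, a short script like the following can be used to check your table and your slope estimates. It is only a checking aid. The slope is estimated as rise over run between the two points, which is the same calculation you would do on paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# Table of outputs for the five z values in the exercise.
for z in (-2.5, -1.0, 0.0, 1.0, 2.5):
    print(f"z = {z:5.1f}   sigmoid = {sigmoid(z):.3f}   relu = {relu(z):.1f}")

def secant_slope(f, z_left, z_right):
    """Rise over run between two points on the curve of f."""
    return (f(z_right) - f(z_left)) / (z_right - z_left)

sigmoid_estimate = secant_slope(sigmoid, 2.0, 2.5)
relu_estimate = secant_slope(relu, 2.0, 2.5)
print(f"slope between 2.0 and 2.5: sigmoid = {sigmoid_estimate:.4f}, relu = {relu_estimate:.4f}")
```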