Module Check: Deep Neural Networks, In Depth
You have covered the full machinery of deep neural networks — from the motivation for depth, through the precise mathematics of neurons, forward passes, and loss functions, to the algorithms of gradient descent and backpropagation, and the practical craft of regularization and training. This module check consolidates and tests that knowledge. Work through each section carefully. The questions range from precise recall to multi-step reasoning. The capstone activity asks you to synthesize everything into a design decision.
Flashcards
A neuron has weights [1.0, -2.0], inputs [0.5, 1.5], and bias 0.5. It uses ReLU. What is its activation?
You are training a regression model (predicting house prices). Which loss function is appropriate, and why is cross-entropy unsuitable?
During backpropagation, the error signal at output neuron j is delta_j^(L) = 0.5. The weight connecting neuron k in layer L-1 to neuron j in layer L is W_jk^(L) = -0.8, and the activation of neuron k is h_k^(L-1) = 1.2. What is the gradient of the loss with respect to W_jk^(L)?
A model achieves 0.01 training loss and 1.8 validation loss. Adding L2 weight decay changes this to 0.05 training loss and 0.4 validation loss. What happened, and is this a good trade?
Why does mini-batch gradient descent sometimes converge to better solutions than full-batch gradient descent on the same problem?
You remove all activation functions from a 10-layer network, keeping only weights and biases. What is the effective depth of the resulting model?
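To check your answers to the two numeric cards above, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not a reference solution: the function names `neuron_activation` and `weight_gradient` are hypothetical, and the gradient rule assumes the error signal is defined as delta_j = dL/dz_j, so that dL/dW_jk = delta_j * h_k.

```python
def relu(z):
    """Rectified linear unit: passes positive values, clamps negatives to 0."""
    return max(0.0, z)


def neuron_activation(weights, inputs, bias):
    """Return (z, a) for one neuron, with z = w^T x + b and a = ReLU(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, relu(z)


def weight_gradient(delta_j, h_k):
    """Gradient of the loss w.r.t. one weight, ASSUMING delta_j = dL/dz_j.

    The current value of the weight itself does not enter this product.
    """
    return delta_j * h_k


# Card 1: weights [1.0, -2.0], inputs [0.5, 1.5], bias 0.5
z, a = neuron_activation([1.0, -2.0], [0.5, 1.5], 0.5)
print(z, a)  # -2.0 0.0

# Card 3: delta = 0.5, upstream activation h_k = 1.2 (W = -0.8 is not needed)
grad = weight_gradient(0.5, 1.2)
print(grad)  # 0.6
```

Note that the weight value -0.8 never appears in the gradient for that weight. It would matter only when propagating the error signal further back to layer L-1.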
Deep networks are powerful because depth enables compositional representation learning: each layer builds on the features computed by the last. A neuron computes z = w^T x + b, then a = sigma(z); without the nonlinearity sigma, any stack of layers collapses to a single affine map. The forward pass propagates the input through chained matrix multiplications and activations. Training minimizes a loss function (MSE for regression, cross-entropy for classification) by gradient descent, which requires the gradient dL/dw for every weight. Backpropagation computes all of these gradients in a single backward pass by applying the chain rule layer by layer. Deep networks can overfit because they have enough capacity to memorize the training data; dropout, weight decay, and early stopping counteract this by biasing training toward simpler, more generalizable solutions. Together, these four components (architecture, loss, optimization, regularization) make up the deep learning training framework covered in this module.
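The claim that layers without activations collapse to one can be checked numerically. The sketch below builds ten affine layers with random weights, runs an input through them one by one, and compares the result with a single affine map whose weight matrix and bias are folded together from all ten. The layer width, seed, and helper names are illustrative assumptions.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def vadd(u, v):
    """Elementwise vector addition."""
    return [a + b for a, b in zip(u, v)]


random.seed(0)
dim, depth = 3, 10  # illustrative sizes, not from the module
layers = [
    (
        [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)],
        [random.uniform(-1, 1) for _ in range(dim)],
    )
    for _ in range(depth)
]
x = [0.5, -1.0, 2.0]

# Forward pass through all ten layers with NO activation function.
h = x
for W, b in layers:
    h = vadd(matvec(W, h), b)

# Fold the ten layers into one affine map: W_eff x + b_eff.
W_eff = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
b_eff = [0.0] * dim
for W, b in layers:
    W_eff = matmul(W, W_eff)            # compose the weight matrices
    b_eff = vadd(matvec(W, b_eff), b)   # carry the accumulated bias forward

collapsed = vadd(matvec(W_eff, x), b_eff)
same = all(
    math.isclose(p, q, rel_tol=1e-9, abs_tol=1e-9) for p, q in zip(h, collapsed)
)
print(same)
```

Inserting a nonlinearity such as ReLU between the layers breaks this folding step, which is why the activation function is what gives depth its expressive power.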
Design a Network for a New Problem
- Step 1: You are building a neural network to classify audio clips as one of 5 sound categories (alarm, music, speech, nature, silence). Each clip is represented by 200 numerical features extracted from the audio waveform.
- Step 2: Specify your network architecture: number of hidden layers, neurons per layer, activation function for hidden layers, activation function for the output layer. Write the shape of each weight matrix W^(l).
- Step 3: Choose a loss function and justify your choice based on the nature of the target.
- Step 4: Specify your optimizer settings: batch size, initial learning rate.
- Step 5: Specify two regularization techniques you will use and their hyperparameter values.
- Step 6: Describe what a healthy training curve would look like at epoch 1, epoch 10, and epoch 30. What loss values would concern you?
- Step 7: Describe one specific failure mode (with its symptoms in the training log) that you would watch for, and what you would do to fix it.
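As a starting point for Steps 2, 3, and 6, the sketch below works out weight-matrix shapes, a parameter count, and the cross-entropy loss of an untrained network that predicts all 5 classes with equal probability. The hidden-layer sizes (128 and 64) are a hypothetical choice for illustration only, not a recommended answer, and the shapes assume the convention z = W x + b, so each W^(l) has shape (outputs, inputs).

```python
import math

# HYPOTHETICAL architecture: 200 input features, two hidden layers, 5 classes.
layer_sizes = [200, 128, 64, 5]


def weight_shapes(sizes):
    """Shape (n_out, n_in) of each W^(l) under the convention z = W x + b."""
    return [(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def parameter_count(sizes):
    """Total number of weights plus biases across all layers."""
    return sum(n_out * n_in + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy loss for a single example with an integer class label."""
    return -math.log(softmax(logits)[target_index])


shapes = weight_shapes(layer_sizes)
n_params = parameter_count(layer_sizes)
# Equal logits give a uniform prediction, i.e. chance-level performance.
chance_loss = cross_entropy([0.0] * 5, target_index=2)

print(shapes)                 # [(128, 200), (64, 128), (5, 64)]
print(n_params)               # 34309
print(round(chance_loss, 4))  # 1.6094
```

The chance-level value ln 5 ≈ 1.61 is a useful reference for Step 6: a loss near it at epoch 1 is expected, while a training loss that stays near it after many epochs suggests the network is not learning.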