Skip to main content
Machine Learning & Deep Learning

⏱ About 20 min20 XP

Convolutional Networks for Vision

Imagine trying to teach someone to recognize a cat by handing them a list of six million numbers — the raw pixel values of one photograph. A fully connected neural network does essentially this: it treats each pixel as an independent input and learns a weight connecting every pixel to every neuron. For a 1000×1000 image with three color channels, that first layer alone contains three billion weights. Training becomes computationally intractable, and worse, the network has no built-in understanding that nearby pixels are related. Convolutional neural networks (CNNs) solve both problems elegantly.

Convolution: Sliding a Filter Across an Image

A convolution is a mathematical operation that applies a small grid of numbers — called a filter or kernel — to every local patch of an image. Concretely, imagine a 3×3 filter: a grid of nine learned weights. The network places this filter over the top-left 3×3 patch of the image, multiplies each filter weight by the corresponding pixel value, sums all nine products, and writes the result into an output grid. It then slides the filter one step to the right and repeats — across the entire image, from left to right, top to bottom. The resulting output grid is called a feature map. If the filter learned to detect vertical edges, the feature map will have high values wherever vertical edges appear in the image and low values elsewhere. One convolutional layer typically applies dozens of different filters in parallel, producing dozens of feature maps — each one highlighting a different low-level pattern such as horizontal edges, diagonal lines, or color gradients. The critical insight: a 3×3 filter has only nine weights no matter how large the image is. The same filter slides across the whole image (this is called weight sharing), so the parameter count is tiny compared with a fully connected approach. A filter that learns to detect edges will detect them wherever they appear — top-left corner or bottom-right — because the same weights are applied everywhere. This property is called translational equivariance.

Convolution in One Sentence

A convolutional layer replaces the question 'what is pixel (i, j) doing?' with 'what pattern is present in the neighborhood around position (i, j)?' — and it asks the same question everywhere in the image using the same small set of learned weights.

After each convolutional layer, networks typically apply a nonlinear activation function — almost universally ReLU (Rectified Linear Unit), which replaces every negative value with zero. Without nonlinearity, stacking layers would collapse to a single linear transformation no more powerful than one layer. Pooling layers then downsample the feature maps. Max pooling, the most common variant, divides the feature map into small non-overlapping windows (usually 2×2) and keeps only the maximum value in each window. This halves the spatial dimensions, reducing computation and making the representation somewhat insensitive to small shifts in position — a property called translational invariance. A typical CNN architecture alternates: [Conv → ReLU → Conv → ReLU → MaxPool] repeated several times, gradually increasing the number of filters while shrinking spatial size. Deep in the network, the feature maps represent abstract concepts — 'this region looks like a wheel' rather than 'this region has a horizontal edge.' The final feature maps are flattened into a vector and fed into one or more fully connected layers that produce the classification output.

Match each term to its definition.

Terms

Filter (kernel)
Feature map
Weight sharing
Max pooling
ReLU

Definitions

The output grid produced by applying one filter to an entire image
Using the same filter weights at every spatial position
Activation function that replaces negative values with zero
A small grid of learned weights slid across the input to detect a pattern
Keeping only the largest value in each small spatial window

Drag terms onto their definitions, or click a term then click a definition to match.

Why CNNs Suit Images

Three structural properties of natural images align perfectly with how CNNs work. Locality: meaningful visual patterns — edges, textures, corners — are local. A cat's whisker is defined by a small patch of pixels, not by a relationship between pixels in opposite corners. Convolutional filters exploit this by looking only at local neighborhoods. Translation invariance: a cat is a cat whether it appears in the upper-left or lower-right of the frame. Weight sharing plus pooling gives CNNs approximate invariance to such shifts. Hierarchy of features: low-level filters detect edges; mid-level combinations of edges detect shapes; high-level combinations of shapes detect object parts. Stacking convolutional layers naturally builds this hierarchy. The famous AlexNet (2012) demonstrated that deep CNNs could dramatically outperform all prior computer vision methods on the ImageNet benchmark — 1.2 million images, 1000 categories — kicking off the modern deep learning era. CNNs are not magic. They require large labeled datasets to train from scratch, they can fail when images are rotated or scaled in ways not covered by training data, and they offer little interpretability. But for supervised image classification, detection, and segmentation they remain the dominant approach or serve as the backbone for more complex systems.

CNNs Are Not General Learners

A CNN trained on photographs of animals has no knowledge of X-rays, satellite images, or histology slides. Locality and hierarchy are inductive biases that match natural photographs well, but applying a pretrained CNN to a radically different image domain without adaptation often yields poor results.

Why does weight sharing in a convolutional layer dramatically reduce the number of parameters compared with a fully connected layer?

A CNN is trained on daytime street photographs and then deployed to classify nighttime infrared images. What is the most likely failure mode?

Design a Filter by Hand

  1. Step 1. Draw a 5×5 grid on paper representing a tiny grayscale image. Fill the cells with numbers 0-9 representing pixel brightness.
  2. Step 2. Below it, draw a 3×3 grid and write the following weights: top row [−1, −1, −1], middle row [0, 0, 0], bottom row [1, 1, 1]. This is a horizontal-edge detector (Sobel filter variant).
  3. Step 3. Apply the filter manually to the top-left 3×3 patch: multiply each filter weight by the matching pixel, sum the nine products, write the result.
  4. Step 4. Slide one column right and repeat.
  5. Step 5. Where are the large positive values in your output? Where are the large negative values? What do they correspond to in your image?
  6. Step 6. Discuss: what would you change in the filter to detect vertical edges instead?