Machine Learning & Deep Learning

⏱ About 15 min · 15 XP

Networks That See

Hold up any photograph and your brain instantly knows what is in it: a dog, a street, a birthday cake. That takes well under a second. Teaching a computer to do the same thing was considered nearly impossible for decades. Then, in the early 2010s, a class of deep networks called convolutional neural networks made the problem tractable, and computer vision changed within a few years.

The Problem With Raw Pixels

A digital image is just a grid of pixels. A small photo might be 224 pixels wide by 224 pixels tall, with three color channels (red, green, blue). That is 224 × 224 × 3 = 150,528 numbers per image. If you hand all those numbers directly to a basic (fully connected) neural network, two problems arise immediately. First, the network has no built-in notion that nearby pixels are related: it treats the pixel at position (10, 10) and the pixel at position (11, 10) as completely unconnected inputs. Second, if the cat in the photo moves one pixel to the right, the network sees a substantially different input and might fail to recognize it. A convolutional neural network, or CNN for short, addresses both problems with a clever trick called a filter.
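To make the two problems concrete, here is a minimal Python sketch. The 4×4 single-channel "image" and the helper names `flatten` and `shift_right` are made up for illustration; they are not part of any library. The sketch first checks the pixel arithmetic, then shows how a one-pixel shift changes the flat list of numbers a basic network would receive.

```python
# Pixel count for the 224 x 224 RGB photo described in the text.
WIDTH, HEIGHT, CHANNELS = 224, 224, 3
values_per_image = WIDTH * HEIGHT * CHANNELS
print(values_per_image)  # 150528


def flatten(image):
    """Turn a 2D grid into the flat list a basic network would see."""
    return [pixel for row in image for pixel in row]


def shift_right(image):
    """Move every pixel one step to the right, padding the left edge with 0."""
    return [[0] + row[:-1] for row in image]


# Hypothetical tiny image: a bright 2x2 blob on a dark background.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

shifted = shift_right(image)

# Count how many input positions hold a different value after the shift.
changed = sum(1 for a, b in zip(flatten(image), flatten(shifted)) if a != b)
print(changed)  # 4
```

The blob itself is identical in both images, yet 4 of the 16 input positions now hold different values. A network that treats each position as an independent input has to learn the blob separately for every location.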

Definition: Convolutional Neural Network (CNN)

A CNN is a deep neural network designed for grid-like data such as images. It uses learnable filters that slide across the image, detecting local patterns (edges, textures, shapes) regardless of where in the image those patterns appear. This makes CNNs both efficient and location-flexible.
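One way to see the efficiency claim is to count weights. The sketch below compares a fully connected layer with a convolutional layer on the 224 × 224 × 3 image from earlier. The layer sizes (1,000 hidden units, 64 filters) are assumptions chosen for illustration, not figures from this lesson, and bias terms are ignored.

```python
# Input size from the lesson: 224 x 224 pixels, 3 color channels.
inputs = 224 * 224 * 3

# Assumption: a fully connected layer with 1,000 hidden units.
hidden_units = 1000
dense_weights = inputs * hidden_units  # one weight per (input, unit) pair

# Assumption: a convolutional layer with 64 filters, each 3 x 3 across 3 channels.
num_filters = 64
weights_per_filter = 3 * 3 * 3
conv_weights = num_filters * weights_per_filter  # reused at every image position

print(dense_weights)  # 150528000
print(conv_weights)   # 1728
```

Because each filter reuses the same small set of weights at every position, the convolutional layer needs far fewer learned numbers than the fully connected one under these assumptions.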

Picture a small 3×3 magnifying square sliding across the entire image, one step at a time. That square is a filter. At each position, it multiplies the pixel values under it by a set of learned numbers (its weights) and adds the products up, producing a single output value. Slide the filter across the whole image and you get a new, slightly smaller grid of values called a feature map.

One filter might learn to detect horizontal edges. Another learns vertical edges. A third learns diagonal lines. Stack dozens of filters per layer, then stack many layers, and the network builds up from edges to textures to parts to objects, exactly the hierarchy described in Lesson 1.

A famous CNN called AlexNet, trained on 1.2 million images in 2012, cut the top-5 error rate on the ImageNet recognition contest to about 15%, compared with about 26% for the best competing entry. That single result convinced much of the research world that deep learning was real.
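The sliding-filter operation can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how a real framework implements it: the function name `slide_filter` is hypothetical, the filter weights are set by hand rather than learned, there is one channel instead of three, and the filter moves one pixel per step with no padding.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` and return the resulting feature map.

    At each position, multiply the pixels under the kernel by the kernel's
    weights and add the products up (stride 1, no padding).
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for top in range(out_rows):
        row = []
        for left in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 9, 9, 9] for _ in range(5)]

# Hand-set weights that respond to dark-to-bright changes from left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Hand-set weights that respond to dark-to-bright changes from top to bottom.
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

vertical_map = slide_filter(image, vertical_edge)
horizontal_map = slide_filter(image, horizontal_edge)

print(vertical_map)    # [[27, 27, 0], [27, 27, 0], [27, 27, 0]]
print(horizontal_map)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The 5×5 image shrinks to a 3×3 feature map. The vertical-edge filter responds strongly wherever it overlaps the dark-to-bright boundary and gives 0 over the uniform bright region, while the horizontal-edge filter finds nothing because every row of this image is identical.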

Computer Vision in the Real World

CNNs now power applications you encounter every day.

Medical imaging: CNNs scan X-rays and MRI scans to flag potential tumors, in some studies matching or exceeding specialist accuracy on specific tasks.

Self-driving cars: cameras feed into CNNs that identify pedestrians, lane markings, traffic signs, and other vehicles in real time.

Face unlock: your phone's front camera runs a CNN in milliseconds to confirm your identity.

Content moderation: social platforms use CNNs to automatically detect and remove violent or abusive images before a human reviewer even sees them.

Each of these uses the same core idea, filters sliding over pixels, scaled up with massive datasets and specialized hardware.

Seeing Is Not Understanding

A CNN that achieves 99% accuracy on a benchmark does not 'see' the way you do. It has learned statistical patterns in pixels. Change the lighting dramatically, rotate or flip the image, or add a barely visible layer of carefully crafted (adversarial) noise and accuracy can drop sharply. These weaknesses matter in safety-critical uses like medical diagnosis and autonomous driving.

Flashcards

What does a CNN filter do as it slides across an image?

Why did AlexNet's 2012 result matter so much to the AI field?

Be the Filter

  1. Draw a 6×6 grid and fill each cell with a random number from 0 to 9 — this is your tiny 'image.'
  2. Draw a 2×2 filter grid and write these four weights in it: top-left=1, top-right=0, bottom-left=0, bottom-right=1.
  3. Slide your filter across the image, one step at a time, and compute the sum of (weight × pixel) for each position.
  4. Record your output values in a new, smaller grid — this is your feature map.
  5. What do you notice? Does the filter respond more strongly to certain parts of the image?
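If you would like to check your hand calculation, the steps above can be sketched in Python. The function name `apply_filter` is hypothetical, and the random seed is fixed only so the same grid appears on every run; swap in your own 6×6 numbers to compare against your paper version.

```python
import random


def apply_filter(image, kernel):
    """Slide `kernel` over `image` one step at a time (no padding)."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]


# Step 1: a 6x6 'image' of random digits from 0 to 9.
random.seed(0)  # fixed seed so the grid is reproducible
image = [[random.randint(0, 9) for _ in range(6)] for _ in range(6)]

# Step 2: the 2x2 filter from the exercise.
kernel = [[1, 0],
          [0, 1]]

# Steps 3 and 4: slide the filter and record the feature map.
feature_map = apply_filter(image, kernel)

for row in image:
    print(row)
print()
for row in feature_map:
    print(row)
```

A 2×2 filter fits in 5 positions across and 5 positions down a 6×6 grid, so the feature map is 5×5. With these particular weights, each output value is the top-left pixel plus the bottom-right pixel of the patch under the filter, which is worth keeping in mind for step 5.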