Robotics & Embodied AI

⏱ About 15 min · 15 XP

Cameras and Computer Vision

Of all the senses available to a robot, vision is simultaneously the richest and the most challenging. A single camera frame contains millions of numbers — yet most of them are unimportant for the task at hand. Computer vision is the field that develops algorithms to sift through those millions of numbers and extract what matters: where objects are, what they are, how they are moving, and what is happening in the scene.

How a Camera Works

A digital camera contains a sensor made of millions of tiny light-detecting elements called pixels (short for picture elements). Each pixel measures the intensity of light hitting it and records a number. A color image stores three numbers per pixel, one each for red, green, and blue. (Most sensors actually measure a single color at each element through a mosaic of color filters, and the camera estimates the other two from neighboring elements.) A 12-megapixel camera therefore produces about 36 million numbers for every single image it captures. The lens focuses light onto the sensor, and the shutter controls how long the sensor is exposed. Longer exposure collects more light but blurs moving objects. Shorter exposure freezes motion but may produce a dark image. These tradeoffs are part of what makes vision engineering interesting: there is no single perfect setting for every situation.
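As a rough check on that arithmetic, the short Python sketch below counts the values in one frame. The function name `values_per_frame` and the 30 frames-per-second figure are illustrative assumptions, not part of any camera API.

```python
def values_per_frame(megapixels: float, channels: int = 3) -> int:
    """Number of stored values in one image: pixels times color channels."""
    pixels = round(megapixels * 1_000_000)
    return pixels * channels


frame_values = values_per_frame(12)      # 12-megapixel color image
print(frame_values)                      # 36000000

# Assumed frame rate, purely for illustration: at 30 frames per second
# the stream carries over a billion values every second.
print(frame_values * 30)                 # 1080000000
```

Numbers of this size are one reason vision algorithms try to discard unimportant pixels as early as possible.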

Pixel Values

In a typical 8-bit image, each pixel holds a number from 0 (completely dark) to 255 (maximum brightness) for each color channel. An image is literally a grid of numbers and nothing more. All computer vision algorithms start from this grid.
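To make the "grid of numbers" idea concrete, here is a minimal sketch of a tiny color image stored as nested Python lists. The layout (rows of pixels, each pixel a red/green/blue triple) and the helper `brightness` are assumptions chosen for illustration; real libraries typically use packed arrays instead.

```python
# A 2-row, 3-column color image: image[row][col] = (red, green, blue),
# each channel an integer from 0 (dark) to 255 (maximum brightness).
image = [
    [(0, 0, 0), (255, 0, 0), (255, 255, 255)],
    [(0, 0, 255), (128, 128, 128), (0, 255, 0)],
]


def brightness(pixel):
    """Average of the three channels: a simple (assumed) grayscale measure."""
    red, green, blue = pixel
    return (red + green + blue) // 3


height, width = len(image), len(image[0])
print(height, width)                 # 2 3
print(image[0][1])                   # (255, 0, 0): a pure red pixel
print([[brightness(p) for p in row] for row in image])
# [[0, 85, 255], [85, 128, 85]]
```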

From Pixels to Features: Edge Detection

A classic first step in many traditional vision pipelines is finding edges: boundaries where pixel brightness changes sharply. Edges correspond to the outlines of objects, shadows, and surface markings. An algorithm called an edge detector slides a small mathematical filter across the image, computing how rapidly the brightness changes at each location. Where the change is large, an edge is marked; where it is gradual, nothing is marked. Edges are powerful because they tend to remain recognizable across different lighting conditions. A chair silhouetted against a bright window looks very different in raw pixel values than the same chair in a dim room, but the edge pattern that defines the chair's shape is similar in both cases. This is why edge detection was one of the earliest successful computer vision techniques.
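The idea can be sketched in a few lines of Python. This is a deliberately simple detector, not a production algorithm: it only measures left-to-right brightness change, and the function names, the tiny test scenes, and the relative threshold of 0.5 are all assumptions made for illustration.

```python
def horizontal_gradient(gray):
    """Absolute brightness change between each pixel's left and right neighbors."""
    height, width = len(gray), len(gray[0])
    grad = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(1, width - 1):          # border columns stay 0
            grad[r][c] = abs(gray[r][c + 1] - gray[r][c - 1])
    return grad


def edge_map(gray, fraction=0.5):
    """Mark pixels whose change is at least `fraction` of the largest change."""
    grad = horizontal_gradient(gray)
    peak = max(max(row) for row in grad)
    if peak == 0:                               # flat image: no edges at all
        return [[0] * len(row) for row in grad]
    return [[1 if g >= fraction * peak else 0 for g in row] for row in grad]


# The same scene (dark object on the left, bright wall on the right)
# under bright and dim lighting. The raw values differ a lot...
bright_room = [[40, 40, 40, 200, 200, 200]] * 3
dim_room = [[10, 10, 10, 50, 50, 50]] * 3

# ...but the edge maps are identical.
print(edge_map(bright_room)[0])   # [0, 0, 1, 1, 0, 0]
print(edge_map(dim_room)[0])      # [0, 0, 1, 1, 0, 0]
```

Using a threshold relative to the strongest change, rather than a fixed number, is what lets this sketch give the same answer for the bright and dim versions of the scene.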

Object Detection and Recognition

Modern robots need to do more than find edges: they need to identify what objects are present and where exactly they are located in the image. Object detection algorithms draw bounding boxes around objects and label each box with a category such as 'person,' 'car,' or 'cup.' Recognition goes a step further by identifying the specific instance: not just 'face' but 'this particular person's face.' For decades, recognition relied on hand-engineered features like SIFT (Scale-Invariant Feature Transform), which identifies distinctive keypoints in an image that remain stable under changes in scale and rotation. Since around 2012, deep learning, and specifically convolutional neural networks (CNNs), has become the dominant approach. A CNN learns its own features from millions of labeled training images, and by 2015 CNNs had surpassed reported human accuracy on some benchmarks, such as ImageNet image classification.
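The output of an object detector can be pictured as a list of labeled boxes. The sketch below is an assumed representation, not the output format of any particular detector: the `Detection` class, its field names, the pixel-coordinate convention, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str        # category such as "person" or "cup"
    score: float      # detector confidence between 0 and 1 (assumed field)
    x_min: int        # bounding box corners in pixel coordinates,
    y_min: int        # with (0, 0) at the top-left of the image
    x_max: int
    y_max: int

    def center(self):
        """Midpoint of the box, e.g. a point for a robot arm to aim at."""
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def confident(detections, min_score=0.5):
    """Keep only detections the (hypothetical) detector is fairly sure about."""
    return [d for d in detections if d.score >= min_score]


# Made-up detector output for one frame.
frame_detections = [
    Detection("person", 0.92, 100, 50, 300, 450),
    Detection("cup", 0.81, 400, 300, 460, 380),
    Detection("car", 0.20, 0, 0, 40, 30),
]

for det in confident(frame_detections):
    print(det.label, det.center())
# person (200.0, 250.0)
# cup (430.0, 340.0)
```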

Convolutional Neural Network (CNN)

A convolutional neural network is a type of deep learning model designed for image data. It applies learned filters to the image to detect increasingly abstract features: edges in early layers, shapes in middle layers, and objects in final layers.
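One stage of such a network can be sketched as three steps in a commonly used arrangement: slide a filter over the image, keep only the positive responses (an operation usually called ReLU), and shrink the result by pooling. The pure-Python sketch below hand-picks the filter values for illustration; in a real CNN those values are learned from training data, and the function names here are assumptions.

```python
def convolve(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum elementwise products.

    As in most deep learning libraries, the kernel is not flipped.
    """
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses, replace negative ones with 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Halve each dimension by keeping the largest value in every 2x2 block."""
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, len(feature_map[0]) - 1, 2)]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Hand-set filter that responds to dark-to-bright vertical edges.
# In a trained CNN these nine numbers would be learned, not chosen.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# 6x6 image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

features = max_pool_2x2(relu(convolve(image, vertical_edge)))
print(features)   # [[27, 27], [27, 27]]
```

A full network stacks many such stages with many filters each, so that later stages respond to combinations of the simpler patterns found earlier.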

Match each computer vision concept to its description.

Terms

Pixel
Edge detection
Object detection
Convolutional neural network
Bounding box

Definitions

The smallest unit of an image, storing a brightness value for each color channel
Drawing labeled bounding boxes around every object instance in a scene
A rectangle drawn around a detected object that describes its location in the image
A deep learning model that learns hierarchical image features automatically from training data
Finding locations in an image where brightness changes sharply, marking object boundaries


Challenges in Robot Vision

Real-world robot vision is harder than benchmark tests suggest. Lighting changes dramatically — a well-lit factory floor looks nothing like the same floor under emergency lighting. Objects can overlap, hiding parts of each other in what is called occlusion. Fast motion blurs images. Reflective surfaces like glass produce confusing mirror images. Outdoor robots must cope with rain, direct sunlight glare, and fog. Robots also need to run vision algorithms in real time — typically at 15 to 60 frames per second. This places strict limits on how computationally expensive the algorithms can be. Dedicated hardware accelerators called GPUs (Graphics Processing Units) and specialized chips called neural processing units (NPUs) make it possible to run powerful CNN-based detection at those speeds on a robot's onboard computer.
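A frame rate translates directly into a time budget per frame, which the following sketch computes. The function names and the 40 ms example detector are hypothetical, chosen only to illustrate the arithmetic.

```python
def frame_budget_ms(frames_per_second: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / frames_per_second


for fps in (15, 30, 60):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
# 15 fps -> 66.7 ms per frame
# 30 fps -> 33.3 ms per frame
# 60 fps -> 16.7 ms per frame


def keeps_up(processing_ms: float, frames_per_second: float) -> bool:
    """True if a pipeline taking `processing_ms` per frame can hold the rate."""
    return processing_ms <= frame_budget_ms(frames_per_second)


# Hypothetical detector that needs 40 ms per frame.
print(keeps_up(40, 15))   # True
print(keeps_up(40, 60))   # False
```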

What does an edge detector find in an image?

What advantage do convolutional neural networks have over hand-engineered features for object recognition?

A digital camera captures an image as a grid of _____, each storing a brightness value. Finding sharp transitions in that grid is called _____ detection. Modern robots use neural networks to learn features and identify objects automatically.

Be the Edge Detector

  1. Draw a simple 5x5 grid on paper and fill in pixel brightness values (use numbers 0-10) for a simple scene: a dark square on a bright background.
  2. Slide a 3x3 window across your grid. At each position, compute the difference between the average of the window's leftmost column and the average of its rightmost column.
  3. Write each difference into a new 5x5 grid, in the cell where the window was centered. The border cells stay blank because no full 3x3 window is centered there. Where are the large values (positive or negative)? Do they line up with the edges of your dark square?
  4. Explain in one sentence why high difference values indicate an edge.
  5. Describe one real-world situation where edge detection might fail and explain why.
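Once you have worked through the steps by hand, a short script can check your arithmetic. The sketch below is one possible reading of the exercise: the sample scene, the function name `column_difference`, and the choice to store each result at the window's center cell (leaving the border at 0) are assumptions.

```python
def column_difference(grid):
    """For each 3x3 window, left-column average minus right-column average.

    The result is written at the window's center; border cells stay 0
    because no full 3x3 window is centered there.
    """
    size = len(grid)
    out = [[0.0] * size for _ in range(size)]
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            left = sum(grid[r + dr][c - 1] for dr in (-1, 0, 1)) / 3
            right = sum(grid[r + dr][c + 1] for dr in (-1, 0, 1)) / 3
            out[r][c] = left - right
    return out


# Example scene on a 0-10 scale: a dark square (1) on a bright background (9).
scene = [
    [9, 9, 9, 9, 9],
    [9, 1, 1, 1, 9],
    [9, 1, 1, 1, 9],
    [9, 1, 1, 1, 9],
    [9, 9, 9, 9, 9],
]

for row in column_difference(scene):
    print([round(v, 1) for v in row])
# [0.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 5.3, 0.0, -5.3, 0.0]
# [0.0, 8.0, 0.0, -8.0, 0.0]
# [0.0, 5.3, 0.0, -5.3, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.0]
```

In this example the large positive and negative values sit on the left and right sides of the dark square. The middle column stays at 0 because this filter compares columns only, so it does not respond to the top and bottom sides.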