Robotics & Embodied AI

⏱ About 20 min · 20 XP

Learning from Data

A robot navigating a warehouse must recognize hundreds of product types, estimate where each one sits in three-dimensional space, and predict whether a given grasp will succeed before committing to it. None of these tasks is practical to specify with hand-written rules; in practice, all three are learned from data. Supervised learning, the branch of machine learning in which a model is trained on labeled input-output pairs, is the engine behind most robot perception systems deployed today. This lesson develops supervised learning as it applies to robotics, with attention to the data pipelines, model architectures, and failure modes specific to physical robots.

Supervised Learning: The Formal Setup

In supervised learning, we have a dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} where each x_i is an input observation and each y_i is the corresponding label or target value. The goal is to learn a function f such that f(x) approximates y for new inputs x not in D.

For a robot, the inputs might be:

- A 640x480 RGB image from a wrist-mounted camera
- A point cloud of 10,000 3D measurements from a LiDAR sensor
- A 6-axis force-torque reading at the robot's wrist
- A concatenated vector of joint angles and joint velocities

The output targets might be:

- The class label of the object in the image (classification)
- The 6-DoF pose of an object, that is, position plus orientation in 3D space (regression)
- The probability that a proposed grasp will succeed (binary classification)
- The next joint-velocity commands to track a trajectory (regression)

The critical move in applying supervised learning to robotics is the translation: you must define precisely what the input representation will be, what the output target means, how labels will be obtained, and what loss function will measure prediction quality. Errors in this translation, such as choosing the wrong input representation or using noisy or systematically biased labels, propagate through the learned model and cause failures that are very difficult to diagnose after the fact.
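The formal setup can be made concrete with a very small sketch. It assumes a toy problem that is not from the lesson: a one-dimensional sensor reading x and a noisy linear target y, with f(x) = w*x + b fitted by gradient descent on a mean-squared-error loss. The data generator, the learning rate, and the function names are all illustrative assumptions; a real perception model would use a neural network and a framework, but the four ingredients (inputs, targets, loss, held-out evaluation) are the same.

```python
import random

def make_dataset(n, seed=0):
    """Toy dataset D = {(x_i, y_i)}: x is a 1-D reading, y a noisy linear target."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        # Hypothetical ground-truth relation plus label noise (assumption for this sketch).
        y = 2.0 * x + 0.5 + rng.gauss(0.0, 0.05)
        data.append((x, y))
    return data

def mse(w, b, data):
    """Loss function: mean squared error of f(x) = w*x + b on the given pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fit(data, lr=0.1, epochs=500):
    """Learn w and b by full-batch gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

train = make_dataset(200, seed=0)   # the dataset D
test = make_dataset(50, seed=1)     # new inputs x not in D
w, b = fit(train)
print(f"w={w:.2f}, b={b:.2f}, test MSE={mse(w, b, test):.4f}")
```

The held-out `test` set is what stands in for "new inputs x not in D": a low training loss alone says nothing about whether f generalizes.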

Representation Is Everything

The choice of input representation often matters more than the choice of model architecture. A point cloud and an RGB image of the same scene contain different information and require different inductive biases. Before choosing a model, ask: does my representation contain the information needed to predict my output? If the information is not in the input, no model can extract it.
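One rough way to check whether a representation carries the needed information is to look for label conflicts: examples that become identical under the representation but have different labels. The sketch below uses made-up observations (apparent size in pixels and depth in centimeters) and hypothetical helper names; it is a diagnostic idea under those assumptions, not a standard library routine.

```python
from collections import defaultdict

def label_conflict_rate(examples, represent):
    """Fraction of examples whose representation is shared with a differently labeled example."""
    labels_by_input = defaultdict(set)
    for raw, label in examples:
        labels_by_input[represent(raw)].add(label)
    conflicted = sum(
        1 for raw, _ in examples if len(labels_by_input[represent(raw)]) > 1
    )
    return conflicted / len(examples)

# Hypothetical raw observations: (apparent size in pixels, depth in cm).
# Label: the true physical size class of the object.
examples = [
    ((40, 50), "small"),
    ((40, 100), "large"),   # same apparent size as above, but twice as far away
    ((80, 50), "large"),
    ((20, 50), "small"),
]

def size_only(raw):
    """Representation that drops depth (as a single RGB view effectively does)."""
    return (raw[0],)

def size_and_depth(raw):
    """Representation that keeps depth."""
    return raw

print(label_conflict_rate(examples, size_only))       # 0.5
print(label_conflict_rate(examples, size_and_depth))  # 0.0
```

With depth dropped, two of the four examples collide on the same input with different labels, so no model can be right on both. With continuous sensor data exact collisions are rare, so in practice one would compare nearby inputs rather than identical ones.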

How Robots Obtain Labeled Data

Data collection is the most expensive and robot-specific part of the supervised learning pipeline. Three dominant strategies exist.

Strategy 1: Human annotation of recorded observations. A robot records sensor streams (camera, LiDAR, force-torque) while a human teleoperates it through a task. Humans then annotate the recordings: drawing bounding boxes around objects, labeling grasp outcomes as success or failure, marking where the robot's gripper was in each frame. This is high-quality but labor-intensive. Companies like Boston Dynamics and large warehouse robotics firms employ teams of annotators.

Strategy 2: Simulation-based labeling. Labels that are expensive to collect physically, such as the exact 6-DoF pose of every object in a scene, can be obtained for free in simulation, where the ground truth is known by construction. The robot operates in a physics simulator (Gazebo, MuJoCo, Isaac Sim) and every label is automatically generated. The resulting dataset is large and perfectly labeled, but may not transfer to the real world because simulated sensors differ from real ones. This gap is called the sim-to-real domain gap and is addressed in Lesson 5.

Strategy 3: Self-supervised and weakly supervised labeling. In some settings, the robot can generate its own labels from structure in the data, without human annotation. A robot arm that attempts thousands of grasps can label each attempt as success or failure from a force sensor reading, with no human needed. A camera-equipped robot can use the known motion of its camera between frames to compute depth labels from stereo or motion cues. These self-supervised approaches allow scaling to millions of examples that would be prohibitively expensive to label by hand.
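The grasp-labeling case in Strategy 3 can be sketched in a few lines. Everything specific here is an assumption made for illustration: the `GraspAttempt` record, the force values, and the two thresholds are hypothetical, and a real system would tune them against a small hand-checked sample.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GraspAttempt:
    # Hypothetical log record: gripper force samples (newtons) recorded during the lift.
    attempt_id: int
    lift_forces_n: List[float]

def label_grasp(attempt, min_force_n=1.0, min_fraction=0.8):
    """Return 1 (success) if the gripper sensed load for most of the lift, else 0."""
    if not attempt.lift_forces_n:
        return 0
    loaded = sum(1 for f in attempt.lift_forces_n if f >= min_force_n)
    return int(loaded / len(attempt.lift_forces_n) >= min_fraction)

attempts = [
    GraspAttempt(0, [2.1, 2.0, 2.2, 2.1, 1.9]),  # object held throughout
    GraspAttempt(1, [2.0, 1.8, 0.1, 0.0, 0.0]),  # object slipped mid-lift
    GraspAttempt(2, [0.0, 0.1, 0.0, 0.0, 0.1]),  # gripper closed on nothing
]

# Labels are produced by the rule above, with no human in the loop.
dataset = [(a.attempt_id, label_grasp(a)) for a in attempts]
print(dataset)  # [(0, 1), (1, 0), (2, 0)]
```

The labeling rule is itself a design decision: a threshold set too low or too high yields systematically biased labels, which is exactly the kind of translation error that is hard to diagnose later.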

Match each robotics perception task to the correct supervised learning output type it requires.

Terms

Identifying whether a scene contains a fire extinguisher
Determining the exact position and orientation of a bolt in 3D space
Assigning each pixel of a camera image to one of 40 semantic categories
Predicting the joint torques needed to keep a legged robot balanced

Definitions

Regression outputting a 6-dimensional pose vector
Binary classification outputting a probability in [0,1]
Dense classification outputting a per-pixel label map
Regression outputting a continuous multi-dimensional torque vector


Key Architectures for Robot Perception

The choice of neural network architecture depends heavily on the input modality.

Convolutional Neural Networks (CNNs) remain the backbone of image-based robot perception. Their key inductive biases, that nearby pixels are more related than distant ones and that the same feature detector applies everywhere in the image, are well matched to visual data. Architectures like ResNet-50, EfficientDet, and YOLO are widely used on platform types ranging from surgical robots (such as the da Vinci system) to autonomous vehicles (such as Waymo's) to quadrupeds (such as Spot).

Point cloud processing requires architectures designed for unordered 3D data. PointNet (2017, Stanford) was a landmark: it processes each 3D point independently with a shared network, then aggregates the per-point features with a symmetric function (max pooling) to achieve invariance to point order. Later approaches such as PointNet++ and 3D sparse convolutions improved the capture of local structure.

Transformer-based architectures have emerged as general-purpose backbones. Vision Transformers (ViT) treat an image as a sequence of patches and apply self-attention across them. More recently, language-conditioned transformer policies such as RoboAgent, and vision-language-action models such as RT-2 (Google DeepMind, 2023), which builds on a large vision-language transformer backbone, aim to give robots semantic understanding that generalizes across tasks: a robot that understands what 'pick up the apple' means across many different apples and scenes.

For time-series data (joint angles over time, force profiles during manipulation), recurrent architectures (LSTM, GRU) and temporal convolutional networks are used, though transformers are increasingly replacing them here as well.
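The order-invariance idea behind PointNet, described above, can be illustrated without any deep learning framework. In this sketch the learned per-point network is replaced by a fixed, hand-written feature map (an assumption made purely for illustration); the part that matters is the element-wise max across points, which gives the same result for any ordering of the cloud.

```python
import random

def point_features(p):
    """Stand-in for a shared per-point network: a fixed, hand-written feature map."""
    x, y, z = p
    return [x + y, y * z, max(0.0, x - z), x * x + y * y + z * z]

def global_feature(points):
    """Aggregate per-point features with an element-wise max (a symmetric function)."""
    feats = [point_features(p) for p in points]
    return [max(column) for column in zip(*feats)]

rng = random.Random(0)
cloud = [
    (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(100)
]

# Same points, different order.
shuffled = cloud[:]
rng.shuffle(shuffled)

print("order-invariant:", global_feature(cloud) == global_feature(shuffled))
```

Because each point is processed independently and max does not depend on argument order, the global feature is identical for both orderings. A model that instead flattened the cloud into one long vector would see the shuffled cloud as a different input.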

RT-2: When Language Meets Robot Perception

Google DeepMind's RT-2 (2023) fine-tuned a large vision-language model (originally trained on internet-scale text and images) on robot demonstration data. The result was a robot that could follow novel natural-language instructions — 'pick up the object that is used to clean teeth' — without being explicitly trained on toothbrush grasping. This illustrates how large pretrained models transfer semantic knowledge to physical robot tasks.

A robotics team wants to train a model to detect whether a robot's gripper has successfully grasped an object. They plan to use force-torque sensor readings as input. What is the most important question they must answer before choosing a model architecture?

Fill in the blanks: In supervised learning for robotics, the three major data collection strategies are: human ____ of recorded sensor streams, simulation where labels are generated automatically, and ____ learning where the robot generates its own labels from outcomes.

A CNN trained to detect objects on a warehouse conveyor belt achieves 98% accuracy in testing. When deployed, it fails to detect objects placed at the edge of the belt and drops to 71% overall accuracy. What is the most likely cause?

Design a Robot Perception Dataset

You are the lead engineer for a robot that must sort recyclables from trash on a home countertop. The robot has a top-down RGB camera and a wrist-mounted RGB-D camera.

  1. Define the supervised learning problem. Write the function signature: f(input) → output. Be specific about input dimensions and output format.
  2. Design your data collection strategy. How will you collect 10,000 labeled training examples? Describe the specific annotation process and who or what provides the labels.
  3. Identify three categories of objects that will be hardest to label correctly and explain why.
  4. Describe one failure mode your trained model might exhibit in a real home that would not appear in lab testing. What property of the training data would cause this failure?
  5. Propose a metric beyond classification accuracy that better reflects whether the robot is actually useful (hint: consider what happens when it misclassifies a recyclable as trash, versus when it misclassifies trash as recyclable).
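As one possible starting point for the final step, the sketch below compares plain accuracy with an average cost per item in which the two kinds of error are weighted differently. The cost values and the two example models are invented for illustration; choosing and justifying the real costs is part of the exercise.

```python
def accuracy(pairs):
    """Fraction of (true, predicted) pairs where the prediction is correct."""
    return sum(t == p for t, p in pairs) / len(pairs)

def expected_cost(pairs, cost):
    """Average cost per item, given (true, predicted) pairs and a cost table."""
    return sum(cost[(t, p)] for t, p in pairs) / len(pairs)

# Hypothetical costs: contaminating the recycling bin is assumed to be
# five times worse than throwing away one recyclable item.
COST = {
    ("recyclable", "recyclable"): 0.0,
    ("trash", "trash"): 0.0,
    ("recyclable", "trash"): 1.0,
    ("trash", "recyclable"): 5.0,
}

# Two made-up models with the same number of errors but different error types.
model_a = (
    [("recyclable", "trash")] * 10
    + [("recyclable", "recyclable")] * 40
    + [("trash", "trash")] * 50
)
model_b = (
    [("trash", "recyclable")] * 10
    + [("recyclable", "recyclable")] * 50
    + [("trash", "trash")] * 40
)

for name, pairs in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(pairs), expected_cost(pairs, COST))
```

Both models score 0.9 on accuracy, yet under these assumed costs model B is five times more expensive per item (0.5 versus 0.1), which accuracy alone cannot show.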