The Autonomy Stack
A self-driving car is not a single algorithm. A surgical robot is not a single policy. Autonomous systems of any complexity are built from layers of interacting components — each handling a different aspect of the problem, each receiving inputs from the layer below and sending outputs to the layer above. This layered architecture is called the autonomy stack. Understanding the stack matters for two reasons: it shows how learning components fit into larger systems, and it reveals where each component can fail and how those failures propagate. This lesson maps the standard autonomy stack, explains what each layer does, and shows how learned and classical components divide the labor.
The Three-Layer Architecture: Perceive, Plan, Act
At the highest level, autonomous robot systems follow a perceive-plan-act loop. Each layer has a distinct function and operates at a distinct timescale.
Perception layer: takes raw sensor data (camera images, LiDAR point clouds, IMU readings, force-torque measurements) and produces a structured representation of the world: a map, a set of detected objects with poses, semantic labels, depth estimates, or ego-motion estimates. Perception runs continuously at sensor rate, typically 10-100 Hz. This layer is increasingly learned: deep CNNs for object detection, neural depth estimation from stereo cameras, learned semantic segmentation.
Planning layer: takes the structured world representation from perception and generates a sequence of actions or waypoints that achieves the robot's goal. Planning subdivides into task planning (what high-level actions to take: pick up object A, place on surface B, navigate to room C), motion planning (how to move the robot's joints or base through free space without collisions), and trajectory optimization (computing smooth, dynamically feasible paths). Planning operates at a lower frequency than perception, typically 1-10 Hz. This layer is a mix: task planning is increasingly learned (large language models are used to generate task plans), motion planning is largely classical (RRT, PRM, CHOMP), and trajectory optimization uses both.
Control layer: takes planned trajectories and converts them into actuator commands (motor voltages, joint torques) that the hardware executes. This layer runs at the highest frequency, 500-1000 Hz for joint-level control, because it must close feedback loops fast enough to maintain stability. Classical PID controllers and model-predictive control (MPC) dominate here, though learned controllers (like RL-trained locomotion policies) increasingly handle the control layer for highly dynamic tasks.
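The three rates described above can be sketched as a single fixed-tick loop in which each layer fires at its own interval. This is a minimal illustration, not how a real robot middleware schedules work: the rates (1000, 50, and 5 Hz) are picked from the ranges quoted above so that they divide evenly, and the string "world model" and "plan" values are placeholders.

```python
from collections import Counter

# Illustrative rates in Hz, chosen from the ranges quoted in the text (assumed values).
RATES_HZ = {"control": 1000, "perception": 50, "planning": 5}
BASE_HZ = 1000  # the loop ticks at the fastest (control) rate


def run_stack(duration_s: float) -> Counter:
    """Run a toy perceive-plan-act loop and count how often each layer fires."""
    calls = Counter()
    world, plan = None, None
    for tick in range(int(duration_s * BASE_HZ)):
        if tick % (BASE_HZ // RATES_HZ["perception"]) == 0:
            world = f"world@{tick}"        # placeholder for a structured world model
            calls["perception"] += 1
        if tick % (BASE_HZ // RATES_HZ["planning"]) == 0:
            plan = f"plan from {world}"    # planner reads the latest world model
            calls["planning"] += 1
        # Control runs on every tick and tracks the most recent plan.
        _command = f"track {plan}"
        calls["control"] += 1
    return calls


calls = run_stack(1.0)
# Over one simulated second: 1000 control, 50 perception, and 5 planning updates.
```

The point of the sketch is the data flow: the controller never waits for the planner, it keeps tracking the last plan it was given, and the planner in turn works from the most recent world model.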
The layers of the autonomy stack operate at very different timescales: perception at sensor rate (~30 Hz), planning at decision rate (~5 Hz), control at actuator rate (~1000 Hz). This separation is not arbitrary — it reflects the physical timescales of the robot's dynamics and the computational cost of each layer. A planning algorithm running at 1000 Hz would waste computation; a controller running at 5 Hz would be too slow to stabilize a joint.
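The claim that a 5 Hz controller is too slow to stabilize a joint can be checked with a toy simulation. The sketch below assumes a unit-inertia joint with no friction, a PD controller with illustrative gains (`kp`, `kd`), and a 1 kHz physics step; all of these are assumptions made for the example, not parameters of any real robot. The only thing that changes between runs is how often the controller recomputes its torque.

```python
def simulate_joint(control_hz: int, duration_s: float = 2.0) -> float:
    """Return the final |position error| of a unit-inertia joint under PD control."""
    sim_hz = 1000                      # physics step rate (assumed)
    dt = 1.0 / sim_hz
    hold = sim_hz // control_hz        # physics steps between control updates
    kp, kd = 400.0, 40.0               # illustrative PD gains (assumed)
    pos, vel, torque = 1.0, 0.0, 0.0   # start 1 rad away from the target at 0
    for step in range(int(duration_s * sim_hz)):
        if step % hold == 0:           # the controller only acts at its own rate
            torque = -kp * pos - kd * vel
        vel += torque * dt             # unit inertia: acceleration equals torque
        pos += vel * dt
    return abs(pos)


fast_error = simulate_joint(control_hz=1000)  # settles close to the target
slow_error = simulate_joint(control_hz=5)     # same gains, stale torque: diverges
```

With the same gains, the 1000 Hz loop settles while the 5 Hz loop holds each torque for 200 ms, overshoots, and grows without bound. Lowering the gains would restore stability at 5 Hz, but only by making the joint sluggish, which is the trade-off the timescale separation avoids.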
Key Components and Where Learning Enters
Within each layer, specific components handle specific sub-problems. Here is where learned versus classical methods currently dominate, and why.
State estimation (perception layer): inferring the robot's own position and orientation in the world (localization) and building a map of the environment (mapping). Classical methods (SLAM, Simultaneous Localization and Mapping, using EKF or particle filters) have been highly reliable in structured environments. Learned methods (learned odometry, neural SLAM) have improved robustness in visually complex or GPS-denied environments but have not yet displaced classical SLAM in safety-critical applications.
Object detection and pose estimation (perception layer): identifying what objects are in the scene and where they are in 3D space. Deep learning has dominated this component since the mid-2010s, after deep networks overtook hand-engineered features on image benchmarks starting in 2012. Models and toolkits like YOLO, Detectron2, and FoundationPose outperform classical approaches on general objects.
Motion planning (planning layer): finding collision-free paths through configuration space. Classical planners are fast, well understood, and used in almost all production manipulation systems. Sampling-based planners (RRT, RRT*, PRM) are probabilistically complete: if a path exists, the probability of finding it approaches one as the number of samples grows. Optimization-based planners such as CHOMP instead refine an initial trajectory by gradient descent, which is fast but can get stuck in local minima. Learned planners have shown promise for higher-dimensional spaces but lack formal safety guarantees, limiting their deployment in certified systems.
Task planning (planning layer): deciding what sequence of sub-tasks to execute. Large language models (LLMs) have emerged as powerful task planners: given a natural language goal and a description of available primitives, an LLM can often generate valid task plans. SayCan (Google, 2022) grounded LLM task plans by weighting each candidate action with a learned affordance score, an estimate of how likely the robot is to succeed at that action from its current state, so that the chosen actions were physically feasible.
Low-level control (control layer): computing joint torques or motor commands from desired trajectories. Classical model-based control (MPC, impedance control) dominates in industrial and medical robotics. RL-trained control increasingly leads for highly dynamic tasks (legged locomotion, aerial acrobatics, high-speed manipulation) where classical models are insufficient.
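The SayCan-style grounding mentioned under task planning can be sketched in a few lines: score each candidate skill by multiplying how useful the language model thinks it is by how feasible it is right now, then pick the best. The skill names and all numbers below are made up for illustration, and `pick_skill` is a hypothetical helper, not part of any SayCan release.

```python
def pick_skill(llm_scores: dict[str, float], affordances: dict[str, float]) -> str:
    """Choose the skill maximizing (LLM usefulness) x (feasibility in the current state)."""
    # A skill with no affordance estimate is treated as infeasible (score 0).
    return max(llm_scores, key=lambda s: llm_scores[s] * affordances.get(s, 0.0))


# Hypothetical scores for the instruction "clean up the spill".
llm_scores = {"pick up sponge": 0.60, "go to sink": 0.30, "pick up apple": 0.10}
# Hypothetical affordances: the sponge is not within reach yet.
affordances = {"pick up sponge": 0.05, "go to sink": 0.90, "pick up apple": 0.80}

chosen = pick_skill(llm_scores, affordances)
# Products: sponge 0.03, sink 0.27, apple 0.08, so "go to sink" is chosen.
```

The language model alone would pick "pick up sponge"; the affordance term vetoes it because the robot cannot currently succeed at that action.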
End-to-End Learning vs. Modular Stacks
The modular stack described above separates perception, planning, and control into independently engineered components. An alternative architecture is end-to-end learning: train a single neural network that maps raw sensor inputs directly to actuator commands, bypassing the intermediate representations entirely.
End-to-end learning has attractive properties. Intermediate representations designed by engineers may discard useful information, whereas a learned end-to-end system can discover its own representations. Errors do not compound across module boundaries in the same way. And for some tasks, end-to-end systems have outperformed modular ones.
But end-to-end systems have significant drawbacks. They are opaque: diagnosing why the robot took a particular action requires interpretability tools that are still immature. They are data-hungry: learning perception, planning, and control jointly requires far more data than learning each component separately on task-specific datasets. And they are hard to verify: certifying that a single monolithic neural network will behave safely under all conditions is a much harder problem than verifying modular components with well-defined specifications.
The field currently favors hybrid architectures: classical components where safety, interpretability, or formal verification is needed, and learned components where classical methods fall short (perception, dynamic control, task planning). Waymo's autonomy stack, for example, has been described as using deep learning for perception and for predicting the behavior of other agents, classical planning for trajectory generation, and classical control for precise actuation: a carefully engineered division of labor.
In a modular stack, errors in an upstream layer can corrupt every layer downstream of it. A perception module that misdetects a pedestrian as background passes the planner a map that shows free space where the pedestrian is standing, and the planner then plans a trajectory straight through it. The planner is not wrong given its inputs; the perception failure cascades. Robust autonomy stacks include cross-layer sanity checks, uncertainty estimates, and fallback behaviors precisely to contain these cascades.
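The cascade, and one way to contain it, can be shown on a one-dimensional corridor. In this sketch the planner trusts whatever perception reports, and a separate cross-layer check clamps the plan using a raw range reading that does not pass through the perception model. The types, function names, and distances are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorldModel:
    obstacle_distances_m: list[float]   # what perception believes is ahead


def plan_travel(world: WorldModel, goal_m: float, margin_m: float = 1.0) -> float:
    """Planner: drive to the goal, or stop short of the nearest reported obstacle."""
    nearest = min(world.obstacle_distances_m, default=float("inf"))
    return max(0.0, min(goal_m, nearest - margin_m))


def sanity_check(planned_m: float, raw_range_m: float, margin_m: float = 1.0) -> float:
    """Cross-layer check: clamp the plan using an independent raw range reading."""
    return max(0.0, min(planned_m, raw_range_m - margin_m))


# Perception misses a pedestrian 4 m ahead and reports free space.
faulty_world = WorldModel(obstacle_distances_m=[])
naive_plan = plan_travel(faulty_world, goal_m=10.0)        # 10.0: through the pedestrian
checked_plan = sanity_check(naive_plan, raw_range_m=4.0)   # 3.0: stops 1 m short
```

The planner behaves correctly in both cases; only the second pipeline has a path by which evidence that bypasses the failed module can override the plan.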
A self-driving car's autonomy stack misclassifies a large white truck against a bright sky as 'sky background' rather than 'vehicle.' This is a perception failure. What is the immediate downstream consequence for the rest of the stack?
A team debates whether to use an end-to-end neural network (raw camera → steering angle) or a modular stack for a robot that must be certified for use in a children's hospital. What is the strongest argument for the modular approach in this context?
Architect a Robot Autonomy Stack
- You are the chief robotics engineer for a hospital patient-transport robot that must navigate corridors, call elevators, avoid patients and staff, and deliver medications to rooms. The robot must be safe enough for hospital deployment.
- Step 1: Draw the three-layer stack (Perception → Planning → Control) and populate each layer with the specific components your robot needs. For each component, write: what it takes as input, what it outputs, and at what frequency it runs.
- Step 2: For each component, decide: classical algorithm, learned model, or hybrid? Justify your choice in one sentence, citing the principles from this lesson.
- Step 3: Identify the two component failures that would be most dangerous if they occurred simultaneously. Describe the failure cascade — how does each failure propagate through the stack?
- Step 4: Your hospital's certification authority requires you to demonstrate that the robot will never collide with a patient even if perception fails. What architectural safeguard do you add, and in which layer does it live?
- Step 5: A researcher proposes replacing your entire stack with a single end-to-end transformer trained on navigation demonstrations from 1,000 hospital environments. Write a two-paragraph response evaluating this proposal for your specific deployment context.
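One possible way to record the Step 1 and Step 2 answers is as a small table of component specifications, which also makes it easy to check that every input is actually produced by something. The sketch below is only a scaffold: the `Component` fields mirror what Step 1 asks for, and the two rows are placeholders, not a recommended design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    layer: str                 # "perception", "planning", or "control"
    inputs: tuple[str, ...]
    output: str
    rate_hz: float
    method: str                # "classical", "learned", or "hybrid"


def unmet_inputs(stack: list[Component], sensors: set[str]) -> dict[str, set[str]]:
    """Report, per component, inputs that no sensor or other component produces."""
    produced = sensors | {c.output for c in stack}
    return {c.name: set(c.inputs) - produced
            for c in stack if set(c.inputs) - produced}


# Two placeholder rows (hypothetical names and rates).
stack = [
    Component("obstacle_detector", "perception", ("lidar_scan",), "obstacle_list", 20, "learned"),
    Component("local_planner", "planning", ("obstacle_list", "robot_pose"), "trajectory", 5, "classical"),
]
gaps = unmet_inputs(stack, sensors={"lidar_scan"})
# gaps == {"local_planner": {"robot_pose"}}: nothing in this stack estimates the pose yet.
```

A non-empty result points at a missing component, here a state estimator, before any of the later steps are attempted.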