AI Agents & Automation

When Automation Goes Wrong

On May 6, 2010, automated trading systems reacting to one another caused the United States stock market to lose nearly a trillion dollars in value in about thirty-five minutes, then recover most of that loss by the end of the trading day. No human decided to crash the market. The systems, each following its own automated logic, amplified each other's reactions in an uncontrolled spiral. This event, called the Flash Crash, is one of the most dramatic examples of what can happen when automation goes wrong at scale. AI agent systems face the same categories of risk, and understanding them is just as important as understanding how the systems work when everything goes right.

Error Cascades: When One Bad Output Poisons the Rest

In a multi-agent pipeline, the output of one agent becomes the input of the next. This is powerful when everything is working — but fragile when it is not. If an early stage produces incorrect output and the system does not catch the error, that bad output flows downstream and every subsequent agent builds on a faulty foundation. This is called an error cascade. A research agent that retrieves the wrong study passes false facts to a writing agent, which builds a plausible-sounding but factually incorrect article, which a publishing agent then posts to a real website. At no point did any human review the content. Each agent did its local job competently — but the error introduced at step one corrupted the entire result.

Error Cascade

An error cascade occurs when an incorrect output from an early stage propagates through a pipeline, causing each subsequent stage to build on a faulty foundation. The final output may be confidently wrong — and hard to trace back to the original mistake.
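
The research-to-writing example can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the agent functions, the `Summary` type, and the `TRUSTED_SOURCES` list are all hypothetical stand-ins, not part of any real agent framework. It shows how a fluent downstream stage hides an upstream mistake, and how a single check between stages stops it.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    claim: str
    source_id: str


# Hypothetical set of sources that a human has already vetted.
TRUSTED_SOURCES = {"study-001", "study-002"}


def research_agent(topic: str) -> Summary:
    # Simulates the step-one mistake: the agent picks up the wrong source.
    return Summary(claim=f"{topic} is fully solved", source_id="blog-999")


def writing_agent(summary: Summary) -> str:
    # Does its local job well: turns whatever it receives into fluent prose.
    return f"Researchers have shown that {summary.claim}."


def check_summary(summary: Summary) -> None:
    # A gate between stages: reject output that cites an unvetted source.
    if summary.source_id not in TRUSTED_SOURCES:
        raise ValueError(f"unvetted source: {summary.source_id}")


def run_pipeline(topic: str, gated: bool) -> str:
    summary = research_agent(topic)
    if gated:
        check_summary(summary)  # stops here instead of passing bad data on
    return writing_agent(summary)
```

Calling `run_pipeline("fusion", gated=False)` returns a confident sentence built on the bad source, while `gated=True` raises a `ValueError` before the writing agent ever runs.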

Runaway Loops and Infinite Retries

Automation systems are often designed to retry failed tasks automatically — if an agent times out or returns an error, try again. This is sensible for transient failures, like a momentary network hiccup. But retry logic can become a trap. If an agent fails repeatedly for a systematic reason — a corrupt input file, a broken API it depends on, a task it was never capable of completing — a naive retry loop will attempt the same failing action hundreds or thousands of times. Each attempt may consume computing resources, trigger external API calls (with associated costs), or generate error logs so large the system becomes unmanageable. Without a maximum retry limit and a clear failure state, automation can thrash indefinitely.

Runaway Loop

A runaway loop occurs when a retry or repetition mechanism fires without a stopping condition, causing an agent to attempt the same failing action indefinitely. Always design retry logic with a maximum attempt count and a defined failure state.
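
A minimal sketch of retry logic with a maximum attempt count and a defined failure state might look like the following. The names `run_with_retries` and `PermanentFailure` are hypothetical, and the growing delay between attempts (exponential backoff) is an extra assumption that the lesson text does not require.

```python
import time


class PermanentFailure(Exception):
    """The defined failure state: the task used up all of its attempts."""


def run_with_retries(task, max_attempts=3, base_delay=0.01):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:  # a real system would catch narrower types
            last_error = error
            if attempt < max_attempts:
                # Assumed extra: wait longer after each failure (backoff).
                time.sleep(base_delay * 2 ** (attempt - 1))
    # The loop always ends: after max_attempts we stop and report failure.
    raise PermanentFailure(f"gave up after {max_attempts} attempts") from last_error
```

A task that fails for a systematic reason, such as a corrupt input file, is attempted exactly `max_attempts` times and then surfaces as a `PermanentFailure` that a human or a supervising process can handle.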

Unintended Side Effects

Agents that take actions in the real world (sending emails, posting content, making purchases, modifying databases) can cause unintended side effects when they misbehave. A communication agent given incorrect instructions might send the same email to ten thousand customers five times in a row. A purchasing agent with a loop in its logic might reorder a product hundreds of times. These are not hypothetical: real companies have accidentally sent mass duplicate emails and generated runaway orders because of bugs in automated systems. A bug that corrupts a document damages only a file, but a bug in an action-taking agent can damage relationships, incur real costs, and be impossible to fully undo. This is why consequential actions, meaning anything that affects the outside world, deserve extra scrutiny in automated systems.

Side Effects Are Hard to Undo

When an automated agent takes an action that affects the outside world — sending a message, making a purchase, posting content — mistakes are much harder to reverse than internal computation errors. Design action-taking agents with extra safeguards and human-in-the-loop review for high-stakes actions.
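
One possible shape for such a safeguard is a gate that sits in front of the real action. The sketch below is hypothetical throughout: `ActionGate`, its method names, and the `send` callback are illustrative assumptions. It combines two ideas: refusing to repeat an identical action, and holding high-stakes actions until a human approves them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ActionGate:
    send: Callable[[str, str], None]                # the real side effect
    sent_keys: set = field(default_factory=set)     # actions already taken
    pending: list = field(default_factory=list)     # waiting for a human

    def request(self, recipient: str, message: str, high_stakes: bool) -> str:
        key = (recipient, message)
        # Refuse to repeat an action that was already taken or queued.
        if key in self.sent_keys or key in self.pending:
            return "duplicate"
        if high_stakes:
            self.pending.append(key)
            return "awaiting approval"
        self._deliver(key)
        return "sent"

    def approve_all(self) -> None:
        # Intended to be called only after a human has reviewed the queue.
        while self.pending:
            self._deliver(self.pending.pop(0))

    def _deliver(self, key) -> None:
        self.send(*key)
        self.sent_keys.add(key)
```

With this gate in place, a buggy agent that requests the same email five times triggers one send and four `"duplicate"` results, and nothing marked high-stakes leaves the system until `approve_all` is called.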

Safeguards Engineers Build

Robust automated systems do not assume everything will work perfectly. They are designed expecting failure and include mechanisms to detect, contain, and recover from problems.

Validation checks inspect an agent's output before passing it downstream: Is it in the right format? Does it fall within expected ranges? Does it contain the required fields? A validation check that catches a malformed research summary before it reaches the writing agent stops an error cascade before it begins.

Circuit breakers monitor how often a particular agent or step is failing. If failures exceed a threshold, the circuit breaker pauses the workflow and alerts a human, preventing a runaway loop and giving engineers time to diagnose the problem.

Dead-letter queues hold failed tasks that could not be processed, rather than discarding them silently. Engineers can review the queue, understand what failed, fix the problem, and reprocess the tasks.

Audit logs record every action every agent takes, with timestamps, inputs, and outputs. When something goes wrong, the audit log is the first place investigators look to understand what happened and when.
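
Three of these safeguards can be sketched together in a few lines. Everything here is a hypothetical illustration: the `Workflow` class and its attribute names are assumptions, and the breaker trips once the number of consecutive failures reaches the threshold, which is only one of several reasonable policies.

```python
from datetime import datetime, timezone


class CircuitOpen(Exception):
    """Raised when the breaker has tripped and the workflow is paused."""


class Workflow:
    def __init__(self, step, failure_threshold=3):
        self.step = step                  # the agent call being protected
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.dead_letters = []            # failed tasks kept for later review
        self.audit_log = []               # one record per attempted task

    def _record(self, task, outcome):
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "input": task,
            "outcome": outcome,
        })

    def submit(self, task):
        if self.consecutive_failures >= self.failure_threshold:
            # Breaker is open: stop calling the step until a human resets it.
            raise CircuitOpen("workflow paused; alert a human")
        try:
            result = self.step(task)
        except Exception as error:
            self.consecutive_failures += 1
            self.dead_letters.append(task)   # keep it, do not drop it silently
            self._record(task, f"failed: {error}")
            return None
        self.consecutive_failures = 0
        self._record(task, f"ok: {result}")
        return result

    def reset(self):
        # Called by an engineer once the underlying problem is fixed.
        self.consecutive_failures = 0
```

After a fix, an engineer could call `reset()` and feed the contents of `dead_letters` back through `submit`, using `audit_log` to confirm what happened to each task and when.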

Match each safeguard to the failure mode it is designed to prevent or contain.

Terms

Validation check on agent output
Circuit breaker with failure threshold
Human-in-the-loop approval step
Audit log of all agent actions

Definitions

Enables investigators to trace exactly what each agent did when diagnosing a failure
Halts a runaway retry loop and alerts a human when failures exceed a limit
Prevents unintended side effects by requiring review before a consequential action
Stops an error cascade by catching bad output before it flows downstream

Flashcards

A research agent retrieves the wrong data source, and neither the writing agent nor the publishing agent detects the error. The final article contains false information. What failure mode is this?

Why do automation engineers add maximum retry limits to systems that automatically retry failed tasks?

Failure Mode Analysis

Scenario: An automated school communication system uses three agents in a pipeline.

Agent 1 (Data Agent): Reads the attendance database and identifies students who missed class.
Agent 2 (Writing Agent): Drafts a personalized absence notification for each student's parent.
Agent 3 (Communication Agent): Sends the notification emails.

Step 1: Describe an error cascade that could occur in this system. What error in Agent 1's output could cause a serious problem by the time Agent 3 acts?
Step 2: Describe a runaway loop scenario. What retry situation could cause Agent 3 to send hundreds of duplicate emails?
Step 3: Propose one safeguard at each of the three agent stages that would prevent or catch the failure you described.
Step 4: Explain in two sentences why a human-in-the-loop approval step between Agent 2 and Agent 3 might be worth the added delay for this particular system.