AI Safety, Alignment & Ethics


Red-Teaming and Stress-Testing

Standard evaluation runs a system through a fixed test set and reports accuracy. Red-teaming takes a different approach: it sends humans — often adversarial, creative, motivated humans — to actively search for failure modes the developers did not anticipate. The name comes from military exercises where a 'red team' plays the adversary to stress-test plans and defenses. In AI safety, red-teaming is the practice of trying to break a system before it is deployed, so that failures are found in a controlled setting rather than in the real world.

What Red-Teaming Looks For

Red-teamers for AI systems are looking for behaviors the system should not exhibit, and for conditions that elicit those behaviors.

For language models, the most commonly probed failure modes include: generating instructions for creating weapons, synthesizing dangerous chemicals, or planning violence; producing content that sexualizes minors; generating targeted harassment or credible threats against real people; spreading clearly false information presented as fact; and being manipulated through prompt injection or jailbreak techniques into ignoring safety constraints.

For vision systems and autonomous agents, red-teamers probe for: adversarial input vulnerabilities; failure modes under unusual lighting, weather, or sensor conditions; unexpected behaviors at distribution boundaries; and interactions between components that produce emergent failures no single component exhibits.

For all AI systems, red-teamers also look for: subtle biases that aggregate evaluation misses; edge cases in real-world deployment contexts that developers did not consider; failure modes that only appear after extended interaction; and ways the system can be abused for purposes other than its intended use.

The Value of Adversarial Creativity

Standard QA testing asks: does the system do what it is supposed to do? Red-teaming asks: what can we make it do that it should not? This requires a fundamentally different mindset — not checking compliance with a specification, but actively imagining misuse, edge cases, and creative attacks. The best red-teamers have domain expertise, adversarial creativity, and psychological distance from the product. Developers are often poor red-teamers of their own systems because they are optimistic about their creation and blind to their assumptions.

Red-Teaming Methodologies

Red-teaming is not purely improvised. It has developed structured methodologies.

Threat modeling comes first. It asks: who might attack this system, with what capabilities, and with what motivation? A model used for customer service faces different threats than a model embedded in a national security context. Threat modeling shapes which attack surfaces red-teamers prioritize.

Attack trees decompose threats hierarchically. The root of the tree is an outcome the attacker wants (e.g., 'extract training data'). Branches are alternative methods (for example, membership inference or model inversion). Leaves are specific technical actions. Attack trees make coverage explicit: red-teamers can see which branches they have and have not tested.

Jailbreak taxonomy: for language models, researchers have classified jailbreak techniques into categories: role-playing (asking the model to pretend to be a different AI with no restrictions), hypothetical framing (asking the model to describe what a character in a story would do), encoding tricks (asking for information in base64 or another encoding to bypass filters), and multi-step decomposition (breaking a harmful request into individually harmless sub-requests). Taxonomies help red teams ensure systematic coverage.

Automated red-teaming augments human red-teamers with AI systems trained to generate adversarial prompts. Language models fine-tuned to find failure modes can explore a much larger space of inputs than human teams working alone, though human judgment is still required to evaluate the outputs.

After each red-teaming exercise, findings are documented in a structured format: the attack, the behavior observed, the severity, the conditions required, and recommended mitigations. This documentation feeds directly into the next training or fine-tuning cycle.
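The way an attack tree makes coverage explicit can be shown with a small data structure. The following is a minimal sketch, not a standard tool: the `AttackNode` class, its `tested` flag, and the leaf names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AttackNode:
    """One node in an attack tree: a goal, a method, or a concrete action."""
    name: str
    children: list["AttackNode"] = field(default_factory=list)
    tested: bool = False  # only meaningful on leaves

    def leaves(self) -> list["AttackNode"]:
        """Return every leaf (specific technical action) under this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def coverage(self) -> float:
        """Fraction of leaves under this node that have been tested."""
        leaves = self.leaves()
        return sum(leaf.tested for leaf in leaves) / len(leaves)

    def untested(self) -> list[str]:
        """Names of leaves that no red-teamer has tried yet."""
        return [leaf.name for leaf in self.leaves() if not leaf.tested]


# Root = attacker's goal, branches = alternative methods, leaves = actions.
# The leaf descriptions are illustrative placeholders.
tree = AttackNode("extract training data", [
    AttackNode("membership inference", [
        AttackNode("compare model loss on candidate records", tested=True),
        AttackNode("probe confidence on near-duplicate records"),
    ]),
    AttackNode("model inversion", [
        AttackNode("reconstruct inputs from output scores"),
        AttackNode("optimize inputs toward a target class"),
    ]),
])

for branch in tree.children:
    print(f"{branch.name}: {branch.coverage():.0%} of leaves tested")
print("Still untested:", tree.untested())
```

Reporting coverage per branch, as in the loop above, shows at a glance that one whole method has not been attempted, which a flat list of tried prompts would hide.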

Match each red-teaming concept to its defining description.

Terms

Threat modeling
Attack tree
Jailbreak via role-playing
Automated red-teaming
Multi-step decomposition attack

Definitions

Hierarchical breakdown of an attacker's goal into alternative methods and specific technical steps
Systematic analysis of who might attack a system, with what capabilities, and toward what end
Asking the model to act as a character or alternate AI that has no safety restrictions
Breaking a harmful request into individually innocent sub-requests to bypass filters
Using an AI to generate adversarial inputs at scale, augmenting human testers


Limits of Red-Teaming

Red-teaming is valuable, but it has irreducible limitations that safety practitioners must understand.

Coverage is finite. The space of possible inputs to a language model is essentially infinite. Red teams can only sample from it. A system can pass a red-teaming exercise and still exhibit failures on inputs the team did not try. The absence of found failures is not proof of safety.

Red teams reflect their own demographics and assumptions. A red team that lacks relevant expertise, such as domain knowledge about weapons synthesis or cultural knowledge about the contexts where a system will be deployed, will miss attack surfaces that people with that expertise would find immediately. Diverse red teams find more diverse failure modes.

Findings date quickly. A jailbreak discovered and mitigated during red-teaming may be reinvented in a new form by users after deployment. Models updated to fix one vulnerability may develop different ones. Red-teaming is not a one-time exercise but a continuous practice.

High-stakes rare events are hard to probe. A model that refuses harmful requests 99.9% of the time will still comply about a million times if it is exposed to a billion such requests. Red-teaming at small scale will not reliably find failure modes that occur with very low probability, yet those modes may be catastrophic when they do occur.
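The arithmetic behind the coverage and rare-event limits can be worked through directly. The sketch below assumes every probe is an independent trial with the same fixed failure rate. That is a simplification, since adversarial prompts are not random samples of deployment traffic, so the numbers are illustrative only. The function names are hypothetical.

```python
import math


def expected_failures(failure_rate: float, queries: int) -> float:
    """Expected number of failures over `queries` independent requests."""
    return failure_rate * queries


def prob_all_probes_miss(failure_rate: float, probes: int) -> float:
    """Chance that `probes` independent attempts never trigger the failure."""
    return (1.0 - failure_rate) ** probes


def max_rate_consistent_with_zero(probes: int, confidence: float = 0.95) -> float:
    """Largest failure rate still plausible after zero failures in `probes` tries.

    Solves (1 - p) ** probes = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / probes)


# A 99.9% refusal rate at a billion harmful requests: about a million failures.
print(expected_failures(0.001, 1_000_000_000))

# A one-in-a-million failure mode is very likely to survive 10,000 probes.
print(prob_all_probes_miss(1e-6, 10_000))

# Zero failures in 10,000 probes still leaves room for a rate near 0.03%.
bound = max_rate_consistent_with_zero(10_000)
print(bound, expected_failures(bound, 1_000_000_000))
```

Under these assumptions, a clean run of 10,000 probes only bounds the failure rate at roughly 3 in 10,000, which at a billion queries is still compatible with hundreds of thousands of failures.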

Responsible Disclosure and Ethics of Red-Teaming

When red-teamers find serious vulnerabilities — particularly in deployed systems — they face a responsible disclosure dilemma: inform the developer privately and give time to patch before going public, or disclose immediately. In AI safety, the norms are still developing. Many major AI labs now have bug-bounty programs and responsible disclosure policies. Red-teamers working outside these formal programs should think carefully about the potential harm of releasing details of serious vulnerabilities before they are patched.

A red-team tests a language model by asking it to write a story in which a character explains how to make a dangerous substance. The model complies. This attack technique is best classified as:

A red-team runs 10,000 adversarial prompts against a language model and finds no harmful outputs. The correct conclusion is:

Build a Threat Model for a Deployed AI

  1. Choose an AI system: a customer service chatbot for a bank, an AI tutor for middle-school students, a language model embedded in a code editor, or a content recommendation system for a news platform.
  2. Identify three distinct attacker profiles: who they are, what they want, and what capabilities they have.
  3. For each attacker, identify their two most likely attack vectors against your chosen system.
  4. Build a simple attack tree for the most serious threat: draw the goal at the top, two or three alternative approaches branching from it, and specific techniques at the leaves.
  5. Prioritize: which branch of the attack tree represents the greatest risk? Why?
  6. For the highest-priority branch, propose a specific technical and a specific operational mitigation.
  7. Write your findings as a one-page threat model memo. Your audience is a product safety team preparing for red-team exercises.
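
One possible way to organize the attacker profiles and the prioritization step before writing the memo is sketched below. The likelihood-times-impact score is an assumption introduced here, not something the activity prescribes, and the `AttackVector` class, the 1 to 5 scales, and the example entries for a bank chatbot are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackVector:
    """One attacker profile paired with one way of attacking the system."""
    attacker: str
    technique: str
    likelihood: int  # 1 (rare) to 5 (expected); a judgment call, not a measurement
    impact: int      # 1 (minor) to 5 (severe); a judgment call, not a measurement

    @property
    def risk(self) -> int:
        """Simple risk score; an assumed heuristic, not a standard formula."""
        return self.likelihood * self.impact


def prioritize(vectors: list[AttackVector]) -> list[AttackVector]:
    """Order attack vectors from highest to lowest risk score."""
    return sorted(vectors, key=lambda v: v.risk, reverse=True)


# Illustrative entries for a bank customer-service chatbot.
vectors = [
    AttackVector("curious customer", "role-play jailbreak for off-policy advice", 4, 2),
    AttackVector("fraudster", "prompt injection to reveal account details", 3, 5),
    AttackVector("disgruntled insider", "tampering with the system prompt", 1, 5),
]

for v in prioritize(vectors):
    print(f"risk {v.risk:>2}: {v.attacker} via {v.technique}")
```

The scores only rank the vectors relative to one another. The memo still needs the reasoning behind each likelihood and impact estimate.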
Red-Teaming and Stress-Testing — Owens AI Institute | HYVE CARES