AI Agents & Automation

Evaluation: Measuring Agent Quality

There is a saying among ML engineers: you cannot improve what you cannot measure. For agents, this principle carries extra weight — because agents take actions in the world, a defective agent that goes undetected does not merely produce wrong outputs that humans can discard. It does wrong things. Evaluation is the discipline of rigorously measuring how well an agent performs, so that developers can detect problems, compare versions, and make informed decisions about whether an agent is safe to deploy.

Why Agent Evaluation Is Harder Than Classifier Evaluation

Evaluating a classifier (a model that assigns a label to an input) is well understood. You hold out a test set, run the model on it, compare predicted labels to ground-truth labels, and compute accuracy, precision, recall, or F1. The output is a single deterministic label per input, ground truth is unambiguous, and the evaluation is cheap to run. Agent evaluation breaks each of these assumptions.

First, an agent's output is not a label. It is a trajectory: a sequence of actions, decisions, and intermediate states that may span many steps. You cannot evaluate a trajectory by comparing it to a single ground-truth trajectory, because there are often many valid trajectories to the same correct outcome.

Second, agent behavior is stochastic: run the same agent on the same task twice and you may get different trajectories. You need multiple runs to get a stable estimate of performance.

Third, some agent actions have real-world side effects: you cannot run thousands of evaluation trials against a live production database.

Fourth, determining whether an agent succeeded at a complex, open-ended task often requires human judgment, which is expensive to scale.
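To make the point about stochastic behavior concrete, here is a minimal sketch of estimating a success rate from repeated runs of one task. It reports a Wilson score interval so the uncertainty is visible alongside the point estimate. The `flaky_agent` function is a hypothetical stand-in for a real agent run, and the 70% success probability is an assumption chosen only for illustration.

```python
import math
import random
from typing import Callable, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimate_success_rate(run_once: Callable[[], bool], trials: int):
    """Run the same task `trials` times; return the rate and its interval."""
    successes = sum(1 for _ in range(trials) if run_once())
    return successes / trials, wilson_interval(successes, trials)


# Hypothetical stand-in for a real agent: succeeds about 70% of the time.
rng = random.Random(0)

def flaky_agent() -> bool:
    return rng.random() < 0.7


rate, (low, high) = estimate_success_rate(flaky_agent, trials=20)
print(f"success rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only a handful of runs the interval is wide. Increasing `trials` narrows it, which is the practical reason a single run per test case is not enough.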

Task Success vs. Trajectory Quality

Agent evaluation must track two distinct dimensions. Task success asks: did the agent achieve the goal? Trajectory quality asks: did it achieve the goal in an acceptable way — efficiently, safely, without unnecessary side effects? An agent that achieves the goal by brute-force trying every possible action scores well on task success but poorly on trajectory quality. Both dimensions matter.

What to Measure: The Evaluation Dimensions

A complete agent evaluation framework tracks several dimensions simultaneously.

Task success rate is the fraction of tasks on a defined eval set that the agent completes correctly. Defining 'correctly' precisely is the hard part: for open-ended tasks, you may need a rubric or a judge model. For constrained tasks (booking a flight, running a query, generating a specific file), success can be verified programmatically.

Step efficiency measures how many actions the agent takes per task relative to a baseline or theoretical minimum. An agent that completes the same task in 4 steps rather than 12 is not just faster. It is also less likely to accumulate errors and less expensive to operate.

Error rate and error type tracking records not just whether the agent failed, but how it failed: did it loop, hallucinate a tool, drift from the goal, or hit a hard error? A dashboard showing that 60% of failures are hallucinated tool calls tells you exactly where to invest engineering effort.

Safety and constraint adherence measures whether the agent respected its defined limits: did it stay within its authorized scope, avoid taking irreversible actions without confirmation, and handle sensitive data correctly?

Latency and cost measure the wall-clock time and resource cost per task, which matter for production viability even when the agent is technically correct.
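One way to track these dimensions together is to log a small record per run and aggregate afterwards. The sketch below is illustrative only: the `RunRecord` fields, the error labels, and the sample numbers are all assumptions, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRecord:
    """One agent run on one task. All field names are hypothetical."""
    task_id: str
    succeeded: bool
    steps: int
    baseline_steps: int          # reference step count for this task
    error_type: Optional[str]    # e.g. "loop", "hallucinated_tool"; None on success
    scope_violations: int        # attempted actions outside the authorized scope
    latency_s: float
    cost_usd: float


def summarize(runs: List[RunRecord]) -> dict:
    """Aggregate per-run records into one number (or table) per dimension."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    failures = [r for r in runs if not r.succeeded]
    error_counts = Counter(r.error_type or "unknown" for r in failures)
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "mean_step_ratio": sum(r.steps / r.baseline_steps for r in runs) / n,
        "error_share": {kind: count / len(failures) for kind, count in error_counts.items()},
        "violation_rate": sum(r.scope_violations > 0 for r in runs) / n,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
    }


# Made-up sample data for illustration.
runs = [
    RunRecord("t1", True, 4, 4, None, 0, 2.0, 0.05),
    RunRecord("t2", False, 12, 4, "hallucinated_tool", 0, 6.0, 0.20),
    RunRecord("t3", False, 8, 4, "loop", 1, 4.0, 0.11),
]
report = summarize(runs)
for name, value in report.items():
    print(name, value)
```

Keeping `error_share` as a breakdown by failure type, rather than a single error rate, is what lets a dashboard point at the dominant failure mode.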

A common mistake in early agent development is measuring only task success rate while ignoring the other dimensions. An agent with 85% task success that consistently takes 3x more steps than necessary, runs up $0.40 per task in API costs, and occasionally violates its safety constraints is not a production-ready agent. Evaluation must be multi-dimensional to be useful.
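The example agent above can be run through a simple multi-dimensional release gate. This is a sketch: the threshold values and metric names are hypothetical, and the 2% violation rate is an assumed reading of "occasionally".

```python
from typing import Dict, List

# Hypothetical thresholds; each team would choose its own.
THRESHOLDS = {
    "min_success_rate": 0.80,
    "max_step_ratio": 1.5,
    "max_cost_usd": 0.25,
    "max_violation_rate": 0.0,
}


def gate_failures(metrics: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the dimensions on which the agent misses its threshold."""
    problems = []
    if metrics["success_rate"] < thresholds["min_success_rate"]:
        problems.append("success_rate")
    if metrics["step_ratio"] > thresholds["max_step_ratio"]:
        problems.append("step_ratio")
    if metrics["cost_usd"] > thresholds["max_cost_usd"]:
        problems.append("cost_usd")
    if metrics["violation_rate"] > thresholds["max_violation_rate"]:
        problems.append("violation_rate")
    return problems


# The agent described above: good success rate, poor on everything else.
example = {"success_rate": 0.85, "step_ratio": 3.0, "cost_usd": 0.40, "violation_rate": 0.02}
print(gate_failures(example))  # ['step_ratio', 'cost_usd', 'violation_rate']
```

The agent passes on task success alone but fails three of the four checks, which is exactly what a success-only metric would hide.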

How to Evaluate: Automated Judges and Human Review

Because evaluating agent output requires judgment, not just comparison, evaluation pipelines commonly combine automated checks with human review.

Programmatic checks are the cheapest form of evaluation. For constrained tasks ('send this email to address X', 'write a file named Y with content matching pattern Z'), success can be verified by inspecting the state of the world after the agent runs: did the email arrive, does the file exist, does it match the pattern? These checks are fast, scalable, and free of evaluator subjectivity.

LLM-as-judge is a technique where a second language model reads the agent's trajectory and output and scores it against a rubric. LLM judges are faster and cheaper than human evaluators and can scale to thousands of evaluation examples. They have their own biases: they may favor verbose outputs, may miss subtle logical errors, and may be manipulated by an agent optimized against the same judge. For many dimensions of quality, though, they provide useful signal.

Human evaluation remains the gold standard for tasks that require genuine judgment about quality, appropriateness, or safety. Human evaluators are expensive and slow, but for establishing ground truth in a new domain, or for catching the failure modes that automated systems miss, there is no substitute.
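The first two approaches can be sketched in a few lines. The code below assumes the judge is any callable that takes a prompt string and returns text. The `RUBRIC` wording, the 1 to 5 scale, and the stub judge are all hypothetical, and a real pipeline would call an actual model in place of the stub.

```python
import re
import tempfile
from pathlib import Path
from typing import Callable, Optional


def file_matches(path: Path, pattern: str) -> bool:
    """Programmatic check: the file exists and its content matches `pattern`."""
    if not path.is_file():
        return False
    return re.search(pattern, path.read_text(encoding="utf-8")) is not None


# Hypothetical rubric text; a real rubric would be far more specific.
RUBRIC = "Rate the response for helpfulness on a scale of 1 to 5. Reply with the number only."


def judge_score(judge: Callable[[str], str], task: str, response: str) -> Optional[int]:
    """Ask a judge model for a 1-5 score; return None if the reply is unusable."""
    prompt = f"{RUBRIC}\n\nTask: {task}\n\nResponse: {response}\n\nScore:"
    reply = judge(prompt)
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


# Programmatic check against a temporary directory standing in for the agent's workspace.
with tempfile.TemporaryDirectory() as workdir:
    report = Path(workdir) / "report.txt"
    report.write_text("Total: 42\n", encoding="utf-8")
    print(file_matches(report, r"Total: \d+"))                    # True
    print(file_matches(Path(workdir) / "missing.txt", r".*"))     # False

# Stub judge that always answers "4"; replace with a real model call.
print(judge_score(lambda prompt: "4", "Reset my password", "Click 'Forgot password'."))  # 4
```

Returning `None` for an unparseable judge reply, instead of guessing a score, keeps judge failures visible in the results.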

Goodhart's Law in Agent Evaluation

Goodhart's Law states: when a measure becomes a target, it ceases to be a good measure. If you optimize an agent heavily against a specific eval set, the agent may learn to perform well on that eval without generalizing. Always hold out a test set that the agent is never trained or tuned against, and periodically refresh your eval set to reflect new task types.
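One simple way to keep a held-out set stable is to assign each task to a split from a hash of its ID, so the assignment does not change when tasks are added or reordered. This is a sketch under assumptions: the task ID format, the 20% held-out fraction, and the split names are all illustrative.

```python
import hashlib


def split_for(task_id: str, held_out_fraction: float = 0.2) -> str:
    """Deterministically assign a task to 'dev' or 'held_out' from its ID."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "dev"


# Hypothetical task IDs.
task_ids = [f"task-{i:03d}" for i in range(200)]
dev = [t for t in task_ids if split_for(t) == "dev"]
held_out = [t for t in task_ids if split_for(t) == "held_out"]
print(len(dev), len(held_out))
```

Because the split depends only on the task ID, a task can never drift from the held-out set into the tuning set between runs. Refreshing the eval set then means adding new task IDs, not reshuffling old ones.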

Match each evaluation technique to the dimension of agent quality it is best suited to measure.

Terms

Programmatic check: inspect the file system after the agent runs
Count the number of tool calls per completed task
A second LLM scores the agent's response against a helpfulness rubric
Log every instance the agent attempts an action outside its authorized scope
Track total token spend and wall-clock time per task run

Definitions

Task success on constrained, verifiable tasks
Safety and constraint adherence
Output quality on open-ended, subjective tasks
Step efficiency relative to a baseline
Latency and cost per task

A team evaluates their customer-support agent purely by measuring whether the user's stated problem appears in the agent's final response. The agent achieves 92% on this metric. Why is this evaluation incomplete?

Why does evaluating an agent reliably require running each test case multiple times rather than once?

An agent evaluation framework should track at least five dimensions: ____ (whether the goal was achieved), step ____ (how many actions were needed), error rate and ____ (what kind of failure occurred), ____ adherence (whether the agent stayed within its limits), and latency and ____.

Design an Evaluation Framework

You are the evaluation lead for an agent that helps users book travel: finding flights, hotels, and ground transportation given a natural-language request like 'I need to get from Seattle to Austin next Thursday for under $400 total.'

  1. Define success criteria. For this agent, what does a 'successful' task completion look like? Write a precise definition that a programmer could use to write an automated check.
  2. Design a 10-task eval set. Write the natural-language request for each of your 10 tasks, covering a range of complexity: simple direct requests, ambiguous requests, requests with conflicting constraints, requests the agent should refuse (e.g., asking it to book something illegal), and edge cases.
  3. For each evaluation dimension (task success, step efficiency, error type, safety adherence, cost), describe how you would measure it for this specific agent. Be concrete: what would you log, what would you compare, what would you check programmatically vs. with a human judge?
  4. Identify one dimension where your evaluation approach is weakest and explain why. What additional data or tooling would strengthen it?