揭秘AI代理的评估 Demystifying evals for AI agents —— Anthropic

RR1335

707人浏览 · 2026-05-02 09:36:37

RR1335 · 2026-05-02 09:36:37 发布

Demystifying evals for AI agents

揭秘AI代理的评估

The capabilities that make agents useful also make them difficult to evaluate. The strategies that work across deployments combine techniques to match the complexity of the systems they measure.

使智能体有用的能力也让它们难以评估。适用于各种部署场景的策略需结合多种技术手段，以匹配其所衡量系统的复杂性。

Introduction

Good evaluations help teams ship AI agents more confidently. Without them, it’s easy to get stuck in reactive loops—catching issues only in production, where fixing one failure creates others. Evals make problems and behavioral changes visible before they affect users, and their value compounds over the lifecycle of an agent.

良好的评估能帮助团队更有信心地部署AI智能体。若缺乏评估机制，团队很容易陷入被动循环——只能在生产环境中发现问题，而修复一个故障又会导致其他问题。评估能在问题影响用户之前使其可见，同时让行为变化显性化，其价值会随着智能体生命周期的推进而持续放大。

As we described in Building effective agents, agents operate over many turns: calling tools, modifying state, and adapting based on intermediate results. These same capabilities that make AI agents useful—autonomy, intelligence, and flexibility—also make them harder to evaluate.

正如我们在《构建高效智能体》中所描述的，智能体通过多轮操作运行：调用工具、修改状态并根据中间结果进行自适应调整。正是这些使AI智能体具备实用价值的特性——自主性、智能性和灵活性——同时也使得它们更难以评估。

Through our internal work and with customers at the frontier of agent development, we’ve learned how to design more rigorous and useful evals for agents. Here's what's worked across a range of agent architectures and use cases in real-world deployment.

通过我们内部的研发工作以及与处于智能体开发前沿的客户合作，我们掌握了如何为智能体设计更严谨有效的评估体系。以下是经过各类智能体架构和实际应用场景验证的有效方法论。

The structure of an evaluation

An evaluation (“eval”) is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success. In this post, we focus on automated evals that can be run during development without real users.

评估的结构
评估（“eval”）是对人工智能系统的测试：给AI一个输入，然后对其输出应用评分逻辑以衡量成功与否。在本文中，我们重点关注可在开发过程中运行而无需真实用户的自动化评估。

Single-turn evaluations are straightforward: a prompt, a response, and grading logic. For earlier LLMs, single-turn, non-agentic evals were the main evaluation method. As AI capabilities have advanced, multi-turn evaluations have become increasingly common.

单轮评估很简单：一个提示、一个回应和评分逻辑。对于早期的LLM，单轮非代理评估是主要的评估方法。随着AI能力的进步，多轮评估变得越来越普遍。

In a simple eval, an agent processes a prompt, and a grader checks if the output matches expectations. For a more complex multi-turn eval, a coding agent receives tools, a task (building an MCP server in this case), and an environment, executes an "agent loop" (tool calls and reasoning), and updates the environment with the implementation. Grading then uses unit tests to verify the working MCP server.

在一次简单的评估中，智能体处理提示，评分者检查输出是否符合预期。而在更复杂的多轮评估中，编码智能体会接收工具、任务（此处指构建MCP服务器）和环境，执行"智能体循环"（工具调用与推理），并通过实现方案更新环境。评分阶段则使用单元测试来验证可运行的MCP服务器。

Agent evaluations are even more complex. Agents use tools across many turns, modifying state in the environment and adapting as they go—which means mistakes can propagate and compound. Frontier models can also find creative solutions that surpass the limits of static evals. For instance, Opus 4.5 solved a 𝜏2-bench problem about booking a flight by discovering a loophole in the policy. It “failed” the evaluation as written, but actually came up with a better solution for the user.

智能体评估更为复杂。智能体在多轮交互中使用工具，不断修改环境状态并实时调整——这意味着错误可能传播并叠加。前沿模型还能发现突破静态评估限制的创新解决方案。例如，Opus 4.5通过发现政策漏洞解决了关于机票预订的𝜏2-bench问题。按既定标准它"未通过"评估，但实际上为用户提供了更优解决方案。

When building agent evaluations, we use the following definitions:

A task (a.k.a problem or test case) is a single test with defined inputs and success criteria.
Each attempt at a task is a trial. Because model outputs vary between runs, we run multiple trials to produce more consistent results.
A grader is logic that scores some aspect of the agent’s performance. A task can have multiple graders, each containing multiple assertions (sometimes called checks).
A transcript (also called a trace or trajectory) is the complete record of a trial, including outputs, tool calls, reasoning, intermediate results, and any other interactions. For the Anthropic API, this is the full messages array at the end of an eval run - containing all the calls to the API and all of the returned responses during the evaluation.
The outcome is the final state in the environment at the end of the trial. A flight-booking agent might say “Your flight has been booked” at the end of the transcript, but the outcome is whether a reservation exists in the environment’s SQL database.
An evaluation harness is the infrastructure that runs evals end-to-end. It provides instructions and tools, runs tasks concurrently, records all the steps, grades outputs, and aggregates results.
An agent harness (or scaffold) is the system that enables a model to act as an agent: it processes inputs, orchestrates tool calls, and returns results. When we evaluate “an agent,” we’re evaluating the harness and the model working together. For example, Claude Code is a flexible agent harness, and we used its core primitives through the Agent SDK to build our long-running agent harness.
An evaluation suite is a collection of tasks designed to measure specific capabilities or behaviors. Tasks in a suite typically share a broad goal. For instance, a customer support eval suite might test refunds, cancellations, and escalations.

在构建智能体评估体系时，我们采用以下定义：

任务（亦称问题或测试用例）是指具有明确输入和成功标准的单项测试。每次执行任务称为一次试验。由于模型输出存在波动性，我们通过多次试验来获得更稳定的结果。

评分器是用于评估智能体某方面表现的逻辑模块。单个任务可包含多个评分器，每个评分器又包含若干断言（有时称为检查项）。

执行记录（也称追踪或轨迹）完整记载试验全过程，包括输出内容、工具调用、推理过程、中间结果及其他交互数据。对于Anthropic API而言，这相当于评估运行结束时的完整消息数组——包含评估期间所有API调用及返回响应。

终态指试验结束时环境的最终状态。例如机票预订智能体可能在记录末尾显示"已完成订票"，但真正的终态取决于环境SQL数据库中是否存在相应预订记录。

评估框架是实现端到端评估的基础设施，负责提供指令工具、并行执行任务、记录所有步骤、评分输出结果并汇总数据。

智能体框架（或称脚手架）是使模型具备智能体功能的系统：处理输入、协调工具调用并返回结果。当我们评估"某个智能体"时，实际评估的是框架与模型的协同表现。例如Claude Code就是灵活型智能体框架，我们通过Agent SDK调用其核心组件构建了长效智能体框架。

评估套件是为检测特定能力或行为而设计的任务集合，套件中的任务通常服务于共同目标。例如客户服务评估套件可能包含退款、取消订单、升级投诉等测试模块。

Why build evaluations?

When teams first start building agents, they can get surprisingly far through a combination of manual testing, dogfooding, and intuition. More rigorous evaluation may even seem like overhead that slows down shipping. But after the early prototyping stages, once an agent is in production and has started scaling, building without evals starts to break down.

为什么要建立评估机制？
当团队刚开始构建智能体时，通过手动测试、内部试用和直觉判断，往往能取得超乎预期的进展。更严格的评估甚至可能被视为拖慢交付进度的负担。但在早期原型阶段之后，一旦智能体投入生产并开始规模化，缺乏评估的构建方式就会开始失效。

The breaking point often comes when users report the agent feels worse after changes, and the team is “flying blind” with no way to verify except to guess and check. Absent evals, debugging is reactive: wait for complaints, reproduce manually, fix the bug, and hope nothing else regressed. Teams can't distinguish real regressions from noise, automatically test changes against hundreds of scenarios before shipping, or measure improvements.

当用户反馈更改后智能体表现更差时，临界点往往就会出现，而团队只能"盲目飞行"，除了猜测和检查外别无验证方法。没有评估机制的情况下，调试是被动的：等待投诉、手动复现问题、修复错误，然后祈祷不会引发其他倒退。团队无法区分真正的性能倒退与随机波动，无法在发布前自动测试数百种场景的变更影响，也无法量化改进效果。

We’ve seen this progression play out many times. For instance, Claude Code started with fast iteration based on feedback from Anthropic employees and external users. Later, we added evals—first for narrow areas like concision and file edits, and then for more complex behaviors like over-engineering. These evals helped identify issues, guide improvements, and focus research-product collaborations. Combined with production monitoring, A/B tests, user research, and more, evals provide signals to continue improving Claude Code as it scales.

我们已多次见证这一发展进程。以Claude Code为例，最初我们根据Anthropic员工和外部用户的反馈进行快速迭代，随后逐步引入评估体系——先针对代码简洁性、文件编辑等具体维度，再扩展到过度工程化等复杂行为评估。这些评估机制不仅能发现问题、指导优化，更能聚焦研发与产品的协作重点。配合生产环境监控、A/B测试和用户研究等手段，评估体系为Claude Code的规模化升级提供了持续改进的方向标。

Writing evals is useful at any stage in the agent lifecycle. Early on, evals force product teams to specify what success means for the agent, while later they help uphold a consistent quality bar.

编写评估在智能体生命周期的任何阶段都很有用。早期，评估能促使产品团队明确智能体的成功标准；后期则有助于维持稳定的质量门槛。

Descript’s agent helps users edit videos, so they built evals around three dimensions of a successful editing workflow: don’t break things, do what I asked, and do it well. They evolved from manual grading to LLM graders with criteria defined by the product team and periodic human calibration, and now regularly run two separate suites for quality benchmarking and regression testing. The Bolt AI team started building evals later, after they already had a widely used agent. In 3 months, they built an eval system that runs their agent and grades outputs with static analysis, uses browser agents to test apps, and employs LLM judges for behaviors like instruction following.

的代理帮助用户编辑视频，因此他们围绕成功编辑工作流程的三个维度构建了评估体系：不破坏内容、按要求执行以及出色完成。他们从人工评分逐步发展为采用产品团队定义的标准并结合定期人工校准的LLM评分系统，如今定期运行两套独立测试集进行质量基准评估和回归测试。Bolt AI团队在代理已广泛使用后才开始构建评估系统，仅用3个月就搭建出一套能运行代理程序、通过静态分析评分输出、利用浏览器代理测试应用程序，并采用LLM裁判评估指令遵循等行为的评估体系。

Some teams create evals at the start of development; others add them once at scale when evals become a bottleneck for improving the agent. Evals are especially useful at the start of agent development to explicitly encode expected behavior. Two engineers reading the same initial spec could come away with different interpretations on how the AI should handle edge cases. An eval suite resolves this ambiguity. Regardless of when they’re created, evals help accelerate development.

Evals also shape how quickly you can adopt new models. When more powerful models come out, teams without evals face weeks of testing while competitors with evals can quickly determine the model’s strengths, tune their prompts, and upgrade in days.

一些团队在开发初期就创建评估体系；另一些团队则等到规模扩大、评估成为改进智能体瓶颈时才一次性添加。评估在智能体开发初期尤为有用，能明确编码预期行为。两名工程师阅读同一份初始需求文档，对AI应如何处理边缘案例可能产生不同理解。评估套件能消除这种歧义。无论何时创建，评估都有助于加速开发进程。

评估还决定了采用新模型的速度。当更强大的模型发布时，没有评估体系的团队需要数周测试，而具备评估体系的竞争者能快速判定模型优势、调整提示词，在数日内完成升级。

Once evals exist, you get baselines and regression tests for free: latency, token usage, cost per task, and error rates can be tracked on a static bank of tasks. Evals can also become the highest-bandwidth communication channel between product and research teams, defining metrics researchers can optimize against. Clearly, evals have wide-ranging benefits beyond tracking regressions and improvements. Their compounding value is easy to miss given that costs are visible upfront while benefits accumulate later.

一旦有了评估体系，你就能免费获得基准和回归测试：可以在静态任务库中跟踪延迟、令牌使用量、每项任务的成本和错误率。评估也可以成为产品团队和研究团队之间最高效的沟通渠道，定义研究人员可以优化的指标。显然，评估的好处远远超出了跟踪回归和改进。由于成本是预先可见的，而收益是后来积累的，它们的复合价值很容易被忽视。

How to evaluate AI agents

We see several common types of agents deployed at scale today, including coding agents, research agents, computer use agents, and conversational agents. Each type may be deployed across a wide variety of industries, but they can be evaluated using similar techniques. You don’t need to invent an evaluation from scratch. The sections below describe proven techniques for several agent types. Use these methods as a foundation, then extend them to your domain.

Types of graders for agents

Agent evaluations typically combine three types of graders: code-based, model-based, and human. Each grader evaluates some portion of either the transcript or the outcome. An essential component of effective evaluation design is to choose the right graders for the job.

如何评估AI智能体
如今，我们看到几种常见类型的智能体被大规模部署，包括编程智能体、研究智能体、计算机操作智能体和对话智能体。每种类型都可以应用于各种行业，但可以使用相似的技术进行评估。你不需要从头开始设计评估方法。以下部分介绍了几种智能体类型的成熟评估技术。以这些方法为基础，再根据你的领域进行扩展。

智能体的评估类型
智能体评估通常结合三种评估方式：基于代码的评估、基于模型的评估和人工评估。每种评估方式都会对交互记录或结果的一部分进行评估。有效的评估设计关键在于为任务选择合适的评估方式。

For each task, scoring can be weighted (combined grader scores must hit a threshold), binary (all graders must pass), or a hybrid.

对于每项任务，评分可采用加权方式（综合评分需达到阈值）、二元方式（所有评分者必须通过）或混合方式。

Capability vs. regression evals

Capability or “quality” evals ask, “What can this agent do well?” They should start at a low pass rate, targeting tasks the agent struggles with and giving teams a hill to climb.

Regression evals ask, “Does the agent still handle all the tasks it used to?” and should have a nearly 100% pass rate. They protect against backsliding, as a decline in score signals that something is broken and needs to be improved. As teams hill-climb on capability evals, it’s important to also run regression evals to make sure changes don’t cause issues elsewhere.

After an agent is launched and optimized, capability evals with high pass rates can “graduate” to become a regression suite that is run continuously to catch any drift. Tasks that once measured “Can we do this at all?” then measure “Can we still do this reliably?”

能力评估与回归评估
能力评估（或称"质量"评估）旨在回答："这个智能体擅长做什么？"这类评估初始通过率应较低，针对智能体难以完成的任务，为团队设立进阶目标。

回归评估则追问："智能体是否仍能妥善处理既往任务？"其通过率应接近100%。这类评估用于防止性能倒退——分数下降即意味着存在故障需修复。当团队通过能力评估逐步提升时，必须同步进行回归评估，以确保改动不会引发其他问题。

当智能体完成发布和优化后，高通过率的能力评估可"升级"为持续运行的回归测试套件，用于监测性能偏差。曾经衡量"我们能否实现这个功能？"的任务，将转变为检验"我们能否持续稳定实现该功能？"

Evaluating coding agents

Coding agents write, test, and debug code, navigating codebases and running commands much like a human developer. Effective evals for modern coding agents usually rely on well-specified tasks, stable test environments, and thorough tests for the generated code.

Deterministic graders are natural for coding agents because software is generally straightforward to evaluate: does the code run and do the tests pass? Two widely used coding agent benchmarks, SWE-bench Verified and Terminal-Bench, follow this approach. SWE-bench Verified gives agents GitHub issues from popular Python repositories and grades solutions by running the test suite; a solution passes only if it fixes the failing tests without breaking existing ones. LLMs have progressed from 40% to >80% on this eval in just one year. Terminal-Bench takes a different track: it tests end-to-end technical tasks, such as building a Linux kernel from source or training an ML model.

Once you have a set of pass-or-fail tests for validating the key outcomes of a coding task, it’s often useful to also grade the transcript. For instance, heuristics-based code quality rules can evaluate the generated code based on more than passing tests, and model-based graders with clear rubrics can assess behaviors like how the agent calls tools or interacts with the user.

评估编程代理
编程代理能编写、测试和调试代码，浏览代码库并运行命令，其操作方式与人类开发者类似。对现代编程代理的有效评估通常依赖于明确指定的任务、稳定的测试环境以及对生成代码的全面测试。

确定性评分器天然适用于编程代理，因为软件评估通常较为直接：代码能否运行？测试是否通过？两个广泛使用的编程代理基准测试——SWE-bench Verified和Terminal-Bench——均采用这一思路。SWE-bench Verified向代理提供热门Python代码库的GitHub问题，并通过运行测试套件对解决方案评分；仅当解决方案修复了失败的测试且未破坏现有测试时，才会判定为通过。仅一年内，大语言模型在该评估中的通过率就从40%提升至80%以上。Terminal-Bench则另辟蹊径：它测试端到端技术任务，例如从源代码构建Linux内核或训练机器学习模型。

当您拥有一组用于验证编程任务关键结果的通过/失败测试后，对操作记录进行评分也常具价值。例如，基于启发式的代码质量规则可超越测试通过与否来评估生成代码，而带有明确评分标准的模型评分器能评估代理调用工具或与用户交互等行为。

Example: Theoretical evaluation for a coding agent

Consider a coding task where the agent must fix an authentication bypass vulnerability. As shown in the illustrative YAML file below, one could evaluate this agent using both graders and metrics.

示例：编码代理的理论评估

考虑一个编码任务，代理必须修复身份验证绕过漏洞。如下面的示例YAML文件所示，可以使用评分器和指标来评估该代理。

task:
  id: "fix-auth-bypass_1"
  desc: "Fix authentication bypass when password field is empty and ..."
  graders:
    - type: deterministic_tests
      required: [test_empty_pw_rejected.py, test_null_pw_rejected.py]
    - type: llm_rubric
      rubric: prompts/code_quality.md
    - type: static_analysis
      commands: [ruff, mypy, bandit]
    - type: state_check
      expect:
        security_logs: {event_type: "auth_blocked"}
    - type: tool_calls
      required:
        - {tool: read_file, params: {path: "src/auth/*"}}
        - {tool: edit_file}
        - {tool: run_tests}
  tracked_metrics:
    - type: transcript
      metrics:
        - n_turns
        - n_toolcalls
        - n_total_tokens
    - type: latency
      metrics:
        - time_to_first_token
        - output_tokens_per_sec
        - time_to_last_token

Note that this example showcases the full range of available graders for illustration. In practice, coding evaluations typically rely on unit tests for correctness verification and an LLM rubric for assessing overall code quality, with additional graders and metrics added only as needed.

请注意，此示例展示了所有可用的评分项以作说明。实际应用中，代码评估通常依靠单元测试来验证正确性，并使用LLM评分标准评估整体代码质量，其他评分项和指标仅在需要时添加。

Evaluating conversational agents

Conversational agents interact with users in domains like support, sales, or coaching. Unlike traditional chatbots, they maintain state, use tools, and take actions mid-conversation. While coding and research agents can also involve many turns of interaction with the user, conversational agents present a distinct challenge: the quality of the interaction itself is part of what you're evaluating. Effective evals for conversational agents usually rely on verifiable end-state outcomes and rubrics that capture both task completion and interaction quality. Unlike most other evals, they often require a second LLM to simulate the user. We use this approach in our alignment auditing agents to stress-test models through extended, adversarial conversations.

Success for conversational agents can be multidimensional: is the ticket resolved (state check), did it finish in <10 turns (transcript constraint), and was the tone appropriate (LLM rubric)? Two benchmarks that incorporate multidimensionality are 𝜏-Bench and its successor, τ2-Bench. These simulate multi-turn interactions across domains like retail support and airline booking, where one model plays a user persona while the agent navigates realistic scenarios.

评估对话式智能体
对话式智能体在客服、销售或辅导等领域与用户互动。与传统聊天机器人不同，它们能保持状态、使用工具并在对话过程中执行操作。虽然编程和研究型智能体也可能涉及多轮用户交互，但对话式智能体的评估面临独特挑战：交互质量本身就是评估的关键维度。

有效的评估通常依赖可验证的终态结果和评估量规，既要衡量任务完成度，也要考量交互质量。与其他评估不同，这类评估常需借助另一个大语言模型来模拟用户行为。我们在对齐审核智能体中采用该方法，通过长时间对抗性对话对模型进行压力测试。

对话式智能体的成功标准具有多维性：工单是否解决（状态检查）？交互是否在10轮内完成（对话轮次限制）？语气是否恰当（大模型量规评分）？体现这种多维性的基准测试包括𝜏-Bench及其升级版τ2-Bench，它们模拟零售客服、机票预订等领域的多轮交互场景，由一个模型扮演用户角色，另一个作为智能体应对真实情境。

（注：根据技术文本特性，保留"LLM rubric"等专业术语的英文缩写；"alignment auditing agents"译为行业通用表述"对齐审核智能体"；"adversarial conversations"采用计算机安全领域惯用译法"对抗性对话"）

Example: Theoretical evaluation for a conversational agent

Consider a support task where the agent must handle a refund for a frustrated customer.

示例：对话代理的理论评估

考虑一个支持任务，代理必须为一位沮丧的客户处理退款。

graders:
  - type: llm_rubric
    rubric: prompts/support_quality.md
    assertions:
      - "Agent showed empathy for customer's frustration"
      - "Resolution was clearly explained"
      - "Agent's response grounded in fetch_policy tool results"
  - type: state_check
    expect:
      tickets: {status: resolved}
      refunds: {status: processed}
  - type: tool_calls
    required:
      - {tool: verify_identity}
      - {tool: process_refund, params: {amount: "<=100"}}
      - {tool: send_confirmation}
  - type: transcript
    max_turns: 10
tracked_metrics:
  - type: transcript
    metrics:
      - n_turns
      - n_toolcalls
      - n_total_tokens
  - type: latency
    metrics:
      - time_to_first_token
      - output_tokens_per_sec
      - time_to_last_token

As in our coding agent example, this task showcases multiple grader types for illustration. In practice, conversational agent evaluations typically use model-based graders to assess both communication quality and goal completion, because many tasks—like answering a question—may have multiple “correct” solutions.

正如我们的编码代理示例所示，该任务展示了多种评分器类型以供说明。实际上，对话代理评估通常使用基于模型的评分器来评估沟通质量和目标完成情况，因为许多任务（如回答问题）可能存在多个"正确"解决方案。

Evaluating research agents

Research agents gather, synthesize, and analyze information, then produce outputs like an answer or report. Unlike coding agents where unit tests provide binary pass/fail signals, research quality can only be judged relative to the task. What counts as “comprehensive,” “well-sourced,” or even “correct” depends on context: a market scan, due diligence for an acquisition, and a scientific report each require different standards.

Research evals face unique challenges: experts may disagree on whether a synthesis is comprehensive, ground truth shifts as reference content changes constantly, and longer, more open-ended outputs create more room for mistakes. A benchmark like BrowseComp, for example, tests whether AI agents can find needles in haystacks across the open web—questions designed to be easy to verify but hard to solve.

One strategy to build research agent evals is to combine grader types. Groundedness checks verify that claims are supported by retrieved sources, coverage checks define key facts a good answer must include, and source quality checks confirm the consulted sources are authoritative, rather than simply the first retrieved. For tasks with objectively correct answers (“What was Company X’s Q3 revenue?”), exact match works. An LLM can flag unsupported claims and gaps in coverage but also verify the open-ended synthesis for coherence and completeness.

Given the subjective nature of research quality, LLM-based rubrics should be frequently calibrated against expert human judgment to grade these agents effectively.

评估研究型智能体
研究型智能体会收集、整合并分析信息，随后生成答案或报告等输出成果。与编码智能体不同（其单元测试能提供明确的通过/失败信号），研究质量只能根据具体任务来评判。所谓"全面"、"来源可靠"甚至"正确"的标准都取决于情境：市场调研、收购尽职调查和科学报告各自需要不同的评估尺度。

研究评估面临独特挑战：专家可能对信息整合是否全面存在分歧，事实依据会随着参考内容的持续更新而变化，而篇幅更长、开放性更强的输出也更容易出现纰漏。例如BrowseComp这类基准测试，就是检验AI智能体能否在开放网络的海量信息中精准定位目标——这类问题设计得易于验证却难以解答。

构建研究型智能体评估体系时，可采用分级评估组合策略：事实锚定检查确保论断有检索来源支撑，覆盖度检查界定优质答案必须包含的关键事实，而来源质量检查则确认所参考资料的权威性，而非简单采用首次检索结果。对于存在客观正确答案的任务（如"X公司第三季度营收是多少？"），直接匹配即可。大语言模型既能标记缺乏依据的论断和内容缺口，也能评估开放性综合成果的逻辑连贯性与完整度。

鉴于研究质量的主观特性，基于大语言模型的评估标准需定期对照专家人工判断进行校准，以实现有效评级。

（注：翻译中对"ground truth shifts"采用"事实依据变化"的意译，"needles in haystacks"保留喻体译为"海量信息中精准定位目标"；专业术语如"BrowseComp"不作翻译；长难句通过拆分和语序调整确保中文流畅性；"graded"根据上下文译为"评级"而非字面意义的"打分"）

Computer use agents

Computer use agents interact with software through the same interface as humans—screenshots, mouse clicks, keyboard inputs, and scrolling—rather than through APIs or code execution. They can use any application with a graphical user interface (GUI), from design tools to legacy enterprise software. Evaluation requires running the agent in a real or sandboxed environment where it can use software applications and checking whether it achieved the intended outcome. For instance, WebArena tests browser-based tasks, using URL and page state checks to verify the agent navigated correctly, along with backend state verification for tasks that modify data (confirming an order was actually placed, not just that the confirmation page appeared). OSWorld extends this to full operating system control, with evaluation scripts that inspect diverse artifacts after task completion: file system state, application configs, database contents, and UI element properties.

Browser use agents require a balance between token efficiency and latency. DOM-based interactions execute quickly but consume many tokens, while screenshot-based interactions are slower but more token-efficient. For example, when asking Claude to summarize Wikipedia, it is more efficient to extract the text from the DOM. When finding a new laptop case on Amazon, it is more efficient to take screenshots (as extracting the entire DOM is token-intensive). In our Claude for Chrome product, we developed evals to check that the agent was selecting the right tool for each context. This enabled us to complete browser-based tasks faster and more accurately.

计算机使用代理
计算机使用代理通过与人相同的界面与软件交互——截图、鼠标点击、键盘输入和滚动——而非通过API或代码执行。它们能操作任何带图形用户界面（GUI）的应用，从设计工具到遗留企业软件皆可。评估需在真实或沙盒环境中运行代理，观察其使用软件应用后是否达成预期目标。例如，WebArena测试基于浏览器的任务时，既用URL和页面状态检查导航准确性，又通过后端状态验证数据修改类任务（确认订单真实生成，而非仅显示确认页面）。OSWorld将评估扩展至完整操作系统控制，任务完成后检查多类痕迹：文件系统状态、应用配置、数据库内容及UI元素属性。

浏览器代理需平衡令牌效率与延迟。基于DOM的交互执行快但耗令牌多，而基于截图的交互较慢却更省令牌。例如让Claude总结维基百科时，从DOM提取文本更高效；在亚马逊找笔记本保护套时，截图方式更优（因提取整个DOM极耗令牌）。我们在Claude for Chrome产品中开发了评估机制，确保代理能根据场景选择合适工具，从而更快速精准地完成浏览器任务。

How to think about non-determinism in evaluations for agents

Regardless of agent type, agent behavior varies between runs, which makes evaluation results harder to interpret than they first appear. Each task has its own success rate—maybe 90% on one task, 50% on another—and a task that passed on one eval run might fail on the next. Sometimes, what we want to measure is how often (what proportion of the trials) an agent succeeds for a task.

Two metrics help capture this nuance:

pass@k measures the likelihood that an agent gets at least one correct solution in k attempts. As k increases, pass@k score rises: more “shots on goal” means higher odds of at least 1 success. A score of 50% pass@1 means that a model succeeds at half the tasks in the eval on its first try. In coding, we’re often most interested in the agent finding the solution on the first try—pass@1. In other cases, proposing many solutions is valid as long as one works.

pass^k measures the probability that all k trials succeed. As k increases, pass^k falls since demanding consistency across more trials is a harder bar to clear. If your agent has a 75% per-trial success rate and you run 3 trials, the probability of passing all three is (0.75)³ ≈ 42%. This metric especially matters for customer-facing agents where users expect reliable behavior every time.

如何思考智能体评估中的非确定性
无论智能体类型如何，其行为在不同运行间都存在差异，这使得评估结果比表面看起来更难解读。每项任务都有其成功率——可能某个任务达到90%，而另一个只有50%——在一次评估中通过的任务，下次可能失败。有时我们真正需要衡量的是智能体对某项任务的成功频次（即在多次尝试中的成功比例）。

两个指标有助于捕捉这种复杂性：

pass@k 衡量智能体在k次尝试中至少获得一次正确解的概率。随着k值增加，pass@k分数会上升：更多"射门机会"意味着至少一次成功的几率更高。50%的pass@1分数表示模型在首次尝试时能完成评估中半数任务。在编程场景中，我们通常最关注智能体首次尝试即找到解决方案的能力——即pass@1。而在其他场景中，只要有一个方案有效，提出多个解决方案也是可接受的。

pass^k 则衡量所有k次尝试均成功的概率。随着k值增加，pass^k会下降，因为要求更多次尝试都保持一致性是更高的门槛。若智能体单次尝试成功率为75%，进行3次尝试时，全部通过的概率为(0.75)³≈42%。该指标对面向客户的智能体尤为重要，因为用户期望其每次都能可靠运行。

Both metrics are useful, and which to use depends on product requirements: pass@k for tools where one success matters, pass^k for agents where consistency is essential.

Going from zero to one: a roadmap to great evals for agents

This section lays out our practical, field-tested advice for going from no evals to evals you can trust. Think of this as a roadmap for eval-driven agent development: define success early, measure it clearly, and iterate continuously.

Collect tasks for the initial eval dataset

Step 0. Start early

We see teams delay building evals because they think they need hundreds of tasks. In reality, 20-50 simple tasks drawn from real failures is a great start. After all, in early agent development, each change to the system often has a clear, noticeable impact, and this large effect size means small sample sizes suffice. More mature agents may need larger, more difficult evals to detect smaller effects, but it’s best to take the 80/20 approach in the beginning. Evals get harder to build the longer you wait. Early on, product requirements naturally translate into test cases. Wait too long and you're reverse-engineering success criteria from a live system.

Step 1. Start with what you already test manually

Begin with the manual checks you run during development—the behaviors you verify before each release and common tasks end users try. If you're already in production, look at your bug tracker and support queue. Converting user-reported failures into test cases ensures your suite reflects actual usage; prioritizing by user impact helps you invest effort where it counts.

Step 2: Write unambiguous tasks with reference solutions

Getting task quality right is harder than it seems. A good task is one where two domain experts would independently reach the same pass/fail verdict. Could they pass the task themselves? If not, the task needs refinement. Ambiguity in task specifications becomes noise in metrics. The same applies to criteria for model-based graders: vague rubrics produce inconsistent judgments.

Each task should be passable by an agent that follows instructions correctly. This can be subtle. For instance, auditing Terminal-Bench revealed that if a task asks the agent to write a script but doesn’t specify a filepath, and the tests assume a particular filepath for the script, the agent might fail through no fault of its own. Everything the grader checks should be clear from the task description; agents shouldn’t fail due to ambiguous specs. With frontier models, a 0% pass rate across many trials (i.e. 0% pass@100) is most often a signal of a broken task, not an incapable agent, and a sign to double-check your task specification and graders. For each task, it’s useful to create a reference solution: a known working output that passes all graders. This proves that the task is solvable and verifies graders are correctly configured.

Step 3: Build balanced problem sets

Test both the cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization. For instance, if you only test whether the agent searches when it should, you might end up with an agent that searches for almost everything. Try to avoid class-imbalanced evals. We learned this firsthand when building evals for web search in Claude.ai. The challenge was preventing the model from searching when it shouldn’t, while preserving its ability to do extensive research when appropriate. The team built evals covering both directions: queries where the model should search (like finding the weather) and queries where it should answer from existing knowledge (like “who founded Apple?”). Striking the right balance between undertriggering (not searching when it should) or overtriggering (searching when it shouldn’t) was difficult, and took many rounds of refinements to both the prompts and the eval. As more example problems come up, we continue to add to evals to improve our coverage.

Design the eval harness and graders

Step 4: Build a robust eval harness with a stable environment

It’s essential that the agent in the eval functions roughly the same as the agent used in production, and that the environment itself doesn’t introduce further noise. Each trial should be “isolated” by starting from a clean environment. Unnecessary shared state between runs (leftover files, cached data, resource exhaustion) can cause correlated failures due to infrastructure flakiness rather than agent performance. Shared state can also artificially inflate performance. For example, in some internal evals we observed Claude gaining an unfair advantage on some tasks by examining the git history from previous trials. If multiple distinct trials fail because of the same limitation in the environment (like limited CPU memory), these trials are not independent because they’re affected by the same factor, and the eval results become unreliable for measuring agent performance.

Step 5: Design graders thoughtfully

As discussed above, great eval design involves choosing the best graders for the agent and the tasks. We recommend choosing deterministic graders where possible, LLM graders where necessary or for additional flexibility, and using human graders judiciously for additional validation.

There is a common instinct to check that agents followed very specific steps like a sequence of tool calls in the right order. We’ve found this approach too rigid and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn’t anticipate. So as not to unnecessarily punish creativity, it’s often better to grade what the agent produced, not the path it took.

For tasks with multiple components, build in partial credit. A support agent that correctly identifies the problem and verifies the customer but fails to process a refund is meaningfully better than one that fails immediately. It’s important to represent this continuum of success in results.

Model grading often takes careful iteration to validate accuracy. LLM-as-judge graders should be closely calibrated with human experts to gain confidence that there is little divergence between the human grading and model grading. To avoid hallucinations, give the LLM a way out, like providing an instruction to return “Unknown” when it doesn’t have enough information. It can also help to create clear, structured rubrics to grade each dimension of a task, and then grade each dimension with an isolated LLM-as-judge rather than using one to grade all dimensions. Once the system is robust, it’s sufficient to use human review only occasionally.

Some evaluations have subtle failure modes that result in low scores even with good agent performance, as the agent fails to solve tasks due to grading bugs, agent harness constraints, or ambiguity. Even sophisticated teams can miss these issues. For example, Opus 4.5 initially scored 42% on CORE-Bench, until an Anthropic researcher found multiple issues: rigid grading that penalized “96.12” when expecting “96.124991…”, ambiguous task specs, and stochastic tasks that were impossible to reproduce exactly. After fixing bugs and using a less constrained scaffold, Opus 4.5’s score jumped to 95%. Similarly, METR discovered several misconfigured tasks in their time horizon benchmark that asked agents to optimize to a stated score threshold, but the grading required exceeding that threshold. This penalized models like Claude for following the instructions, while models that ignored the stated goal received better scores. Carefully double-checking tasks and graders can help avoid these problems.

Make your graders resistant to bypasses or hacks. The agent shouldn’t be able to easily “cheat” the eval. Tasks and graders should be designed so that passing genuinely requires solving the problem rather than exploiting unintended loopholes.

Maintain and use the eval long-term

Step 6: Check the transcripts

You won't know if your graders are working well unless you read the transcripts and grades from many trials. At Anthropic, we invested in tooling for viewing eval transcripts and we regularly take the time to read them. When a task fails, the transcript tells you whether the agent made a genuine mistake or whether your graders rejected a valid solution. It also often surfaces key details about agent and eval behavior.

Failures should seem fair: it’s clear what the agent got wrong and why. When scores don’t climb, we need confidence that it’s due to agent performance and not the eval. Reading transcripts is how you verify that your eval is measuring what actually matters, and is a critical skill for agent development.

Step 7: Monitor for capability eval saturation

An eval at 100% tracks regressions but provides no signal for improvement. Eval saturation occurs when an agent passes all of the solvable tasks, leaving no room for improvement. For instance, SWE-Bench Verified scores started at 30% this year, and frontier models are now nearing saturation at >80%. As evals approach saturation, progress will also slow, as only the most difficult tasks remain. This can make results deceptive, as large capability improvements appear as small increases in scores. For example, the code review startup Qodo was initially unimpressed by Opus 4.5 because their one-shot coding evals didn’t capture the gains on longer, more complex tasks. In response, they developed a new agentic eval framework, providing a much clearer picture of progress.

As a rule, we do not take eval scores at face value until someone digs into the details of the eval and reads some transcripts. If grading is unfair, tasks are ambiguous, valid solutions are penalized, or the harness constrains the model, the eval should be revised.

Step 8: Keep evaluation suites healthy long-term through open contribution and maintenance

An eval suite is a living artifact that needs ongoing attention and clear ownership to remain useful.

At Anthropic, we experimented with various approaches to eval maintenance. What proved most effective was establishing dedicated evals teams to own the core infrastructure, while domain experts and product teams contribute most eval tasks and run the evaluations themselves.

For AI product teams, owning and iterating on evaluations should be as routine as maintaining unit tests. Teams can waste weeks on AI features that “work” in early testing but fail to meet unstated expectations that a well-designed eval would have surfaced early. Defining eval tasks is one of the best ways to stress-test whether the product requirements are concrete enough to start building.

We recommend practicing eval-driven development: build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well. Internally, we often build features that work “well enough” today but are bets on what models can do in a few months. Capability evals that start at a low pass rate make this visible. When a new model drops, running the suite quickly reveals which bets paid off.

The people closest to product requirements and users are best positioned to define success. With current model capabilities, product managers, customer success managers, or salespeople can use Claude Code to contribute an eval task as a PR—let them! Or, even better, actively enable them.

How evals fit with other methods for a holistic understanding of agents

Automated evaluations can be run against an agent in thousands of tasks without deploying to production or affecting real users. But this is just one of many ways to understand agent performance. A complete picture includes production monitoring, user feedback, A/B testing, manual transcript review, and systematic human evaluation.

An overview of approaches for understanding AI agent performance

Method	Pros	Cons
Automated evals Running tests programmatically without real users	Faster iteration Fully reproducible No user impact Can run on every commit Tests scenarios at scale without requiring a prod deployment	Requires more up-front investment to build Requires ongoing maintenance as product and model evolves to avoid drift Can create false confidence if it doesn’t match real usage patterns
Production monitoring Tracking metrics and errors in live systems	Reveals real user behavior at scale Catches issues that synthetic evals miss Provides ground truth on how agents actually perform	Reactive; problems reach users before you know about them Signals can be noisy Requires investment in instrumentation Lacks ground truth for grading
A/B testing Comparing variants with real user traffic	Measures actual user outcomes (retention, task completion) Controls for confounds Scalable and systematic	Slow; days or weeks to reach significance and requires sufficient traffic Only tests changes you deploy Less signal on the underlying “why” for changes in metrics without being able to thoroughly review the transcripts
User feedback Explicit signals like thumbs-down or bug reports	Surfaces problems you didn't anticipate Comes with real examples from actual human users The feedback often correlates with product goals	Sparse and self-selected Skews toward severe issues Users rarely explain why something failed Not automated Relying primarily on users to catch issues can have negative user impact
Manual transcript review Humans reading through agent conversations	Builds intuition for failure modes Catches subtle quality issues automated checks miss Helps calibrate what "good" looks like and grasp details	Time-intensive Doesn't scale Coverage is inconsistent Reviewer fatigue or different reviewers can affect the signal quality Typically only gives qualitative signal rather than clear quantitative grading
Systematic human studies Structured grading of agent outputs by trained raters	Gold-standard quality judgements from multiple human raters Handles subjective or ambiguous tasks Provides signal for improving model-based graders	Relatively expensive and slow turnaround Hard to run frequently Inter-rater disagreement requires reconciliation Complex domains (legal, finance, healthcare) require human experts to conduct studies

These methods map to different stages of agent development. Automated evals are especially useful pre-launch and in CI/CD, running on each agent change and model upgrade as the first line of defense against quality problems. Production monitoring kicks in post-launch to detect distribution drift and unanticipated real-world failures. A/B testing validates significant changes once you have sufficient traffic. User feedback and transcript review are ongoing practices to fill the gaps: triage feedback constantly, sample transcripts to read weekly, and dig deeper as needed. Reserve systematic human studies for calibrating LLM graders or evaluating subjective outputs where human consensus serves as the reference standard.

The most effective teams combine these methods: automated evals for fast iteration, production monitoring for ground truth, and periodic human review for calibration.

Conclusion

Teams without evals get bogged down in reactive loops—fixing one failure, creating another, unable to distinguish real regressions from noise. Teams that invest early find the opposite: development accelerates as failures become test cases, test cases prevent regressions, and metrics replace guesswork. Evals give the whole team a clear hill to climb, turning “the agent feels worse” into something actionable. The value compounds, but only if you treat evals as a core component, not an afterthought.

The patterns vary by agent type, but the fundamentals described here are constant. Start early and don’t wait for the perfect suite. Source realistic tasks from the failures you see. Define unambiguous, robust success criteria. Design graders thoughtfully and combine multiple types. Make sure the problems are hard enough for the model. Iterate on the evaluations to improve their signal-to-noise ratio. Read the transcripts!

AI agent evaluation is still a nascent, fast-evolving field. As agents take on longer tasks, collaborate in multi-agent systems, and handle increasingly subjective work, we will need to adapt our techniques. We’ll keep sharing best practices as we learn more.

Acknowledgements

Written by Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe. We're also grateful to David Hershey, Gian Segato, Mike Merrill, Alex Shaw, Nicholas Carlini, Ethan Dixon, Pedram Navid, Jake Eaton, Alyssa Baum, Lina Tawfik, Karen Zhou, Alexander Bricken, Sam Kennedy, Robert Ying, and others for their contributions. Special thanks to the customers and partners we have learned from through collaborating on evals, including iGent, Cognition, Bolt, Sierra, Vals.ai, Macroscope, PromptLayer, Stripe, Shopify, the Terminal Bench team, and more. This work reflects the collective efforts of several teams who helped develop the practice of evaluations at Anthropic.

Appendix: Eval frameworks

Several open-source and commercial frameworks can help teams implement agent evaluations without building infrastructure from scratch. The right choice depends on your agent type, existing stack, and whether you need offline evaluation, production observability, or both.

Harbor is designed for running agents in containerized environments, with infrastructure for running trials at scale across cloud providers and a standardized format for defining tasks and graders. Popular benchmarks like Terminal-Bench 2.0 ship through the Harbor registry, making it easy to run established benchmarks along with custom eval suites.

Braintrust is a platform that combines offline evaluation with production observability and experiment tracking—useful for teams that need to both iterate during development and monitor quality in production. Its `autoevals` library includes pre-built scorers for factuality, relevance, and other common dimensions.

LangSmith offers tracing, offline and online evaluations, and dataset management with tight integration into the LangChain ecosystem. Langfuse provides similar capabilities as a self-hosted open-source alternative for teams with data residency requirements.

Arize offers Phoenix, an open-source platform for LLM tracing, debugging, and offline or online evaluations, and AX, a SaaS offering that extends Phoenix for scale, optimization and monitoring.

Many teams combine multiple tools, roll their own eval framework, or just use simple evaluation scripts as a starting point. We find that while frameworks can be a valuable way to accelerate progress and standardize, they’re only as good as the eval tasks you run through them. It’s often best to quickly pick a framework that fits your workflow, then invest your energy in the evals themselves by iterating on high-quality test cases and graders.

https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

AtomGit开源社区

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念，把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起，为开发者提供从开发、训练到部署的一站式体验。

更多推荐

C++内存管理终极指南：从智能指针到RAII

AtomGit开源社区

LlamaFactory v0.9.5 发布：Qwen3.5/Qwen3.6/Gemma4 全面支持，Transformers v5 兼容性正式到位

代码地址：github.com/hiyouga/LlamaFactory总体来看，LlamaFactory v0.9.5 是一个覆盖面极广、工程含量很高的版本。它的重点并不只是“新增几个模型”，而是围绕这个核心目标，把模型支持、训练框架、分布式能力、多模态处理、模板配置、CI 环境、文档说明一起往前推进了一大步。Qwen3.5Qwen3.6Gemma4FSDP2DeepSpeed量化多模态v1 训

AtomGit开源社区

AI Agent辅助代码调试：三种核心方法的逻辑关系与决策框架

AI 仅读取代码，推理逻辑错误。输入为代码文本，输出为错误位置、原因及修复建议。不依赖任何运行时环境。理解三种方法的逻辑关系比掌握具体操作更重要。它们形成基于成本与症状的层次决策结构方法成本定位目标静态分析低过滤语法/结构错误单元测试中定位函数级运行时错误集成测试高暴露模块交互与真实环境问题决策顺序是排除法：静态 → 单元 → 集成，每步检查 bug 是否消失。互补关系确保每种方法都能为其他方法提