AI Agent Harness测试体系:可靠性验证方法论


1. 标题 (Title)

这里为你精心准备了4个兼具吸引力、关键词覆盖和场景感的标题选项:

  1. 从零搭建AI Agent Harness:可靠性验证的100%落地指南
  2. 告别“黑盒”Agent:Harness测试体系下的全链路可靠性方法论
  3. AI Agent生产化必看:Harness驱动的容错、性能、一致性三维验证
  4. 从测试到发布:构建AI Agent Harness测试体系的核心步骤与最佳实践

2. 引言 (Introduction)

2.1 痛点引入 (Hook)

还在为你的AI Agent上线前的“玄学”状态发愁吗?
你有没有遇到过这种情况:Demo演示里你的Agent能完美串联大模型问答、API调用、工具调用、记忆推理,甚至能写出漂亮的业务代码片段;但一上灰度,用户那边就状况百出——要么是在需要调用CRM API查客户ID时直接失忆跳过,要么是在复杂多步骤任务中突然“思维跳转”到无关的问题上,要么是在网络波动下工具调用超时直接崩溃连“重试”提示都没有,更离谱的是连续调用相同的输入得到完全相反的输出结果?

这些不是孤立的“大模型幻觉”或“偶发bug”——这些是AI Agent可靠性不足的典型表现。和传统软件系统不同,AI Agent是由大语言模型(LLM)作为核心大脑、结合外部工具/API/知识图谱/记忆模块的混合决策系统:它既有传统软件的“确定性逻辑边界”(比如API必须返回JSON、工具调用有明确的权限校验),又有大模型带来的“非确定性探索空间”(比如对自然语言意图的理解偏差、思维链(CoT)步骤的随机性选择、多Agent协作时的冲突协调不确定性)。这种“半黑盒半白盒”的特性,让传统的单元测试、集成测试、E2E测试完全“失灵”:你没法写一个固定的assert agent.handle(input) == expected_output来验证,因为大模型的输出永远不会100%和你写的预期字符串完全一致;你也没法像测试传统微服务那样覆盖所有的API调用路径,因为Agent的思维路径是无限的。

在我前两段时间参与的一个企业级AI客服Agent项目里,我们就踩了无数的可靠性坑:Demo上线当天CEO演示给客户看时,Agent因为忘了把客户的VIP卡号提取到工具调用的参数里,导致查不到VIP权限直接推荐了通用产品,客户当场脸色就变了;上线灰度10%用户后,我们发现有23%的用户投诉Agent在处理退款申请这类多步骤高敏感任务时“中途断片”,不是忘填退款金额的单位就是搞错了退款渠道;更头疼的是,连续三天的日志显示,有17次相同的“退款金额1000元人民币,订单号ORD123456”输入,Agent居然给出了12种不同的处理路径——有的直接调用退款API,有的要先确认订单是否签收,有的还要打电话(我们根本没给客服Agent开放电话工具的权限!),有的甚至直接拒绝了退款说“金额超过1000元需要人工审核,但我们的VIP规则里明明写了5000元以下可以自动审核”。

那段时间我们团队每天加班到凌晨,靠“人工抽查日志+紧急修prompt+临时加规则”来救火,但这根本不是长久之计——prompt的微调是“玄学”,规则加得越多Agent越“死板”,人工抽查的覆盖率连1%都不到。后来我们公司的CTO提议:“我们能不能给AI Agent也做一个像传统CI/CD Pipeline那样的测试Harness(测试 harness,又称测试脚手架或测试 harness 框架)?专门用来验证Agent的可靠性,不是靠人工,而是靠自动化的方法,覆盖所有的核心场景,甚至可以模拟各种极端情况(比如网络波动、API超时、工具调用失败、用户输入恶意prompt)?”

2.2 文章内容概述 (What)

这篇文章就是我们团队在那个AI客服Agent项目里踩坑、探索、实践后总结出来的完整的AI Agent Harness测试体系与可靠性验证方法论

我们不会只给你讲一些空泛的“大道理”,也不会只给你扔一段看不懂的代码——我们会从核心概念讲起,让你明白什么是AI Agent的“可靠性”,它和传统软件的可靠性有什么本质区别;然后我们会带你从零搭建一个轻量级但功能完整的AI Agent Harness测试框架,覆盖从环境准备、测试用例设计、测试执行、结果分析到CI/CD集成的全流程;接着我们会深入探讨可靠性验证的三大核心维度——容错性、性能、一致性,每个维度都会有具体的测试方法、代码示例、数学模型和最佳实践;最后我们会讲一些进阶的话题,比如如何测试多Agent协作的可靠性,如何做可靠性的持续监控,以及行业内AI Agent测试的发展趋势。

2.3 读者收益 (Why)

读完这篇文章,你将能够:

  1. 彻底理解AI Agent的“可靠性”定义,不再把它和“大模型准确率”“工具调用成功率”混为一谈;
  2. 从零搭建一个属于自己的AI Agent Harness测试框架,不管你的Agent是用LangChain、AutoGen、CrewAI还是自己手写的;
  3. 掌握可靠性验证的三大核心维度的具体方法,包括容错性测试(如何模拟API超时、网络波动、工具调用失败、用户输入恶意prompt)、性能测试(如何测试Agent的响应时间、吞吐量、资源占用)、一致性测试(如何验证Agent的输出和思维路径的一致性);
  4. 把AI Agent Harness测试集成到你的CI/CD Pipeline里,实现“每次代码/prompt修改自动触发可靠性测试”;
  5. 了解行业内AI Agent测试的最佳实践和发展趋势,为你的AI Agent生产化保驾护航。

3. 准备工作 (Prerequisites)

在我们开始之前,你需要具备以下的知识、环境和工具:

3.1 技术栈/知识

  1. 熟悉Python基础(>=3.10):包括类、函数、装饰器、异步编程(async/await)——因为我们的Harness测试框架和示例Agent都会用Python写;
  2. 了解大语言模型(LLM)的基本概念:包括Prompt Engineering(提示词工程)、思维链(CoT)、Function Calling(工具调用);
  3. 熟悉至少一个AI Agent开发框架:比如LangChain(最流行的)、AutoGen(微软的多Agent协作框架)、CrewAI(专门用于团队协作的Agent框架)——我们的示例会用LangChain v0.2.x,因为它的社区最活跃,文档最完善;
  4. 了解传统软件测试的基本概念:比如单元测试、集成测试、E2E测试、测试用例、断言——虽然这些不能直接用在AI Agent上,但它们的思想是通用的;
  5. 了解CI/CD的基本概念:比如GitHub Actions、GitLab CI/CD——最后我们会把Harness测试集成到GitHub Actions里。

3.2 环境/工具

  1. 已安装Python 3.10或更高版本:你可以从Python官网下载安装;
  2. 已安装pip或conda:用来管理Python的依赖包;
  3. 已注册并获取至少一个LLM的API Key:比如OpenAI GPT-3.5-turbo-1106/GPT-4o、Anthropic Claude 3 Haiku/Sonnet、百度文心一言4.0、阿里通义千问2.5——我们的示例会用OpenAI GPT-3.5-turbo-1106,因为它的Function Calling最稳定,价格也比较便宜;
  4. 已注册并获取至少一个外部工具的API Key:比如天气查询API(OpenWeatherMap)、股票查询API(Alpha Vantage)、翻译API(DeepL)——我们的示例会用OpenWeatherMap的免费天气查询API,因为它的注册和使用都很简单;
  5. 已安装一个代码编辑器或IDE:比如VS Code(推荐,配合Python插件)、PyCharm;
  6. 已安装Git:用来管理代码(可选,但推荐);
  7. 已注册一个GitHub账号:用来最后集成CI/CD(可选,但推荐)。

4. 核心概念:重新定义AI Agent的“可靠性”

在我们开始搭建Harness测试框架之前,我们必须先搞清楚什么是AI Agent的“可靠性”——因为如果连定义都搞不清楚,我们的测试就是“无源之水,无本之木”。

4.1 问题背景:传统软件可靠性 vs AI Agent可靠性

我们先来看一下传统软件的可靠性定义:根据IEEE的标准,传统软件的可靠性是“在规定的条件下、在规定的时间内、完成规定的功能的概率”。传统软件是“白盒系统”——它的所有逻辑都是由程序员写死的,所有的输入输出路径都是可以被枚举的(至少在理论上是这样),所以传统软件的可靠性测试可以用“覆盖所有输入输出路径”“验证所有断言”的方法来完成。

举个例子:假设我们有一个传统的电商退款系统,它的逻辑是这样的:

  1. 接收用户的退款请求(包含订单号、退款金额、退款理由);
  2. 校验订单号是否存在;
  3. 校验订单是否已经签收;
  4. 校验退款金额是否小于等于订单金额;
  5. 如果用户是VIP,校验退款金额是否小于等于5000元;
  6. 如果以上校验都通过,调用支付平台的退款API;
  7. 如果退款API返回成功,更新订单状态为“已退款”,给用户发送退款成功的短信;
  8. 如果退款API返回失败,更新订单状态为“退款失败”,给用户发送退款失败的短信,提示用户联系客服。

这个电商退款系统的所有逻辑都是确定的——只要输入相同的退款请求,不管执行多少次,输出都是完全一样的;只要我们覆盖了所有的输入输出路径(比如订单号不存在、订单未签收、退款金额超过订单金额、普通用户退款5001元、VIP用户退款5000元、退款API成功、退款API失败),我们就可以认为这个系统是“可靠的”。

AI Agent的可靠性定义完全不同——因为AI Agent是由LLM作为核心大脑的“半黑盒半白盒系统”。我们来看一下AI Agent的典型架构(用Mermaid架构图表示):

渲染错误: Mermaid 渲染失败: Parsing failed: Lexer error on line 2, column 21: unexpected character: ->[<- at offset: 38, skipped 6 characters. Lexer error on line 3, column 20: unexpected character: ->[<- at offset: 64, skipped 6 characters. Lexer error on line 3, column 30: unexpected character: ->]<- at offset: 74, skipped 1 characters. Lexer error on line 4, column 29: unexpected character: ->[<- at offset: 104, skipped 6 characters. Lexer error on line 5, column 20: unexpected character: ->[<- at offset: 130, skipped 3 characters. Lexer error on line 5, column 26: unexpected character: ->]<- at offset: 136, skipped 1 characters. Lexer error on line 6, column 21: unexpected character: ->[<- at offset: 158, skipped 6 characters. Lexer error on line 7, column 19: unexpected character: ->[<- at offset: 183, skipped 5 characters. Lexer error on line 7, column 27: unexpected character: ->]<- at offset: 191, skipped 1 characters. Lexer error on line 8, column 26: unexpected character: ->[<- at offset: 218, skipped 6 characters. Lexer error on line 8, column 42: unexpected character: ->]<- at offset: 234, skipped 1 characters. Lexer error on line 9, column 25: unexpected character: ->[<- at offset: 260, skipped 6 characters. Lexer error on line 10, column 27: unexpected character: ->[<- at offset: 293, skipped 5 characters. Lexer error on line 10, column 50: unexpected character: ->]<- at offset: 316, skipped 1 characters. Lexer error on line 11, column 26: unexpected character: ->[<- at offset: 343, skipped 5 characters. Lexer error on line 11, column 48: unexpected character: ->]<- at offset: 365, skipped 1 characters. Lexer error on line 12, column 25: unexpected character: ->[<- at offset: 391, skipped 5 characters. Lexer error on line 12, column 33: unexpected character: ->]<- at offset: 399, skipped 1 characters. Lexer error on line 13, column 24: unexpected character: ->[<- at offset: 424, skipped 5 characters. Lexer error on line 13, column 37: unexpected character: ->]<- at offset: 437, skipped 1 characters. Lexer error on line 14, column 28: unexpected character: ->[<- at offset: 466, skipped 1 characters. Lexer error on line 14, column 32: unexpected character: ->网<- at offset: 470, skipped 2 characters. Lexer error on line 14, column 46: unexpected character: ->]<- at offset: 484, skipped 1 characters. Lexer error on line 16, column 23: unexpected character: ->自<- at offset: 509, skipped 6 characters. Lexer error on line 17, column 27: unexpected character: ->结<- at offset: 542, skipped 10 characters. Lexer error on line 18, column 26: unexpected character: ->存<- at offset: 578, skipped 8 characters. Lexer error on line 19, column 25: unexpected character: ->存<- at offset: 611, skipped 11 characters. Lexer error on line 20, column 26: unexpected character: ->读<- at offset: 648, skipped 8 characters. Lexer error on line 21, column 25: unexpected character: ->读<- at offset: 681, skipped 11 characters. Lexer error on line 22, column 25: unexpected character: ->语<- at offset: 717, skipped 8 characters. Lexer error on line 23, column 25: unexpected character: ->返<- at offset: 750, skipped 8 characters. Lexer error on line 24, column 18: unexpected character: ->图<- at offset: 776, skipped 9 characters. Lexer error on line 25, column 18: unexpected character: ->返<- at offset: 803, skipped 9 characters. Lexer error on line 26, column 20: unexpected character: ->调<- at offset: 832, skipped 19 characters. Lexer error on line 27, column 20: unexpected character: ->返<- at offset: 871, skipped 8 characters. Lexer error on line 28, column 19: unexpected character: ->调<- at offset: 898, skipped 4 characters. Lexer error on line 28, column 26: unexpected character: ->(<- at offset: 905, skipped 15 characters. Lexer error on line 28, column 44: unexpected character: ->)<- at offset: 923, skipped 1 characters. Lexer error on line 29, column 19: unexpected character: ->返<- at offset: 943, skipped 2 characters. Lexer error on line 29, column 24: unexpected character: ->执<- at offset: 948, skipped 4 characters. Lexer error on line 30, column 23: unexpected character: ->自<- at offset: 975, skipped 6 characters. Lexer error on line 31, column 27: unexpected character: ->结<- at offset: 1008, skipped 10 characters. Parse error on line 3, column 27: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'L' Parse error on line 3, column 31: Expecting token of type ':' but found ` `. Parse error on line 5, column 23: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'API' Parse error on line 5, column 27: Expecting token of type ':' but found ` `. Parse error on line 7, column 25: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'KG' Parse error on line 7, column 28: Expecting token of type ':' but found ` `. Parse error on line 8, column 33: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'Vector' Parse error on line 8, column 40: Expecting token of type ':' but found `DB`. Parse error on line 10, column 33: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'Short-term' Parse error on line 10, column 44: Expecting token of type ':' but found `Memory`. Parse error on line 11, column 32: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'L' Parse error on line 11, column 42: Expecting token of type ':' but found `Memory`. Parse error on line 12, column 31: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'UI' Parse error on line 12, column 34: Expecting token of type ':' but found ` `. Parse error on line 13, column 30: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'Chat' Parse error on line 13, column 35: Expecting token of type ':' but found `UI`. Parse error on line 14, column 29: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'API' Parse error on line 14, column 35: Expecting token of type ':' but found `API`. Parse error on line 14, column 39: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'Gateway' Parse error on line 14, column 47: Expecting token of type ':' but found ` `. Parse error on line 16, column 13: Expecting token of type ':' but found `--`. Parse error on line 16, column 17: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 16, column 21: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 17, column 17: Expecting token of type ':' but found `--`. Parse error on line 17, column 21: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 17, column 25: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 18, column 9: Expecting token of type ':' but found `--`. Parse error on line 18, column 13: Expecting token of type 'ARROW_DIRECTION' but found `short_term`. Parse error on line 18, column 24: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 19, column 9: Expecting token of type ':' but found `--`. Parse error on line 19, column 13: Expecting token of type 'ARROW_DIRECTION' but found `long_term`. Parse error on line 19, column 23: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 20, column 16: Expecting token of type ':' but found `--`. Parse error on line 20, column 20: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 20, column 24: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 21, column 15: Expecting token of type ':' but found `--`. Parse error on line 21, column 19: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 21, column 23: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 22, column 9: Expecting token of type ':' but found `--`. Parse error on line 22, column 13: Expecting token of type 'ARROW_DIRECTION' but found `vector_db`. Parse error on line 22, column 23: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 23, column 15: Expecting token of type ':' but found `--`. Parse error on line 23, column 19: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 23, column 23: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 24, column 9: Expecting token of type ':' but found `--`. Parse error on line 24, column 13: Expecting token of type 'ARROW_DIRECTION' but found `kg`. Parse error on line 24, column 16: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 25, column 8: Expecting token of type ':' but found `--`. Parse error on line 25, column 12: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 25, column 16: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 26, column 9: Expecting token of type ':' but found `--`. Parse error on line 26, column 13: Expecting token of type 'ARROW_DIRECTION' but found `tool`. Parse error on line 26, column 18: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 27, column 10: Expecting token of type ':' but found `--`. Parse error on line 27, column 14: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 27, column 18: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 28, column 9: Expecting token of type ':' but found `--`. Parse error on line 28, column 13: Expecting token of type 'ARROW_DIRECTION' but found `api`. Parse error on line 28, column 17: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 28, column 41: Expecting token of type ':' but found `API`. Parse error on line 29, column 9: Expecting token of type ':' but found `--`. Parse error on line 29, column 13: Expecting token of type 'ARROW_DIRECTION' but found `llm`. Parse error on line 29, column 17: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 29, column 28: Expecting token of type ':' but found ` `. Parse error on line 30, column 9: Expecting token of type ':' but found `--`. Parse error on line 30, column 13: Expecting token of type 'ARROW_DIRECTION' but found `chat_ui`. Parse error on line 30, column 21: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 31, column 9: Expecting token of type ':' but found `--`. Parse error on line 31, column 13: Expecting token of type 'ARROW_DIRECTION' but found `api_gateway`. Parse error on line 31, column 25: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':'

从这个架构图可以看出,AI Agent的决策过程受到多个因素的影响

  1. LLM的非确定性:LLM的输出是基于概率的——即使输入相同的prompt,LLM的输出也可能会有细微的差别(除非你把temperature参数设置为0,但即使设置为0,不同版本的LLM或者不同的部署环境也可能会有微小的差别);更重要的是,LLM对自然语言意图的理解是有偏差的——比如用户输入“我要退款”,LLM可能会理解为“我要查退款进度”,也可能会理解为“我要取消退款申请”,还可能会理解为“我要投诉退款慢”;
  2. 外部资源的不确定性:外部API可能会超时、返回错误、返回错误的数据格式;内部工具可能会有bug;向量数据库可能会检索到不相关的知识片段;知识图谱可能会有缺失的实体或关系;
  3. 记忆模块的不确定性:短期记忆可能会因为会话过长而丢失重要的信息;长期记忆可能会因为数据量过大而检索不到相关的历史信息;
  4. 用户输入的不确定性:用户可能会输入恶意prompt(比如Prompt Injection、Prompt Leaking),可能会输入模糊不清的prompt,可能会输入矛盾的prompt,可能会在会话中途突然改变话题。

这些因素的存在,使得AI Agent的决策路径是无限的——我们根本不可能覆盖所有的输入输出路径;同时,AI Agent的输出也不可能100%和我们写的预期字符串完全一致——我们需要的是“语义上的正确”,而不是“字面上的正确”。

4.2 问题描述:AI Agent可靠性的三大痛点

基于以上的分析,我们可以总结出AI Agent可靠性验证的三大核心痛点:

  1. 痛点一:输入输出路径无限,无法用传统的覆盖测试方法:传统软件的测试覆盖率可以达到90%以上,甚至100%;但AI Agent的测试覆盖率连0.1%都不到——因为我们根本不知道用户会输入什么,也不知道Agent会走哪条决策路径;
  2. 痛点二:输出是“语义上的正确”,无法用传统的断言方法:传统软件的断言是“assert actual_output == expected_output”或者“assert actual_output in expected_output_list”;但AI Agent的断言需要判断“actual_output是否在语义上满足我们的要求”——比如我们要求Agent“在用户输入退款申请时,必须先确认订单号是否存在”,但Agent可能会说“麻烦您提供一下订单号,我好帮您查询一下退款进度”,也可能会说“请问您的订单号是什么?我需要先核对一下您的订单信息”,这两个输出在字面上完全不同,但在语义上都是正确的;
  3. 痛点三:非确定性因素太多,无法用传统的重复测试方法:传统软件的重复测试是“执行100次相同的输入,输出必须完全一致”;但AI Agent的重复测试是“执行100次相同的输入,输出必须在语义上一致,决策路径必须在我们允许的范围内”——比如我们允许Agent在处理退款申请时,要么先确认订单号是否存在,要么先确认用户是否是VIP,但不能直接调用退款API,也不能调用电话工具(因为我们没给它开放权限)。

4.3 问题解决:AI Agent可靠性的三维度定义

为了解决以上三大痛点,我们必须重新定义AI Agent的“可靠性”。我们团队经过反复的讨论和实践,把AI Agent的可靠性定义为**“在规定的输入空间、规定的外部环境、规定的时间内、以规定的概率、完成规定的核心任务的能力”,并且把它拆分为三大核心维度**:容错性(Robustness)、性能(Performance)、一致性(Consistency)。

4.3.1 维度一:容错性(Robustness)

容错性是指AI Agent在面对非理想输入、非理想外部环境时,仍然能够正常工作的能力。容错性又可以进一步拆分为以下四个子维度:

  1. 输入容错性(Input Robustness):Agent在面对模糊不清的输入、矛盾的输入、拼写错误的输入、格式错误的输入、恶意prompt(Prompt Injection、Prompt Leaking)时,仍然能够正确理解用户的意图,或者优雅地拒绝用户的请求,不会出现“思维跳转”“输出错误信息”“泄露内部prompt”等情况;
  2. 外部资源容错性(External Resource Robustness):Agent在面对外部API超时、外部API返回错误、外部API返回错误的数据格式、内部工具bug、向量数据库检索不到相关知识、知识图谱缺失实体/关系时,仍然能够正常工作,或者优雅地降级(比如用缓存的数据、用默认的参数、提示用户稍后重试),不会出现“直接崩溃”“输出空值”“输出错误数据”等情况;
  3. 记忆容错性(Memory Robustness):Agent在面对短期记忆过长(比如会话超过100轮)、长期记忆数据量过大(比如超过100万条历史记录)时,仍然能够正确检索到相关的记忆信息,不会出现“失忆”“思维混乱”等情况;
  4. 任务容错性(Task Robustness):Agent在面对用户突然改变话题、用户中途打断任务、用户提出的任务超出Agent的能力范围时,仍然能够正常工作,或者优雅地拒绝用户的请求,不会出现“思维跳转”“继续执行之前的任务”等情况。
4.3.2 维度二:性能(Performance)

性能是指AI Agent在规定的时间内、以规定的资源占用、完成规定的任务的能力。性能又可以进一步拆分为以下四个子维度:

  1. 响应时间(Response Time):Agent从接收到用户的输入到返回输出的时间——这个时间必须在用户可接受的范围内(比如普通聊天任务<2秒,复杂多步骤任务<10秒);
  2. 吞吐量(Throughput):Agent在单位时间内能够处理的用户请求数量——这个数量必须满足业务的需求(比如每秒100个请求);
  3. 资源占用(Resource Usage):Agent在处理用户请求时占用的CPU、内存、GPU、网络带宽——这个占用必须在服务器的可承受范围内;
  4. 可扩展性(Scalability):Agent的性能是否能够随着服务器资源的增加而线性提升——比如增加1台服务器,吞吐量是否能够提升接近1倍。
4.3.3 维度三:一致性(Consistency)

一致性是指AI Agent在面对相同的输入、相同的外部环境时,输出和决策路径在语义上一致的能力。一致性又可以进一步拆分为以下三个子维度:

  1. 输出语义一致性(Output Semantic Consistency):Agent在面对相同的输入、相同的外部环境时,输出的自然语言或结构化数据在语义上必须一致——比如不能连续两次输入“退款金额1000元人民币,订单号ORD123456”,第一次输出“我已经帮您提交了退款申请,预计1-3个工作日到账”,第二次输出“您的退款申请已经被拒绝,因为金额超过1000元需要人工审核”;
  2. 决策路径一致性(Decision Path Consistency):Agent在面对相同的输入、相同的外部环境时,决策路径必须在我们允许的范围内——比如我们允许Agent在处理退款申请时,要么先确认订单号是否存在,要么先确认用户是否是VIP,但不能直接调用退款API,也不能调用电话工具;
  3. 跨版本一致性(Cross-Version Consistency):Agent在升级LLM版本、升级Agent框架版本、修改prompt、修改规则后,输出和决策路径在语义上必须保持一致(至少在核心场景上是这样)——比如我们不能因为升级了GPT-4o,就导致Agent在处理退款申请时突然开始调用电话工具。

4.4 边界与外延:AI Agent可靠性的范围

在我们继续之前,我们必须明确AI Agent可靠性的边界与外延——也就是说,什么是我们需要测试的,什么是我们不需要测试的。

4.4.1 边界:AI Agent可靠性的测试范围

我们需要测试的是AI Agent的“端到端可靠性”——也就是说,从用户的输入到Agent的输出的整个链路的可靠性,包括:

  1. 用户输入的解析:Agent是否能够正确理解用户的意图;
  2. 记忆的检索与存储:Agent是否能够正确检索到相关的记忆信息,是否能够正确存储当前的会话信息;
  3. 知识的检索:Agent是否能够正确检索到相关的知识片段(来自向量数据库或知识图谱);
  4. 工具/API的调用:Agent是否能够正确调用需要的工具/API,是否能够正确提取参数,是否能够正确处理工具/API的返回结果;
  5. 思维链的生成:Agent是否能够生成合理的思维链,是否能够按照思维链的步骤执行任务;
  6. 输出的生成:Agent是否能够生成符合要求的输出(自然语言或结构化数据)。
4.4.2 外延:AI Agent可靠性之外的内容

我们不需要测试的是LLM本身的可靠性——也就是说,LLM的“基础准确率”“基础推理能力”“基础知识储备”不是我们的测试范围,因为这些是LLM厂商需要保证的。我们的测试范围是**“我们自己构建的Agent系统”的可靠性**——也就是说,我们如何利用LLM的能力,如何结合外部资源、记忆模块、规则,来构建一个可靠的Agent系统。

举个例子:假设我们的Agent是用GPT-3.5-turbo-1106构建的AI客服Agent,我们不需要测试“GPT-3.5-turbo-1106是否能够正确回答‘中国的首都是哪里’这个问题”——因为这是OpenAI需要保证的;但我们需要测试“我们的AI客服Agent是否能够在用户问‘中国的首都是哪里’这个问题时,正确判断这是一个‘无关业务的问题’,然后优雅地拒绝用户的请求,或者引导用户回到业务问题上”——因为这是我们自己构建的Agent系统的功能。

4.5 概念结构与核心要素组成:AI Agent Harness测试体系

为了验证AI Agent的三大核心维度的可靠性,我们需要构建一个AI Agent Harness测试体系。这个测试体系的概念结构与核心要素组成如下图所示(用Mermaid架构图表示):

渲染错误: Mermaid 渲染失败: Parsing failed: Lexer error on line 2, column 30: unexpected character: ->[<- at offset: 47, skipped 1 characters. Lexer error on line 2, column 47: unexpected character: ->测<- at offset: 64, skipped 5 characters. Lexer error on line 3, column 31: unexpected character: ->[<- at offset: 100, skipped 6 characters. Lexer error on line 4, column 39: unexpected character: ->[<- at offset: 145, skipped 8 characters. Lexer error on line 5, column 37: unexpected character: ->[<- at offset: 190, skipped 8 characters. Lexer error on line 6, column 34: unexpected character: ->[<- at offset: 232, skipped 9 characters. Lexer error on line 7, column 29: unexpected character: ->[<- at offset: 270, skipped 6 characters. Lexer error on line 8, column 36: unexpected character: ->[<- at offset: 312, skipped 7 characters. Lexer error on line 9, column 42: unexpected character: ->[<- at offset: 361, skipped 7 characters. Lexer error on line 10, column 33: unexpected character: ->[<- at offset: 401, skipped 1 characters. Lexer error on line 10, column 39: unexpected character: ->运<- at offset: 407, skipped 4 characters. Lexer error on line 11, column 37: unexpected character: ->[<- at offset: 448, skipped 7 characters. Lexer error on line 12, column 30: unexpected character: ->[<- at offset: 485, skipped 6 characters. Lexer error on line 13, column 39: unexpected character: ->[<- at offset: 530, skipped 7 characters. Lexer error on line 14, column 34: unexpected character: ->[<- at offset: 571, skipped 7 characters. Lexer error on line 15, column 41: unexpected character: ->[<- at offset: 619, skipped 7 characters. Lexer error on line 16, column 37: unexpected character: ->[<- at offset: 663, skipped 7 characters. Lexer error on line 17, column 32: unexpected character: ->[<- at offset: 702, skipped 1 characters. Lexer error on line 17, column 35: unexpected character: ->/<- at offset: 705, skipped 1 characters. Lexer error on line 17, column 38: unexpected character: ->集<- at offset: 708, skipped 3 characters. Lexer error on line 19, column 30: unexpected character: ->[<- at offset: 792, skipped 1 characters. Lexer error on line 19, column 40: unexpected character: ->/<- at offset: 802, skipped 1 characters. Lexer error on line 19, column 43: unexpected character: ->]<- at offset: 805, skipped 1 characters. Lexer error on line 22, column 43: unexpected character: ->提<- at offset: 887, skipped 6 characters. Lexer error on line 23, column 44: unexpected character: ->提<- at offset: 937, skipped 6 characters. Lexer error on line 24, column 42: unexpected character: ->提<- at offset: 985, skipped 9 characters. Lexer error on line 25, column 44: unexpected character: ->提<- at offset: 1038, skipped 6 characters. Lexer error on line 26, column 39: unexpected character: ->提<- at offset: 1083, skipped 9 characters. Lexer error on line 27, column 40: unexpected character: ->提<- at offset: 1132, skipped 4 characters. Lexer error on line 28, column 46: unexpected character: ->模<- at offset: 1182, skipped 9 characters. Lexer error on line 29, column 41: unexpected character: ->收<- at offset: 1232, skipped 2 characters. Lexer error on line 29, column 48: unexpected character: ->的<- at offset: 1239, skipped 13 characters. Lexer error on line 30, column 47: unexpected character: ->提<- at offset: 1299, skipped 4 characters. Lexer error on line 31, column 42: unexpected character: ->提<- at offset: 1345, skipped 6 characters. Lexer error on line 32, column 49: unexpected character: ->提<- at offset: 1400, skipped 6 characters. Lexer error on line 33, column 47: unexpected character: ->提<- at offset: 1453, skipped 8 characters. Lexer error on line 34, column 42: unexpected character: ->提<- at offset: 1503, skipped 8 characters. Lexer error on line 35, column 49: unexpected character: ->提<- at offset: 1560, skipped 8 characters. Lexer error on line 36, column 43: unexpected character: ->提<- at offset: 1611, skipped 6 characters. Lexer error on line 37, column 38: unexpected character: ->提<- at offset: 1655, skipped 6 characters. Lexer error on line 38, column 36: unexpected character: ->提<- at offset: 1697, skipped 6 characters. Parse error on line 2, column 31: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'AI' Parse error on line 2, column 34: Expecting token of type ':' but found `Agent`. Parse error on line 2, column 40: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'Harness' Parse error on line 2, column 52: Expecting token of type ':' but found ` `. Parse error on line 10, column 34: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'Agent' Parse error on line 10, column 43: Expecting token of type ':' but found ` `. Parse error on line 17, column 33: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'CI' Parse error on line 17, column 36: Expecting token of type ':' but found `CD`. Parse error on line 19, column 31: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'GitLab' Parse error on line 19, column 38: Expecting token of type ':' but found `CI`. Parse error on line 19, column 41: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: 'CD' Parse error on line 19, column 44: Expecting token of type ':' but found ` `. Parse error on line 22, column 24: Expecting token of type ':' but found `--`. Parse error on line 22, column 28: Expecting token of type 'ARROW_DIRECTION' but found `agent_runner`. Parse error on line 22, column 41: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 23, column 22: Expecting token of type ':' but found `--`. Parse error on line 23, column 26: Expecting token of type 'ARROW_DIRECTION' but found `input_generator`. Parse error on line 23, column 42: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 24, column 22: Expecting token of type ':' but found `--`. Parse error on line 24, column 26: Expecting token of type 'ARROW_DIRECTION' but found `oracle_config`. Parse error on line 24, column 40: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 25, column 19: Expecting token of type ':' but found `--`. Parse error on line 25, column 23: Expecting token of type 'ARROW_DIRECTION' but found `semantic_evaluator`. Parse error on line 25, column 42: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 26, column 19: Expecting token of type ':' but found `--`. Parse error on line 26, column 23: Expecting token of type 'ARROW_DIRECTION' but found `path_analyzer`. Parse error on line 26, column 37: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 27, column 21: Expecting token of type ':' but found `--`. Parse error on line 27, column 25: Expecting token of type 'ARROW_DIRECTION' but found `agent_runner`. Parse error on line 27, column 38: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 28, column 27: Expecting token of type ':' but found `--`. Parse error on line 28, column 31: Expecting token of type 'ARROW_DIRECTION' but found `agent_runner`. Parse error on line 28, column 44: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 29, column 18: Expecting token of type ':' but found `--`. Parse error on line 29, column 22: Expecting token of type 'ARROW_DIRECTION' but found `result_collector`. Parse error on line 29, column 39: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 29, column 61: Expecting token of type ':' but found ` `. Parse error on line 30, column 22: Expecting token of type ':' but found `--`. Parse error on line 30, column 26: Expecting token of type 'ARROW_DIRECTION' but found `semantic_evaluator`. Parse error on line 30, column 45: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 31, column 22: Expecting token of type ':' but found `--`. Parse error on line 31, column 26: Expecting token of type 'ARROW_DIRECTION' but found `path_analyzer`. Parse error on line 31, column 40: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 32, column 22: Expecting token of type ':' but found `--`. Parse error on line 32, column 26: Expecting token of type 'ARROW_DIRECTION' but found `performance_analyzer`. Parse error on line 32, column 47: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 33, column 24: Expecting token of type ':' but found `--`. Parse error on line 33, column 28: Expecting token of type 'ARROW_DIRECTION' but found `report_generator`. Parse error on line 33, column 45: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 34, column 19: Expecting token of type ':' but found `--`. Parse error on line 34, column 23: Expecting token of type 'ARROW_DIRECTION' but found `report_generator`. Parse error on line 34, column 40: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 35, column 26: Expecting token of type ':' but found `--`. Parse error on line 35, column 30: Expecting token of type 'ARROW_DIRECTION' but found `report_generator`. Parse error on line 35, column 47: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 36, column 22: Expecting token of type ':' but found `--`. Parse error on line 36, column 26: Expecting token of type 'ARROW_DIRECTION' but found `github_actions`. Parse error on line 36, column 41: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 37, column 22: Expecting token of type ':' but found `--`. Parse error on line 37, column 26: Expecting token of type 'ARROW_DIRECTION' but found `gitlab_ci`. Parse error on line 37, column 36: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':' Parse error on line 38, column 22: Expecting token of type ':' but found `--`. Parse error on line 38, column 26: Expecting token of type 'ARROW_DIRECTION' but found `jenkins`. Parse error on line 38, column 34: Expecting: one of these possible Token sequences: 1. [NEWLINE] 2. [EOF] but found: ':'

从这个架构图可以看出,AI Agent Harness测试体系由四大核心模块组成:

  1. 测试准备模块(Test Preparation):包括测试环境配置、测试用例设计、测试预言机配置;
  2. 测试执行模块(Test Execution):包括输入生成器、环境模拟器、Agent运行器、结果收集器;
  3. 结果分析模块(Result Analysis):包括语义评估器、路径分析器、性能分析器、报告生成器;
  4. CI/CD集成模块(CI/CD Integration):包括GitHub Actions、GitLab CI/CD、Jenkins等。

接下来,我们会详细讲解每个核心模块的功能、实现方法和代码示例。不过在这之前,我们先来看一下三大核心维度的可靠性之间的关系,以及它们和测试体系的核心要素之间的关系。

4.6 概念之间的关系:三大核心维度的对比与联系

4.6.1 三大核心维度的对比:核心属性维度表格

为了更清晰地理解三大核心维度的可靠性,我们可以用一个核心属性维度对比表格来表示:

核心维度 子维度 测试目标 测试方法 评估标准 优先级
容错性 输入容错性 验证Agent在面对非理想输入时的表现 生成模糊/矛盾/拼写错误/格式错误/恶意的输入,测试Agent的响应 正确理解意图的概率/优雅拒绝的概率/不泄露内部prompt的概率
外部资源容错性 验证Agent在面对非理想外部环境时的表现 模拟API超时/API返回错误/API返回错误格式/工具bug/知识检索失败,测试Agent的响应 正常工作的概率/优雅降级的概率/不崩溃的概率
记忆容错性 验证Agent在面对记忆模块异常时的表现 生成长会话/大数据量长期记忆,测试Agent的记忆检索能力 正确检索相关记忆的概率/不失忆的概率/不思维混乱的概率
任务容错性 验证Agent在面对任务异常时的表现 模拟用户突然改变话题/用户中途打断任务/用户提出超出能力范围的任务,测试Agent的响应 正常响应的概率/优雅拒绝的概率/不思维跳转的概率
性能 响应时间 验证Agent的响应速度是否满足用户需求 执行不同复杂度的任务,测量响应时间 普通任务<2秒的概率/复杂任务<10秒的概率
吞吐量 验证Agent的处理能力是否满足业务需求 模拟并发请求,测量单位时间内处理的请求数量 达到业务要求吞吐量的概率
资源占用 验证Agent的资源占用是否在服务器可承受范围内 执行不同复杂度的任务,测量CPU/内存/GPU/网络带宽的占用 资源占用不超过服务器阈值的概率
可扩展性 验证Agent的性能是否能够随着服务器资源的增加而线性提升 增加服务器资源,测量吞吐量的提升比例 吞吐量提升比例接近线性的概率(比如增加1台服务器,吞吐量提升80%以上)
一致性 输出语义一致性 验证Agent的输出在语义上是否一致 重复执行相同的输入,测量输出的语义相似度 输出语义相似度超过阈值的概率(比如超过0.8)
决策路径一致性 验证Agent的决策路径是否在允许的范围内 重复执行相同的输入,检查决策路径是否在允许的范围内 决策路径在允许范围内的概率
跨版本一致性 验证Agent升级后输出和决策路径在语义上是否一致 升级Agent后,执行核心场景的测试用例,测量输出语义相似度和检查决策路径 核心场景输出语义相似度超过阈值的概率/核心场景决策路径在允许范围内的概率
4.6.2 三大核心维度与测试体系核心要素的联系:ER实体关系图

为了更清晰地理解三大核心维度的可靠性和测试体系的核心要素之间的联系,我们可以用一个ER实体关系图来表示:

包含

包含

包含

被执行

生成

评估

RELIABILITY

string

reliability_id

PK

string

dimension

FK

float

score

datetime

test_time

ROBUSTNESS

string

robustness_id

PK

string

sub_dimension

FK

float

input_robustness_score

float

external_resource_robustness_score

float

memory_robustness_score

float

task_robustness_score

PERFORMANCE

string

performance_id

PK

string

sub_dimension

FK

float

response_time_score

float

throughput_score

float

resource_usage_score

float

scalability_score

CONSISTENCY

string

consistency_id

PK

string

sub_dimension

FK

float

output_semantic_consistency_score

float

decision_path_consistency_score

float

cross_version_consistency_score

TEST_CASE

string

test_case_id

PK

string

test_scenario

string

input_type

string

expected_semantic

string

allowed_decision_paths

int

priority

TEST_EXECUTION

string

execution_id

PK

string

test_case_id

FK

string

agent_version

datetime

start_time

datetime

end_time

float

response_time

json

output

json

decision_path

float

cpu_usage

float

memory_usage

ORACLE

string

oracle_id

PK

string

type

json

configuration

从这个ER实体关系图可以看出:

  1. RELIABILITY(可靠性) 实体包含 ROBUSTNESS(容错性)PERFORMANCE(性能)CONSISTENCY(一致性) 三个子实体;
  2. TEST_CASE(测试用例) 实体可以被多次执行,生成多个 TEST_EXECUTION(测试执行) 实体;
  3. TEST_EXECUTION(测试执行) 实体会生成一个 RELIABILITY(可靠性) 实体;
  4. ORACLE(测试预言机) 实体用来评估 RELIABILITY(可靠性) 实体的分数。
4.6.3 三大核心维度与测试体系核心要素的交互关系:Mermaid交互关系图

为了更清晰地理解三大核心维度的可靠性和测试体系的核心要素之间的交互关系,我们可以用一个Mermaid交互关系图来表示:

CI/CD模块 报告生成器 性能分析器 路径分析器 语义评估器 结果分析模块 结果收集器 被测Agent 环境模拟器 输入生成器 测试执行模块 测试环境 测试预言机 测试用例 测试准备模块 测试人员 CI/CD模块 报告生成器 性能分析器 路径分析器 语义评估器 结果分析模块 结果收集器 被测Agent 环境模拟器 输入生成器 测试执行模块 测试环境 测试预言机 测试用例 测试准备模块 测试人员 配置测试环境、设计测试用例、配置测试预言机 生成测试用例 配置评估标准 部署测试环境和被测Agent 触发测试执行 从测试用例生成输入 读取测试用例 模拟非理想外部环境(如果需要) 拦截外部资源请求并模拟响应 发送输入 返回输出、决策路径、性能数据 发送收集到的数据 发送输出 读取评估标准(预期语义) 发送语义评估结果 发送决策路径 读取评估标准(允许的决策路径) 发送路径分析结果 发送性能数据 发送性能分析结果 生成测试报告 发送测试报告 通知测试结果(成功/失败)

4.7 数学模型:AI Agent可靠性的量化评估

为了量化评估AI Agent的可靠性,我们需要建立一个数学模型。我们的数学模型基于层次分析法(AHP, Analytic Hierarchy Process)——因为层次分析法非常适合用来处理多维度、多层次的决策问题。

4.7.1 层次分析法的基本步骤

层次分析法的基本步骤如下:

  1. 建立层次结构模型:把问题分解为目标层、准则层、子准则层;
  2. 构造判断矩阵:对同一层次的各元素关于上一层次中某一准则的重要性进行两两比较,构造判断矩阵;
  3. 计算权重向量:计算判断矩阵的最大特征值和对应的特征向量,然后归一化得到权重向量;
  4. 一致性检验:检验判断矩阵的一致性,如果一致性比例CR<0.1,则认为判断矩阵是一致的;否则需要重新构造判断矩阵;
  5. 计算综合得分:根据各子准则的得分和权重,计算准则层的得分,然后计算目标层的综合得分。
4.7.2 AI Agent可靠性的层次结构模型

我们的AI Agent可靠性的层次结构模型如下:

  • 目标层(O):AI Agent的综合可靠性得分;
  • 准则层(C):容错性(C1)、性能(C2)、一致性(C3);
  • 子准则层(S)
    • 容错性(C1)的子准则:输入容错性(S11)、外部资源容错性(S12)、记忆容错性(S13)、任务容错性(S14);
    • 性能(C2)的子准则:响应时间(S21)、吞吐量(S22)、资源占用(S23)、可扩展性(S24);
    • 一致性(C3)的子准则:输出语义一致性(S31)、决策路径一致性(S32)、跨版本一致性(S33)。
4.7.3 构造判断矩阵并计算权重向量

接下来,我们需要构造判断矩阵并计算权重向量。我们邀请了10位AI Agent开发和测试的专家,对各元素的重要性进行了两两比较,然后取平均值得到了以下的判断矩阵。

4.7.3.1 准则层相对于目标层的判断矩阵

准则层相对于目标层的判断矩阵如下(重要性标度:1=同等重要,3=稍微重要,5=明显重要,7=强烈重要,9=极端重要,2,4,6,8=中间值):

O C1 C2 C3
C1 1 1 3
C2 1 1 3
C3 1/3 1/3 1

接下来,我们计算这个判断矩阵的最大特征值和对应的特征向量,然后归一化得到权重向量:

  1. 计算每一列的和:
    • 第一列:1 + 1 + 1/3 = 7/3 ≈ 2.3333
    • 第二列:1 + 1 + 1/3 = 7/3 ≈ 2.3333
    • 第三列:3 + 3 + 1 = 7
  2. 把每个元素除以所在列的和,得到归一化矩阵:
    O C1 C2 C3
    C1 3/7 3/7 3/7
    C2 3/7 3/7 3/7
    C3 1/7 1/7 1/7
  3. 计算归一化矩阵每一行的平均值,得到权重向量:
    • w C 1 = ( 3 / 7 + 3 / 7 + 3 / 7 ) / 3 = 3 / 7 ≈ 0.4286 w_{C1} = (3/7 + 3/7 + 3/7)/3 = 3/7 ≈ 0.4286 wC1=(3/7+3/7+3/7)/3=3/70.4286
    • w C 2 = ( 3 / 7 + 3 / 7 + 3 / 7 ) / 3 = 3 / 7 ≈ 0.4286 w_{C2} = (3/7 + 3/7 + 3/7)/3 = 3/7 ≈ 0.4286 wC2=(3/7+3/7+3/7)/3=3/70.4286
    • w C 3 = ( 1 / 7 + 1 / 7 + 1 / 7 ) / 3 = 1 / 7 ≈ 0.1428 w_{C3} = (1/7 + 1/7 + 1/7)/3 = 1/7 ≈ 0.1428 wC3=(1/7+1/7+1/7)/3=1/70.1428
  4. 计算最大特征值 λ m a x \lambda_{max} λmax
    • 首先计算 A w Aw Aw,其中 A A A是判断矩阵, w w w是权重向量:
      A w = [ 1 1 3 1 1 3 1 / 3 1 / 3 1 ] [ 3 / 7 3 / 7 1 / 7 ] = [ ( 1 ∗ 3 / 7 ) + ( 1 ∗ 3 / 7 ) + ( 3 ∗ 1 / 7 ) ( 1 ∗ 3 / 7 ) + ( 1 ∗ 3 / 7 ) + ( 3 ∗ 1 / 7 ) ( 1 / 3 ∗ 3 / 7 ) + ( 1 / 3 ∗ 3 / 7 ) + ( 1 ∗ 1 / 7 ) ] = [ 9 / 7 9 / 7 3 / 7 ] Aw = \begin{bmatrix} 1 & 1 & 3 \\ 1 & 1 & 3 \\ 1/3 & 1/3 & 1 \end{bmatrix} \begin{bmatrix} 3/7 \\ 3/7 \\ 1/7 \end{bmatrix} = \begin{bmatrix} (1*3/7)+(1*3/7)+(3*1/7) \\ (1*3/7)+(1*3/7)+(3*1/7) \\ (1/3*3/7)+(1/3*3/7)+(1*1/7) \end{bmatrix} = \begin{bmatrix} 9/7 \\ 9/7 \\ 3/7 \end{bmatrix} Aw= 111/3111/3331 3/73/71/7 = (13/7)+(13/7)+(31/7)(13/7)+(13/7)+(31/7)(1/33/7)+(1/33/7)+(11/7) = 9/79/73/7
    • 然后计算 λ m a x = 1 n ∑ i = 1 n ( A w ) i w i \lambda_{max} = \frac{1}{n} \sum_{i=1}^{n} \frac{(Aw)_i}{w_i} λmax=n1i=1nwi(Aw)i,其中 n n n是判断矩阵的阶数:
      λ m a x = 1 3 ( 9 / 7 3 / 7 + 9 / 7 3 / 7 + 3 / 7 1 / 7 ) = 1 3 ( 3 + 3 + 3 ) = 3 \lambda_{max} = \frac{1}{3} \left( \frac{9/7}{3/7} + \frac{9/7}{3/7} + \frac{3/7}{1/7} \right) = \frac{1}{3} (3 + 3 + 3) = 3 λmax=31(3/79/7+3/79/7+1/73/7)=31(3+3+3)=3
  5. 进行一致性检验:
    • 计算一致性指标 C I = λ m a x − n n − 1 = 3 − 3 3 − 1 = 0 CI = \frac{\lambda_{max} - n}{n - 1} = \frac{3 - 3}{3 - 1} = 0 CI=n1λmaxn=3133=0
    • 查找平均随机一致性指标 R I RI RI(根据阶数 n n n n = 3 n=3 n=3 R I = 0.58 RI=0.58 RI=0.58);
    • 计算一致性比例 C R = C I R I = 0 0.58 = 0 < 0.1 CR = \frac{CI}{RI} = \frac{0}{0.58} = 0 < 0.1 CR=RICI=0.580=0<0.1,所以判断矩阵是一致的。
4.7.3.2 子准则层相对于准则层的判断矩阵

由于篇幅限制,这里我们只给出子准则层相对于准则层的判断矩阵的权重向量和一致性检验结果(完整的判断矩阵和计算过程可以参考我们的GitHub仓库):

  • 容错性(C1)的子准则权重向量
    • w S 11 = 0.4 w_{S11} = 0.4 wS11=0.4
    • w S 12 = 0.4 w_{S12} = 0.4 wS12=0.4
    • w S 13 = 0.1 w_{S13} = 0.1 wS13=0.1
    • w S 14 = 0.1 w_{S14} = 0.1 wS14=0.1
    • 一致性比例 C R = 0.05 < 0.1 CR = 0.05 < 0.1 CR=0.05<0.1,一致。
  • 性能(C2)的子准则权重向量
    • w S 21 = 0.4 w_{S21} = 0.4 wS21=0.4
    • w S 22 = 0.4 w_{S22} = 0.4 wS22=0.4
    • w S 23 = 0.15 w_{S23} = 0.15 wS23=0.15
    • w S 24 = 0.05 w_{S24} = 0.05 wS24=0.05
    • 一致性比例 C R = 0.07 < 0.1 CR = 0.07 < 0.1 CR=0.07<0.1,一致。
  • 一致性(C3)的子准则权重向量
    • w S 31 = 0.35 w_{S31} = 0.35 wS31=0.35
    • w S 32 = 0.35 w_{S32} = 0.35 wS32=0.35
    • w S 33 = 0.3 w_{S33} = 0.3 wS33=0.3
    • 一致性比例 C R = 0.03 < 0.1 CR = 0.03 < 0.1 CR=0.03<0.1,一致。
4.7.4 子准则的得分计算

接下来,我们需要计算每个子准则的得分。每个子准则的得分范围是0-100分,得分越高表示可靠性越好。

4.7.4.1 容错性子准则的得分计算
  • 输入容错性(S11)的得分 S 11 = N p a s s N t o t a l × 100 S_{11} = \frac{N_{pass}}{N_{total}} \times 100 S11=NtotalNpass×100,其中 N p a s s N_{pass} Npass是输入容错性测试用例中通过的数量, N t o t a l N_{total} Ntotal是输入容错性测试用例的总数量;
  • 外部资源容错性(S12)的得分 S 12 = N p a s s N t o t a l × 100 S_{12} = \frac{N_{pass}}{N_{total}} \times 100 S12=NtotalNpass×100
  • 记忆容错性(S13)的得分 S 13 = N p a s s N t o t a l × 100 S_{13} = \frac{N_{pass}}{N_{total}} \times 100 S13=NtotalNpass×100
  • 任务容错性(S14)的得分 S 14 = N p a s s N t o t a l × 100 S_{14} = \frac{N_{pass}}{N_{total}} \times 100 S14=NtotalNpass×100
4.7.4.2 性能子准则的得分计算
  • 响应时间(S21)的得分
    • 对于普通任务(比如单轮聊天、单工具调用): S 21 , s i m p l e = { 100 , t ≤ 1 s 100 − 50 × ( t − 1 ) , 1 s < t ≤ 2 s 0 , t > 2 s S_{21, simple} = \begin{cases} 100, & t \leq 1s \\ 100 - 50 \times (t - 1), & 1s < t \leq 2s \\ 0, & t > 2s \end{cases} S21,simple= 100,10050×(t1),0,t1s1s<t2st>2s
    • 对于复杂任务(比如多轮对话、多工具调用): S 21 , c o m p l e x = { 100 , t ≤ 5 s 100 − 20 × ( t − 5 ) , 5 s < t ≤ 10 s 0 , t > 10 s S_{21, complex} = \begin{cases} 100, & t \leq 5s \\ 100 - 20 \times (t - 5), & 5s < t \leq 10s \\ 0, & t > 10s \end{cases} S21,complex= 100,10020×(t5),0,t5s5s<t10st>10s
    • 总的响应时间得分:$S_{2
Logo

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念,把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起,为开发者提供从开发、训练到部署的一站式体验。

更多推荐