[AI Testing Agents 8] Agent Crashed? Don't Blame the Model: Check These 12 Planning Failure Modes First (Scoring Code Included)

weixin_37899718 · Published 2026-05-28 07:41:48

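Introduction

I asked an e-commerce data analysis agent to "analyze last week's sales data and generate a weekly report", and it decomposed the request into 15 subtasks.

On closer inspection, 8 of them were duplicates. "Fetch data with search" appeared 3 times, "compute sales revenue" appeared 4 times, and "generate chart" appeared 5 times, each time with identical parameters.

During execution, the first 7 subtasks all succeeded, and then the 8 duplicate subtasks ran as well, wasting 2,000 tokens and 15 seconds.

This is not a capability problem; it is a planning quality problem. The agent can decompose a task, but it decomposes it badly. Decomposition quality directly determines execution efficiency: too fine, and you pay for more execution rounds and more tokens; too coarse, and key steps may be missed.

This article shows how to quantify planning quality along four axes: subtask count, dependency relationships, tool selection, and execution completion rate.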

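Four Dimensions of Planning Quality

Planning quality is not about whether a plan "looks good"; it comes down to four measurable dimensions.

Dimension 1: Reasonable subtask count

The subtask count reflects the agent's decomposition granularity. For anything beyond a simple task, 1-2 subtasks means the task was barely decomposed, and more than 10 means it was fragmented.

Rule-of-thumb ranges: 1-3 for simple tasks, 3-6 for medium tasks, 5-10 for complex tasks. For a medium or complex task, the counts read as follows:

Subtask count	Verdict	Reason
1	Too coarse	No decomposition, which amounts to no planning
2	Somewhat coarse	Under-decomposed, steps may be missing
3-6	Reasonable (medium task)	Appropriate granularity
7-10	Reasonable (complex task)	Enough steps to cover a complex task
>10	Too fine	Over-decomposed, inefficient to execute

Scoring: pick the range by task complexity. A count inside the range (for example 3-6 for a medium task or 5-10 for a complex task) earns the full 20 points. Each subtask below the range costs 5 points and each one above it costs 3 points, with a floor of 0.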
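The deduction rule above can be sketched as a small function. This is a minimal sketch, assuming the caller supplies the complexity label (the article does not say how complexity is determined); `RANGES` and `task_count_score` are illustrative names. Note that the full listing later in this article uses fixed thresholds on the count instead of a per-complexity range.

```python
# Assumed ranges, copied from the rule of thumb above.
RANGES = {"simple": (1, 3), "medium": (3, 6), "complex": (5, 10)}


def task_count_score(n: int, complexity: str = "medium") -> float:
    """Full 20 points inside the range; -5 per missing subtask, -3 per extra one."""
    low, high = RANGES[complexity]
    if n < low:
        penalty = (low - n) * 5
    elif n > high:
        penalty = (n - high) * 3
    else:
        penalty = 0
    return float(max(0, 20 - penalty))


print(task_count_score(4, "medium"))   # inside 3-6 -> 20.0
print(task_count_score(2, "medium"))   # one short  -> 15.0
print(task_count_score(8, "medium"))   # two extra  -> 14.0
```

Making the complexity an explicit argument keeps the range choice visible to whoever writes the test case, rather than guessing it from the plan itself.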

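Dimension 2: Dependency accuracy

Dependencies reflect how well the agent understands the task's logic. If B can only run after A finishes, that order must not be wrong.

There are four kinds of dependency errors:

Missing dependency: task_2 needs the result of task_1, but depends_on is empty
Over-dependence: task_3 has nothing to do with task_1, but depends_on includes task_1
Dependency cycle: task_1 depends on task_2 and task_2 depends on task_1 (deadlock)
Redundant dependency: hard and soft dependencies (Hard Dep / Soft Dep) are not distinguished, so overly conservative dependencies limit concurrent execution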
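To see why over-conservative dependencies hurt concurrency, it helps to group subtasks into "waves" that could run in parallel. The sketch below uses Kahn's algorithm for this; it is an illustration, not part of the article's scoring listing. `execution_levels` is a hypothetical helper name, and it assumes each subtask is a dict with `id` and `depends_on`, as in the demo data later in this article. Dependencies that point to unknown ids are ignored here.

```python
from typing import Dict, List


def execution_levels(subtasks: List[Dict]) -> List[List[str]]:
    """Group task ids into waves that can run in parallel (Kahn's algorithm).

    Raises ValueError if the remaining tasks form a dependency cycle.
    """
    ids = {s["id"] for s in subtasks}
    # Unresolved dependencies per task; dangling references are dropped.
    remaining = {
        s["id"]: {d for d in s.get("depends_on", []) if d in ids}
        for s in subtasks
    }
    levels: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


lean = [
    {"id": "t1", "depends_on": []},
    {"id": "t2", "depends_on": []},
    {"id": "t3", "depends_on": ["t1", "t2"]},
]
over_dependent = [
    {"id": "t1", "depends_on": []},
    {"id": "t2", "depends_on": ["t1"]},  # unnecessary edge serializes t2
    {"id": "t3", "depends_on": ["t1", "t2"]},
]
print(execution_levels(lean))            # [['t1', 't2'], ['t3']]
print(execution_levels(over_dependent))  # [['t1'], ['t2'], ['t3']]
```

The same three tasks need two waves in the lean plan and three in the over-dependent one, which is the concurrency cost the last list item describes.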

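Scoring: dependency accuracy × 30 points, minus a penalty for redundant dependencies.

Dependency accuracy = number of correct dependencies / total number of dependencies
Final score = dependency accuracy × 30 - redundant dependency penalty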
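Judging whether a dependency is "correct" requires something to compare against. The sketch below applies the formula above against a reference plan. It rests on several assumptions that the article does not specify: plans are given as a mapping from task id to its dependency list, the denominator is the union of actual and expected edges (so missing edges count against the score), and every extra edge is penalized by 2 points. `dependency_score` and `plan_edges` are illustrative names.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (task id, id of the task it depends on)


def plan_edges(plan: Dict[str, List[str]]) -> Set[Edge]:
    """Flatten {task: [deps]} into a set of (task, dep) edges."""
    return {(task, dep) for task, deps in plan.items() for dep in deps}


def dependency_score(actual: Dict[str, List[str]],
                     expected: Dict[str, List[str]],
                     redundant_penalty: float = 2.0) -> Dict:
    got, want = plan_edges(actual), plan_edges(expected)
    correct = got & want
    missing = want - got   # edges the agent forgot
    extra = got - want     # over-dependence / redundant edges
    total = len(got | want)
    base = 30.0 * len(correct) / total if total else 30.0
    score = max(0.0, base - redundant_penalty * len(extra))
    return {"score": score, "missing": sorted(missing), "extra": sorted(extra)}


expected = {"task_1": [], "task_2": ["task_1"], "task_3": ["task_2"]}
actual = {"task_1": [], "task_2": [], "task_3": ["task_2", "task_1"]}
print(dependency_score(actual, expected))
# 1 of 3 edges correct -> 10.0, minus 2.0 for one extra edge -> score 8.0
```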

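Dimension 3: Tool selection accuracy

Every subtask has to pick a tool. Pick the right one and execution goes smoothly; pick the wrong one and execution fails.

There are two kinds of tool selection errors:

Wrong tool: using calculator for text processing
Nonexistent tool: using "web_search" (not in the registry)

Scoring: tool accuracy × 25 points. Subtasks whose tool is "none" are excluded from the denominator.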

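Dimension 4: Execution completion rate and failure attribution

However good a plan is, it is worth nothing if it cannot be executed. The execution completion rate reflects the plan's feasibility.

Scoring: completion rate × 25 points, with failures classified by cause:

Planning error: the decomposition itself is flawed (for example a circular dependency)
Tool error: the tool call failed (network, parameters, and so on)
Timeout: a long-running step exceeded its time limit

Completion rate = number of successfully completed subtasks / total number of subtasks
Final score = completion rate × 25 - attribution penalty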
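The formula above leaves the size of the attribution penalty open. The sketch below shows one way to wire it up; the per-cause deductions in `ATTRIBUTION_PENALTY` are placeholder values chosen for illustration, and the `failure_reason` field is a hypothetical addition to each subtask record (the demo data in this article does not carry it).

```python
from collections import Counter
from typing import Dict, List, Tuple

# Placeholder deductions per failed subtask; the article does not fix these values.
ATTRIBUTION_PENALTY = {"planning": 3.0, "tool": 1.0, "timeout": 1.0}


def completion_score(subtasks: List[Dict]) -> Tuple[float, Counter]:
    """Return (score out of 25, failure counts by cause)."""
    total = len(subtasks)
    if total == 0:
        return 0.0, Counter()
    failed = [s for s in subtasks if s.get("status") != "success"]
    causes = Counter(s.get("failure_reason", "unknown") for s in failed)
    rate = (total - len(failed)) / total
    penalty = sum(ATTRIBUTION_PENALTY.get(cause, 0.0) * count
                  for cause, count in causes.items())
    return max(0.0, rate * 25.0 - penalty), causes


run = [
    {"id": "task_1", "status": "success"},
    {"id": "task_2", "status": "success"},
    {"id": "task_3", "status": "failed", "failure_reason": "planning"},
    {"id": "task_4", "status": "failed", "failure_reason": "timeout"},
]
score, causes = completion_score(run)
print(score, dict(causes))  # 0.5 * 25 - (3 + 1) = 8.5
```

Weighting planning errors more heavily than tool errors or timeouts reflects the idea that only the first kind is the planner's own fault; the exact weights should be tuned per project.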

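Duplicate Subtask Detection

Duplicate subtasks are an important indicator of planning quality. They hurt execution efficiency, and they also reveal how well the agent understood the task.

Detection strategies:

Description similarity: use embeddings or edit distance to judge how similar two task descriptions are
Parameter consistency: for calls to the same tool, check whether the parameters are identical
Distribution pattern: consecutive duplicates and scattered duplicates have different impact

Scoring rule: deduct 3-5 points for each group of duplicate subtasks found.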
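The first two strategies can be combined into a grouping pass, so that the penalty is charged per group rather than per pair. The following is a rough sketch under stated assumptions: each subtask carries a `params` dict (a hypothetical field, not present in the demo data later in this article), similarity is measured with `difflib.SequenceMatcher` rather than embeddings, and the 0.8 threshold and 3-point deduction are taken from the low end of the rule above.

```python
import difflib
import json
from typing import Dict, List


def same_call(a: Dict, b: Dict, threshold: float = 0.8) -> bool:
    """Same tool, identical parameters, and near-identical description."""
    if a.get("tool") != b.get("tool"):
        return False
    # Compare parameters independent of key order.
    if json.dumps(a.get("params", {}), sort_keys=True) != \
            json.dumps(b.get("params", {}), sort_keys=True):
        return False
    ratio = difflib.SequenceMatcher(
        None, a.get("description", ""), b.get("description", "")).ratio()
    return ratio > threshold


def duplicate_groups(subtasks: List[Dict]) -> List[List[int]]:
    """Return groups of indices that duplicate the first member of the group."""
    groups: List[List[int]] = []
    for i, task in enumerate(subtasks):
        for group in groups:
            if same_call(subtasks[group[0]], task):
                group.append(i)
                break
        else:
            groups.append([i])
    return [g for g in groups if len(g) > 1]


tasks = [
    {"description": "Generate sales chart", "tool": "code_executor", "params": {"kind": "bar"}},
    {"description": "Generate sales chart", "tool": "code_executor", "params": {"kind": "bar"}},
    {"description": "Generate sales chart", "tool": "code_executor", "params": {"kind": "line"}},
    {"description": "Compute revenue", "tool": "calculator", "params": {}},
]
groups = duplicate_groups(tasks)
print(groups, "penalty:", 3 * len(groups))  # [[0, 1]] penalty: 3
```

The third chart task is not flagged because its parameters differ, which is the distinction the "parameter consistency" strategy is meant to capture.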

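Scoring Rubric

The four dimensions are not weighted equally. Dependency accuracy matters most (30%), because once a dependency is wrong everything downstream is wrong. Tool selection and execution completion each carry 25%, and subtask count carries the least (20%).

Metric	Weight	Full-score condition	Deduction rule
Subtask count	20%	Count within the range for the task's complexity (simple 1-3, medium 3-6, complex 5-10)	5 points per subtask below the range, 3 points per subtask above it
Dependency accuracy	30%	Dependencies 100% correct, none redundant	Score = dependency accuracy × 30, minus 2 points per redundant dependency
Tool selection	25%	Tools 100% correct	Score = tool accuracy × 25; "none" tools are not counted
Execution completion	25%	Completion rate 100%	Score = completion rate × 25, with further deductions by failure cause

Total score = subtask count score + dependency accuracy score + tool selection score + execution completion score

The maximum is 100 points. A score of 80 or above is Excellent, 60-79 is Qualified, and below 60 is Unqualified.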

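Failure Mode Library

There are 12 common planning failure modes. Only by identifying the failure mode can you make a targeted fix.

#	Failure mode	Symptom	Root cause	Fix
1	Runaway subtask count	>10 subtasks	Prompt does not limit the count	State "3-8 subtasks" explicitly in the prompt
2	Too few subtasks	1-2 subtasks	Decomposition is too coarse	Give examples in the prompt
3	Dependency cycle	A→B→A	LLM logic error	Cycle detection + error report
4	Missing dependency	task_2 needs task_1 but no dependency is declared	LLM failed to identify the dependency	Emphasize dependency checking in the prompt
5	Over-dependence	Every task depends on task_1	LLM is overly cautious	State in the prompt: "use [] when there is no dependency"
6	Wrong tool selection	calculator used for text processing	Tool descriptions are unclear	Improve the tool descriptions
7	Nonexistent tool	Uses "web_search"	LLM hallucination	List the available tools in the prompt
8	Isolated task	No dependencies and no dependents	LLM generated a useless task	Isolated-task detection + filtering
9	Duplicate subtasks	Several subtasks share the same description	LLM did not deduplicate	Duplicate detection + similarity scoring
10	Plan parsing failure	Malformed JSON	LLM output format is unstable	Tolerant JSON parsing + fallback
11	Pseudo planning	Looks like 8 steps, but 6 of them are retry/fallback	Agent is overly conservative	Strengthen identification of core tasks
12	Redundant dependencies	Every task declares dependencies, limiting concurrency	Agent is overly cautious	Distinguish Hard Dep / Soft Dep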
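Failure mode 10 is the one that has to be handled before any scoring can happen. As an illustration of "tolerant JSON parsing + fallback", the sketch below tries strict parsing first and then falls back to the first bracketed span in the raw LLM output. `parse_plan` is a hypothetical helper, and it assumes the plan is a JSON array of subtask objects; a production parser would likely need more recovery strategies than this.

```python
import json
import re
from typing import Dict, List, Optional


def parse_plan(raw: str) -> Optional[List[Dict]]:
    """Parse an LLM plan: strict JSON first, then the outermost [...] span.

    Returns None when no usable plan can be recovered, so the caller can
    record the "plan parsing failure" mode and re-prompt.
    """
    candidates = [raw]
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for text in candidates:
        try:
            plan = json.loads(text)
        except json.JSONDecodeError:
            continue
        # Accept only a list of subtask objects.
        if isinstance(plan, list) and all(isinstance(t, dict) for t in plan):
            return plan
    return None


chatty = 'Sure, here is the plan: [{"id": "task_1", "tool": "calculator"}] Hope it helps.'
print(parse_plan(chatty))     # [{'id': 'task_1', 'tool': 'calculator'}]
print(parse_plan("no plan"))  # None
```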

Code: Planning Quality Scoring and Dependency Graph Validation

#!/usr/bin/env python3
"""
Task Planning Quality Scoring

Scoring dimensions:
1. Task count reasonableness (20 points)
2. Dependency accuracy (30 points)
3. Tool selection correctness (25 points)
4. Execution completion rate (25 points)

Additional features:
- Dependency graph validation (cycle detection, isolated task detection)
- Duplicate task detection
- Failure mode identification
"""

from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import difflib


@dataclass
class PlanningScore:
    """Planning scoring result"""
    total: float
    task_count_score: float
    dependency_score: float
    tool_selection_score: float
    completion_score: float
    details: Dict
    failure_modes: List[str]


VALID_TOOLS = {
    "calculator", "code_executor", "memory_store",
    "web_fetch", "safety_checker", "none", "search",
}


def calculate_similarity(str1: str, str2: str) -> float:
    """
    Calculate similarity between two strings
    
    Args:
        str1: String 1
        str2: String 2
        
    Returns:
        Similarity (0-1)
    """
    return difflib.SequenceMatcher(None, str1, str2).ratio()


def detect_duplicate_tasks(subtasks: List[Dict]) -> List[Tuple[int, int]]:
    """
    Detect duplicate subtasks
    
    Args:
        subtasks: List of subtasks
        
    Returns:
        List of index pairs for duplicate tasks
    """
    duplicates = []
    for i in range(len(subtasks)):
        for j in range(i + 1, len(subtasks)):
            desc1 = subtasks[i].get("description", "")
            desc2 = subtasks[j].get("description", "")
            similarity = calculate_similarity(desc1, desc2)
            
            # If description similarity is greater than threshold, consider it duplicate
            if similarity > 0.8:
                # Also check if tools and parameters are the same
                tool1 = subtasks[i].get("tool", "")
                tool2 = subtasks[j].get("tool", "")
                
                if tool1 == tool2:
                    duplicates.append((i, j))
    
    return duplicates


def score_task_planning(result: Dict, expected: Dict = None) -> PlanningScore:
    """
    Score task planning

    Args:
        result: Agent execution result (with _meta)
        expected: Expected planning (optional, for comparison)

    Returns:
        PlanningScore
    """
    meta = result.get("_meta", {})
    subtasks = meta.get("subtasks", [])
    details = {}
    failure_modes = []

    # ========== Dimension 1: Task count reasonableness (20 points) ==========
    n = len(subtasks)
    
    # Determine reasonable range based on task complexity
    # Here we assume we can determine complexity through certain features, adjust according to specific scenarios in practice
    if n == 0:
        task_count_score = 0.0
        details["task_count"] = "0 (no planning)"
        failure_modes.append("Too few subtasks")
    elif n == 1:
        task_count_score = 15.0
        details["task_count"] = f"{n} (reasonable for simple task)"
    elif n == 2:
        task_count_score = 17.0
        details["task_count"] = f"{n} (reasonable for simple task)"
    elif 3 <= n <= 6:
        task_count_score = 20.0
        details["task_count"] = f"{n} (reasonable for medium task)"
    elif 7 <= n <= 10:
        task_count_score = 20.0
        details["task_count"] = f"{n} (reasonable for complex task)"
    elif n <= 15:
        task_count_score = max(0, 20 - (n - 10) * 2)
        details["task_count"] = f"{n} (too fine)"
    else:
        task_count_score = max(0, 20 - (n - 10) * 3)
        details["task_count"] = f"{n} (too fine)"
        failure_modes.append("Excessive subtask count")

    # ========== Dimension 2: Dependency accuracy (30 points) ==========
    task_ids = {s["id"] for s in subtasks}
    deps_correct = 0
    deps_total = 0
    redundant_deps = 0  # Count of redundant dependencies
    dep_errors = []

    for s in subtasks:
        for dep in s.get("depends_on", []):
            deps_total += 1
            if dep in task_ids:
                deps_correct += 1
            else:
                dep_errors.append(f"{s['id']} depends on non-existent {dep}")

    # Cycle detection
    has_cycle = detect_cycle(subtasks)
    if has_cycle:
        failure_modes.append("Dependency cycle")
        dep_errors.append("Cycle detected")

    # Isolated task detection
    isolated = detect_isolated_tasks(subtasks)
    if isolated:
        failure_modes.append(f"Isolated tasks: {', '.join(isolated)}")

    # Redundant dependency detection (over-dependence), computed once per plan.
    # If more than 70% of subtasks depend on the same task, every subtask that
    # declares a dependency is counted as potentially redundant.
    from collections import Counter  # local import keeps this block self-contained
    dep_counts = Counter(
        dep for t in subtasks for dep in t.get("depends_on", [])
    )
    if (len(task_ids) > 3 and dep_counts
            and max(dep_counts.values()) > len(subtasks) * 0.7):
        redundant_deps = sum(1 for s in subtasks if s.get("depends_on"))
        failure_modes.append("Over-dependence")

    if deps_total > 0:
        dependency_score = (deps_correct / deps_total) * 30.0
        # Deduct penalty for redundant dependencies
        dependency_score = max(0, dependency_score - (redundant_deps * 2))
        details["dependency"] = f"{deps_correct}/{deps_total} correct, {redundant_deps} redundant"
    else:
        dependency_score = 30.0
        details["dependency"] = "No dependencies"

    if dep_errors:
        details["dep_errors"] = dep_errors

    # ========== Dimension 3: Tool selection correctness (25 points) ==========
    tools_correct = 0
    tools_total = 0
    tool_errors = []

    for s in subtasks:
        tool = s.get("tool", "")
        if tool and tool != "none":  # "none" is not counted in statistics
            tools_total += 1
            if tool in VALID_TOOLS:
                tools_correct += 1
            else:
                tool_errors.append(f"{s['id']} uses unknown tool '{tool}'")
                failure_modes.append("Wrong tool selection")

    if tools_total > 0:
        tool_selection_score = (tools_correct / tools_total) * 25.0
        details["tool_selection"] = f"{tools_correct}/{tools_total} correct"
    else:
        tool_selection_score = 25.0
        details["tool_selection"] = "No tool calls"

    if tool_errors:
        details["tool_errors"] = tool_errors

    # ========== Dimension 4: Execution completion and failure attribution (25 points) ==========
    success_count = meta.get("subtasks_success", 0)
    total_count = meta.get("subtasks_total", len(subtasks))
    
    # Detect duplicate tasks and calculate penalty
    duplicate_pairs = detect_duplicate_tasks(subtasks)
    duplicate_penalty = len(duplicate_pairs) * 3  # Deduct 3 points per duplicate pair

    if total_count > 0:
        completion_rate = success_count / total_count
        completion_score = completion_rate * 25.0
        # Deduct duplicate task penalty
        completion_score = max(0, completion_score - duplicate_penalty)
        
        details["completion"] = f"{success_count}/{total_count} ({completion_rate:.0%})"
        details["duplicate_penalty"] = f"Duplicate task penalty: {duplicate_penalty} points"
    else:
        completion_score = 0.0
        details["completion"] = "No subtasks"

    # ========== Total score ==========
    total = task_count_score + dependency_score + tool_selection_score + completion_score
    total = min(total, 100.0)

    # Detect duplicate tasks
    if duplicate_pairs:
        failure_modes.append(f"Duplicate tasks: {len(duplicate_pairs)} pairs")
        
    # Detect pseudo planning (too many retry/fallback steps)
    retry_tasks = [s for s in subtasks if "retry" in s.get("description", "").lower() or 
                   "fallback" in s.get("description", "").lower() or 
                   "rephrase" in s.get("description", "").lower()]
    if len(retry_tasks) > len(subtasks) * 0.5:  # If more than half are retry steps
        failure_modes.append("Pseudo planning")
    
    # Deduplicate failure modes
    failure_modes = list(set(failure_modes))

    return PlanningScore(
        total=total,
        task_count_score=task_count_score,
        dependency_score=dependency_score,
        tool_selection_score=tool_selection_score,
        completion_score=completion_score,
        details=details,
        failure_modes=failure_modes,
    )


def detect_cycle(subtasks: List[Dict]) -> bool:
    """
    Detect dependency cycles

    Args:
        subtasks: List of subtasks

    Returns:
        Whether there is a cycle
    """
    task_map = {s["id"]: s for s in subtasks}
    visited = set()
    rec_stack = set()

    def dfs(task_id):
        visited.add(task_id)
        rec_stack.add(task_id)

        task = task_map.get(task_id)
        if task:
            for dep in task.get("depends_on", []):
                if dep not in visited:
                    if dfs(dep):
                        return True
                elif dep in rec_stack:
                    return True

        rec_stack.discard(task_id)
        return False

    for s in subtasks:
        if s["id"] not in visited:
            if dfs(s["id"]):
                return True

    return False


def detect_isolated_tasks(subtasks: List[Dict]) -> List[str]:
    """
    Detect isolated tasks (no dependencies and no dependents)

    Args:
        subtasks: List of subtasks

    Returns:
        List of isolated task IDs
    """
    task_ids = {s["id"] for s in subtasks}
    depended_on = set()

    for s in subtasks:
        for dep in s.get("depends_on", []):
            depended_on.add(dep)

    isolated = []
    for s in subtasks:
        if not s.get("depends_on") and s["id"] not in depended_on:
            isolated.append(s["id"])

    return isolated


def print_score_report(score: PlanningScore):
    """Print scoring report"""
    print(f'''
{'='*60}
Planning Quality Scoring Report
{'='*60}
''')

    # Score bar
    def bar(value, max_value=100):
        filled = int(value / max_value * 20)
        return "█" * filled + "░" * (20 - filled)

    # Single braces so the f-string interpolates values instead of printing literal braces
    print(f'''
  Task count:     {score.task_count_score:5.1f}/20   {bar(score.task_count_score, 20)}
  Dependency:     {score.dependency_score:5.1f}/30   {bar(score.dependency_score, 30)}
  Tool selection: {score.tool_selection_score:5.1f}/25   {bar(score.tool_selection_score, 25)}
  Execution:      {score.completion_score:5.1f}/25   {bar(score.completion_score, 25)}
  {'─'*40}
  Total:          {score.total:5.1f}/100  {bar(score.total)}

''')

    # Rating
    if score.total >= 80:
        grade = "Excellent"
    elif score.total >= 60:
        grade = "Qualified"
    else:
        grade = "Unqualified"
    print(f"  Rating: {grade}")

    # Details
    print(f'''
  Details:
''')
    for key, value in score.details.items():
        print(f"    {key}: {value}")

    # Failure modes
    if score.failure_modes:
        print(f'''
  ⚠  Failure modes: {', '.join(score.failure_modes)}
''')

    print(f"{'='*60}")


def run_demo():
    """Demo"""
    print("=" * 60)
    print("Task Planning Quality Scoring Demo")
    print("=" * 60)

    # Test case 1: Excellent planning
    result_good = {
        "success": True,
        "output": "Result is 120, stored",
        "_meta": {
            "subtasks_total": 3,
            "subtasks_success": 3,
            "subtasks_failed": 0,
            "subtasks": [
                {"id": "task_1", "description": "Calculate", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_2", "description": "Store", "tool": "memory_store", "status": "success", "depends_on": ["task_1"], "retry_count": 0},
                {"id": "task_3", "description": "Confirm", "tool": "none", "status": "success", "depends_on": ["task_2"], "retry_count": 0},
            ],
        },
    }

    # Test case 2: Problematic planning
    result_bad = {
        "success": True,
        "output": "Execution completed",
        "_meta": {
            "subtasks_total": 12,
            "subtasks_success": 8,
            "subtasks_failed": 4,
            "subtasks": [
                {"id": "task_1", "description": "Calculate", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_2", "description": "Calculate (duplicate)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_3", "description": "Store", "tool": "memory_store", "status": "success", "depends_on": ["task_1"], "retry_count": 0},
                {"id": "task_4", "description": "Unknown tool", "tool": "web_search", "status": "failed", "depends_on": ["task_99"], "retry_count": 3},
                {"id": "task_5", "description": "Calculate (again duplicate)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_6", "description": "Store (duplicate)", "tool": "memory_store", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_7", "description": "Calculate (another duplicate)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_8", "description": "Confirm", "tool": "none", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_9", "description": "Isolated task", "tool": "none", "status": "pending", "depends_on": [], "retry_count": 0},
                {"id": "task_10", "description": "Calculate (5th time)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_11", "description": "Store (3rd time)", "tool": "memory_store", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_12", "description": "Confirm (duplicate)", "tool": "none", "status": "success", "depends_on": ["task_8"], "retry_count": 0},
            ],
        },
    }

    # Test case 3: Pseudo planning
    result_pseudo = {
        "success": True,
        "output": "Execution completed",
        "_meta": {
            "subtasks_total": 8,
            "subtasks_success": 5,
            "subtasks_failed": 3,
            "subtasks": [
                {"id": "task_1", "description": "Main task", "tool": "calculator", "status": "failed", "depends_on": [], "retry_count": 0},
                {"id": "task_2", "description": "Retry main task", "tool": "calculator", "status": "failed", "depends_on": ["task_1"], "retry_count": 0},
                {"id": "task_3", "description": "Retry again", "tool": "calculator", "status": "failed", "depends_on": ["task_2"], "retry_count": 0},
                {"id": "task_4", "description": "Fallback solution", "tool": "code_executor", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_5", "description": "Verify result", "tool": "none", "status": "success", "depends_on": ["task_4"], "retry_count": 0},
                {"id": "task_6", "description": "Retry previously failed", "tool": "calculator", "status": "success", "depends_on": ["task_5"], "retry_count": 0},
                {"id": "task_7", "description": "Rephrase and retry", "tool": "none", "status": "failed", "depends_on": ["task_6"], "retry_count": 0},
                {"id": "task_8", "description": "Final confirmation", "tool": "none", "status": "success", "depends_on": ["task_7"], "retry_count": 0},
            ],
        },
    }

    print("\n--- Test case 1: Excellent planning ---")
    score1 = score_task_planning(result_good)
    print_score_report(score1)

    print("\n--- Test case 2: Problematic planning ---")
    score2 = score_task_planning(result_bad)
    print_score_report(score2)

    print("\n--- Test case 3: Pseudo planning ---")
    score3 = score_task_planning(result_pseudo)
    print_score_report(score3)

    # Comparison
    print("=" * 60)
    print("Comparison Summary")
    print("=" * 60)
    # Single braces so the f-strings interpolate values instead of printing literal braces
    print(f'{"Metric":20s} {"Excellent":>10s} {"Problematic":>10s} {"Pseudo":>10s}')
    print("-" * 60)
    print(f'{"Total score":20s} {score1.total:10.1f} {score2.total:10.1f} {score3.total:10.1f}')
    print(f'{"Task count":20s} {score1.details["task_count"]:>10s} {score2.details["task_count"]:>10s} {score3.details["task_count"]:>10s}')
    print(f'{"Dependency":20s} {score1.details["dependency"]:>10s} {score2.details["dependency"]:>10s} {score3.details["dependency"]:>10s}')
    print(f'{"Tool selection":20s} {score1.details["tool_selection"]:>10s} {score2.details["tool_selection"]:>10s} {score3.details["tool_selection"]:>10s}')
    print(f'{"Execution":20s} {score1.details["completion"]:>10s} {score2.details["completion"]:>10s} {score3.details["completion"]:>10s}')
    print(f'{"Failure modes":20s} {len(score1.failure_modes):10d} {len(score2.failure_modes):10d} {len(score3.failure_modes):10d}')
    print("=" * 60)


if __name__ == "__main__":
    run_demo()

Comparison summary printed by the demo:

============================================================
Comparison Summary
============================================================
Metric                Excellent Problematic     Pseudo
------------------------------------------------------------
Total score               100.0       70.9       90.6
Task count           3 (reasonable for medium task) 12 (too fine) 8 (reasonable for complex task)
Dependency           2/2 correct, 0 redundant 5/6 correct, 0 redundant 7/7 correct, 0 redundant
Tool selection       2/2 correct 8/9 correct 5/5 correct
Execution            3/3 (100%) 8/12 (67%)  5/8 (62%)
Failure modes                 0          3          1
============================================================

The pseudo plan keeps a high total because the listing only records "Pseudo planning" as a failure mode; it deducts no points for it.

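Data: Planning Quality vs. Execution Completion Rate

Planning quality was scored for 50 tasks, which were then split into three groups by total score:

Planning quality	Tasks	Avg. subtask completion rate	Avg. token usage	Avg. duration
Excellent (≥80 points)	5	94%	2,996	32.7s
Qualified (60-79 points)	3	78%	6,405	79.4s
Unqualified (<60 points)	2	52%	16,775	186.8s

Note: the higher the planning quality, the higher the subtask completion rate should be overall. If a plan scores high but completes poorly, first suspect inconsistent metadata (for example subtasks_success out of sync with the statuses in subtasks), or a test case that took a path where the output looks successful but many subtasks were skipped. The table shows the typical pattern for one batch of tasks grouped by planning score, and it is consistent with the 94% / 52% figures in the summary below.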
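The metadata inconsistency mentioned in the note can be checked mechanically before scoring. The sketch below compares the counters in `_meta` with what the per-subtask `status` fields say; `meta_inconsistencies` is a hypothetical helper, and it assumes the `_meta` layout used in the demo data of this article (`subtasks_total`, `subtasks_success`, `subtasks`).

```python
from typing import Dict, List


def meta_inconsistencies(meta: Dict) -> List[str]:
    """Report mismatches between the _meta counters and the subtask records."""
    subtasks = meta.get("subtasks", [])
    succeeded = sum(1 for s in subtasks if s.get("status") == "success")
    problems = []
    total = meta.get("subtasks_total", len(subtasks))
    if total != len(subtasks):
        problems.append(f"subtasks_total={total} but {len(subtasks)} subtasks listed")
    reported = meta.get("subtasks_success", succeeded)
    if reported != succeeded:
        problems.append(f"subtasks_success={reported} but {succeeded} have status 'success'")
    return problems


meta = {
    "subtasks_total": 3,
    "subtasks_success": 1,
    "subtasks": [
        {"id": "task_1", "status": "success"},
        {"id": "task_2", "status": "success"},
        {"id": "task_3", "status": "failed"},
    ],
}
print(meta_inconsistencies(meta))
# ["subtasks_success=1 but 2 have status 'success'"]
```

Running a check like this first keeps a bookkeeping bug in the agent framework from being misread as a planning problem.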

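Deliverables

1. Planning quality scoring rubric

Metric	Weight	Full-score condition	Deduction rule
Subtask count	20 points	Count within the range for the task's complexity (simple 1-3, medium 3-6, complex 5-10)	5 points per subtask below the range, 3 points per subtask above it
Dependency accuracy	30 points	100% correct, none redundant	Score = dependency accuracy × 30, minus 2 points per redundant dependency
Tool selection	25 points	100% correct ("none" is not counted)	Score = tool accuracy × 25
Execution completion	25 points	Completion rate 100%	Score = completion rate × 25, with extra deductions for duplicate tasks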

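2. Failure mode library (12 modes)

#	Failure mode	Detection	Severity
1	Runaway subtask count	n > 10	Medium
2	Too few subtasks	n < 3	Low
3	Dependency cycle	DFS cycle detection	High
4	Missing dependency	Dependency target not in task_ids (this catches dangling references only; a genuinely missing edge needs a reference plan to compare against)	High
5	Over-dependence	All tasks depend on the same task	Medium
6	Wrong tool selection	Tool not in VALID_TOOLS (a valid but unsuitable tool needs an expected tool to compare against)	High
7	Nonexistent tool	Same as above	High
8	Isolated task	No dependencies and no dependents	Low
9	Duplicate subtasks	Description similarity > 80%	Medium
10	Plan parsing failure	JSON parsing exception	Critical
11	Pseudo planning	More than 50% of steps are retry/fallback/rephrase	High
12	Redundant dependencies	Overly conservative dependencies	Medium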

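3. Test case set (20 tasks)

ID	Task	Expected subtasks	Expected tools	Difficulty
P-01	Compute 25*4+100/5	1	calculator	Simple
P-02	Compute and store the result	2	calculator, memory_store	Simple
P-03	Compute + store + confirm	3	calculator, memory_store, none	Simple
P-04	Compute Fibonacci numbers in Python	2	code_executor	Medium
P-05	Analyze data and generate a report	4	code_executor, calculator, none	Medium
P-06	Analysis + report + email	5	code_executor, calculator, none, web_fetch	Medium
P-07	Multi-step data analysis	6	code_executor, calculator, memory_store, none	Complex
P-08	Competitor analysis (3 competitors)	5	web_fetch, code_executor, none	Complex
P-09	Automated test generation	4	code_executor, none	Medium
P-10	Content creation (search + outline + writing)	4	web_fetch, none	Medium

(The full set of 20 cases is omitted; the format is the same as above.)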
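A row of this table can be turned into an automated check with very little code. The sketch below is one possible shape, not the author's harness: `PlanCase` and `check_case` are hypothetical names, and the comparison is deliberately loose (exact subtask count, set comparison for tools).

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class PlanCase:
    case_id: str
    task: str
    expected_count: int
    expected_tools: Set[str]


def check_case(case: PlanCase, subtasks: List[Dict]) -> Dict:
    """Compare an agent's plan against the expectations of one table row."""
    used = {s.get("tool", "none") for s in subtasks}
    return {
        "case_id": case.case_id,
        "count_ok": len(subtasks) == case.expected_count,
        "missing_tools": sorted(case.expected_tools - used),
        "unexpected_tools": sorted(used - case.expected_tools),
    }


p03 = PlanCase("P-03", "Compute + store + confirm", 3,
               {"calculator", "memory_store", "none"})
plan = [
    {"id": "task_1", "tool": "calculator"},
    {"id": "task_2", "tool": "web_search"},  # not what the case expects
    {"id": "task_3", "tool": "none"},
]
print(check_case(p03, plan))
```

For this plan the count matches, `memory_store` is reported as missing, and `web_search` is reported as unexpected.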

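Summary

Task planning is an agent's most central capability. Planning quality determines execution efficiency: a sensible decomposition executes smoothly, while a poor one wastes resources and fails more easily.

Four dimensions quantify planning quality: subtask count (20 points), dependency accuracy (30 points), tool selection (25 points), and execution completion (25 points). Duplicate subtask detection and failure attribution are added on top to make the evaluation more complete.

Tasks with high planning quality reach a 94% execution completion rate; tasks with low planning quality reach only 52%. A drop of roughly 30 points in planning quality cuts the completion rate almost in half. Planning is not optional; it is what decides success or failure.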

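Interview Questions

Q1: When testing task planning, which dimensions do you focus on?

A: Four core dimensions: 1) Decomposition quality: can a complex task be split into reasonable subtasks (too coarse means under-planning, too fine means wasted performance); 2) Dependency identification: are the dependencies between subtasks correct (B depends on the result of A); 3) Tool matching: does each subtask pick the right tool; 4) Exception handling: after a tool failure, does the agent replan instead of giving up. On top of these, I also check for duplicate subtasks and pseudo planning.

Q2: What is the scoring standard for planning ability, and how do you quantify it?

A: A 5-point scale: 1 = cannot plan at all; 2 = can decompose, but the order is wrong; 3 = correct decomposition but unsuitable granularity; 4 = reasonable plan with correct tool selection; 5 = optimal plan with graceful error recovery. Each item is scored by comparing the actual output against the rubric. The expected subtask range is adjusted for task complexity: 1-3 subtasks for simple tasks, 3-6 for medium tasks, and 5-10 for complex tasks.

Q3: What is a common example of a planning failure?

A: Two cases are common: 1) Oversimplification: the agent splits a complex task into 2-3 vague subtasks, each of which is still complex. For example, "analyze sales data" is split into just "read the data" and "generate the report", with the key intermediate steps (data cleaning, anomaly detection, chart generation) missing. 2) Pseudo planning: the plan appears to have 8 steps, but 6 of them are retry/fallback/rephrase, and only 2 actually advance the task. The plan looks detailed, but most of its steps are retry logic for handling failures rather than real task decomposition.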
