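Introduction

I asked an e-commerce data-analysis agent to "analyze last week's sales data and generate a weekly report." It broke the request into 15 subtasks.

On closer inspection, 8 of them were duplicates. "Pull data via search" appeared 3 times, "calculate sales" 4 times, and "generate chart" 5 times, all with the same parameters.

At execution time the first 7 subtasks succeeded, and then the 8 duplicate subtasks ran as well, wasting 2,000 tokens and 15 seconds.

This is not a capability problem; it is a planning-quality problem. The agent can decompose a task, but it decomposes it badly. Decomposition quality directly determines execution efficiency: split too finely and you pay for more execution rounds and more tokens; split too coarsely and you risk missing key steps.

This article shows how to quantify planning quality along four axes: subtask count, dependency relationships, tool selection, and execution completion rate.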

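The four dimensions of planning quality

Planning quality is not a matter of "does it look good"; it breaks down into four measurable dimensions.

Dimension 1: Reasonableness of the subtask count

The subtask count reflects the granularity of the agent's decomposition. For a non-trivial task, too coarse (1-2 subtasks) means the task was barely decomposed, and too fine (more than 10) means it was shredded.

Rule-of-thumb ranges: 1-3 subtasks for simple tasks, 3-6 for medium tasks, 5-10 for complex tasks.

Subtasks | Verdict | Reason
1 | Too coarse | No decomposition, which amounts to no planning
2 | Somewhat coarse | Under-decomposed; steps may be missing
3-6 | Reasonable (medium task) | Appropriate granularity
7-10 | Reasonable (complex task) | Suits complex tasks
>10 | Too fine | Over-decomposed; inefficient to execute

Scoring: the target range depends on task complexity. A count inside the range (3-6 for a medium task, 5-10 for a complex task) earns the full 20 points. Each subtask short of the range costs 5 points and each subtask over it costs 3 points, down to a floor of 0.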
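A minimal sketch of this rule, assuming the task's complexity label is known before scoring. The `RANGES` table and the `score_task_count` name are illustrative; the reference code later in the article hard-codes its thresholds instead.

```python
# Hypothetical helper: score the subtask count against a complexity-specific range.
RANGES = {"simple": (1, 3), "medium": (3, 6), "complex": (5, 10)}


def score_task_count(n: int, complexity: str = "medium") -> float:
    low, high = RANGES[complexity]
    if n < low:
        penalty = (low - n) * 5   # 5 points per missing subtask
    elif n > high:
        penalty = (n - high) * 3  # 3 points per extra subtask
    else:
        penalty = 0
    return float(max(0, 20 - penalty))


print(score_task_count(4))              # 20.0: inside the medium range
print(score_task_count(2))              # 15.0: one short of the medium range
print(score_task_count(15, "complex"))  # 5.0: five over the complex range
```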

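Dimension 2: Dependency accuracy

Dependencies reflect how well the agent understands the logic of the task. If B can only start after A finishes, that order must not be wrong.

There are four kinds of dependency error:

  1. Missing dependency: task_2 needs the result of task_1, but its depends_on is empty
  2. Over-dependency: task_3 has nothing to do with task_1, but its depends_on includes task_1
  3. Dependency cycle: task_1 depends on task_2 and task_2 depends on task_1 (deadlock)
  4. Redundant dependency: hard and soft dependencies are not distinguished, and overly conservative dependencies limit concurrent execution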
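To make the concurrency cost of error type 4 concrete, here is a small sketch that groups a plan into "waves" of tasks that could run in parallel. It uses the standard-library `graphlib` module (Python 3.9+); the `execution_waves` and `has_cycle` helpers and the task names are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(deps):
    """Group tasks into waves that can run concurrently.

    deps maps each task id to the list of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the plan contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def has_cycle(deps):
    try:
        execution_waves(deps)
        return False
    except CycleError:
        return True


lean = {"fetch_a": [], "fetch_b": [], "merge": ["fetch_a", "fetch_b"]}
# Same plan, but fetch_b needlessly waits for fetch_a.
conservative = {"fetch_a": [], "fetch_b": ["fetch_a"], "merge": ["fetch_a", "fetch_b"]}

print(execution_waves(lean))          # [['fetch_a', 'fetch_b'], ['merge']]
print(execution_waves(conservative))  # [['fetch_a'], ['fetch_b'], ['merge']]
print(has_cycle({"task_1": ["task_2"], "task_2": ["task_1"]}))  # True
```

One unnecessary edge turns two waves into three, which is the kind of slowdown the redundant-dependency penalty is meant to capture.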

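Scoring: dependency accuracy × 30 points, minus a penalty for redundant dependencies.

Dependency accuracy = number of correct dependencies / total number of dependencies
Final score = dependency accuracy × 30 - redundant-dependency penalty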
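As a worked example of the formula, the sketch below assumes a penalty of 2 points per redundant dependency (the value given in the scoring table later in this article) and assumes the counts of correct and redundant dependencies have already been determined. The function name is illustrative.

```python
def dependency_score(total_deps: int, correct_deps: int, redundant_deps: int,
                     penalty_per_redundant: float = 2.0) -> float:
    """Dependency accuracy x 30, minus the redundant-dependency penalty, floored at 0."""
    if total_deps == 0:
        return 30.0  # nothing declared, nothing to get wrong
    accuracy_points = correct_deps * 30.0 / total_deps
    return max(0.0, accuracy_points - redundant_deps * penalty_per_redundant)


# 5 declared dependencies, 4 correct, 1 flagged as redundant:
# 4/5 * 30 - 1 * 2 = 22
print(dependency_score(5, 4, 1))  # 22.0
```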

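Dimension 3: Tool selection accuracy

Each subtask has to pick one tool. Pick the right one and execution goes smoothly; pick the wrong one and it fails.

There are two kinds of tool-selection error:

  1. Wrong tool: using calculator for text processing
  2. Nonexistent tool: using "web_search", which is not in the registry

Scoring: tool accuracy × 25 points. Subtasks whose tool is "none" are excluded from the denominator.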
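The two error types need different checks: a nonexistent tool can be caught against the registry alone, while a wrong-but-real tool can only be caught against an expected plan. The sketch below assumes an optional `expected` mapping from task id to tool; that mapping and the `score_tools` name are illustrative.

```python
from typing import Dict, Optional

# Registry taken from the article's VALID_TOOLS, without the "none" placeholder.
REGISTRY = {"calculator", "code_executor", "memory_store",
            "web_fetch", "safety_checker", "search"}


def score_tools(chosen: Dict[str, str],
                expected: Optional[Dict[str, str]] = None) -> float:
    """Tool accuracy x 25; subtasks with tool "none" are not graded."""
    graded = {tid: tool for tid, tool in chosen.items() if tool != "none"}
    if not graded:
        return 25.0
    correct = 0
    for tid, tool in graded.items():
        if tool not in REGISTRY:
            continue  # error type 2: the tool does not exist
        if expected is not None and expected.get(tid, tool) != tool:
            continue  # error type 1: a real tool, but the wrong one
        correct += 1
    return correct * 25.0 / len(graded)


chosen = {"t1": "calculator", "t2": "web_search", "t3": "none", "t4": "calculator"}
print(round(score_tools(chosen), 2))                             # 16.67
print(round(score_tools(chosen, {"t4": "code_executor"}), 2))    # 8.33
```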

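Dimension 4: Execution completion rate and failure attribution

However good the plan, it is worth nothing if it cannot be executed. The completion rate reflects the plan's feasibility.

Scoring: completion rate × 25 points, with failures separated by cause:

  • Planning error: the decomposition itself is flawed (for example, a circular dependency)
  • Tool error: a tool call failed (network, parameters, and so on)
  • Timeout: a long-running step exceeded its time limit

Completion rate = subtasks completed successfully / total subtasks
Final score = completion rate × 25 - attribution deductions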
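A sketch of this formula follows. The article does not say how large the attribution deductions are, so the values in `ATTRIBUTION_PENALTY` are placeholders, and the `failure_cause` field on each subtask is a hypothetical addition to the subtask schema.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Placeholder deductions per failed subtask, by cause (not specified in the article).
ATTRIBUTION_PENALTY = {"planning": 3.0, "tool": 1.0, "timeout": 1.0}


def score_completion(subtasks: List[Dict]) -> Tuple[float, Counter]:
    """Return (completion rate x 25 minus deductions, failure counts by cause)."""
    total = len(subtasks)
    if total == 0:
        return 0.0, Counter()
    succeeded = sum(1 for s in subtasks if s.get("status") == "success")
    causes = Counter(
        s.get("failure_cause", "unknown")
        for s in subtasks
        if s.get("status") == "failed"
    )
    deduction = sum(ATTRIBUTION_PENALTY.get(c, 0.0) * k for c, k in causes.items())
    return max(0.0, succeeded * 25.0 / total - deduction), causes


plan = [
    {"id": "task_1", "status": "success"},
    {"id": "task_2", "status": "success"},
    {"id": "task_3", "status": "failed", "failure_cause": "planning"},
    {"id": "task_4", "status": "failed", "failure_cause": "timeout"},
]
score, causes = score_completion(plan)
print(score, dict(causes))  # 8.5 {'planning': 1, 'timeout': 1}
```

Weighting planning errors more heavily than tool errors or timeouts is a design choice: only the first kind is the planner's fault.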

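Duplicate subtask detection

Duplicate subtasks are an important indicator of planning quality. They hurt execution efficiency, and they also reveal how shallowly the agent understood the task.

Detection strategies:

  1. Description similarity: compare task descriptions using embeddings or edit distance
  2. Parameter consistency: check whether calls to the same tool carry exactly the same parameters
  3. Distribution pattern: consecutive duplicates and scattered duplicates affect execution differently

Scoring rule: deduct 3-5 points for each group of duplicate subtasks found.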
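Strategy 2 can be sketched by grouping subtasks on a canonical (tool, parameters) key. The `params` field is an assumption here; the subtask records in the reference code later in the article carry only a description and a tool.

```python
import json
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(subtasks: List[Dict]) -> List[List[str]]:
    """Group subtask ids that call the same tool with identical parameters."""
    groups = defaultdict(list)
    for s in subtasks:
        # sort_keys makes the key independent of parameter order
        key = (s.get("tool"), json.dumps(s.get("params", {}), sort_keys=True))
        groups[key].append(s["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


subtasks = [
    {"id": "task_1", "tool": "search", "params": {"table": "orders", "week": "last"}},
    {"id": "task_2", "tool": "search", "params": {"week": "last", "table": "orders"}},
    {"id": "task_3", "tool": "search", "params": {"table": "refunds", "week": "last"}},
]
print(find_exact_duplicates(subtasks))  # [['task_1', 'task_2']]
```

Exact matching like this complements description similarity: it has no threshold to tune, but it misses duplicates whose parameters differ only cosmetically.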

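Scoring rubric

The four dimensions are not weighted equally. Dependency accuracy carries the most weight (30%), because once a dependency is wrong everything downstream is wrong. Tool selection and execution completion each carry 25%, and subtask count carries the least (20%).

Metric | Weight | Full-score condition | Deduction rule
Subtask count | 20% | Within the range for the task's complexity (simple 1-3, medium 3-6, complex 5-10) | 5 points per subtask short, 3 points per subtask over
Dependency accuracy | 30% | 100% of dependencies correct, none redundant | 10 points per wrong dependency, 2 points per redundant dependency
Tool selection | 25% | 100% of tools correct | 8 points per wrong tool; "none" is not counted
Execution completion | 25% | 100% completion rate | 5 points per 10% shortfall, with deductions classified by failure cause

The per-item deductions in this table are a shorthand for manual scoring. The reference code later in the article scores dependency accuracy, tool selection, and completion proportionally, following the formulas above.

Total = subtask count score + dependency accuracy score + tool selection score + execution completion score

The maximum is 100 points. A score of 80 or above is excellent, 60-79 is qualified, and below 60 is unqualified.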

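Failure mode library

Planning failures fall into 12 common modes. Only once a failure mode is identified can it be fixed in a targeted way.

# | Failure mode | Symptom | Root cause | Fix
1 | Runaway subtask count | More than 10 subtasks | Prompt does not cap the count | State "3-8 subtasks" explicitly in the prompt
2 | Too few subtasks | 1-2 subtasks | Decomposition too coarse | Give examples in the prompt
3 | Dependency cycle | A→B→A | LLM logic error | Cycle detection plus an error
4 | Missing dependency | task_2 needs task_1 but declares no dependency | LLM did not recognize the dependency | Emphasize dependency checking in the prompt
5 | Over-dependency | Every task depends on task_1 | LLM is overly cautious | State in the prompt that tasks with no dependency should use []
6 | Wrong tool | calculator used for text processing | Unclear tool descriptions | Improve the tool descriptions
7 | Nonexistent tool | Uses "web_search" | LLM hallucination | List the available tools in the prompt
8 | Isolated task | No dependencies and no dependents | LLM generated a useless task | Isolated-task detection plus filtering
9 | Duplicate subtasks | Several subtasks with the same description | LLM did not deduplicate | Duplicate detection with similarity scoring
10 | Plan parsing failure | Malformed JSON | Unstable LLM output format | Tolerant JSON parsing plus a fallback
11 | Pseudo-planning | Looks like 8 steps, but 6 of them are retry/fallback | Agent is overly conservative | Strengthen identification of core tasks
12 | Redundant dependencies | Every task declares dependencies, limiting concurrency | Agent is overly cautious | Distinguish hard from soft dependencies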
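Failure mode 10 is the one that stops scoring before it starts, so it is worth a concrete illustration. The following is a minimal sketch of tolerant parsing, assuming the plan is expected as a JSON array or object and that the model may wrap it in extra prose. `parse_plan` is a hypothetical helper, and the regex fallback is deliberately crude.

```python
import json
import re
from typing import Any, Optional


def parse_plan(raw: str) -> Optional[Any]:
    """Parse a plan from LLM output; return None if nothing parseable is found."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: take the outermost [...] or {...} span and try again.
    match = re.search(r"\[.*\]|\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None


noisy = 'Sure, here is the plan: [{"id": "task_1", "tool": "calculator"}] Hope this helps.'
print(parse_plan(noisy))         # [{'id': 'task_1', 'tool': 'calculator'}]
print(parse_plan("no plan here"))  # None
```

Returning None rather than raising lets the caller record "plan parsing failure" as a failure mode and move on to the next test case.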

Code: planning-quality scoring and dependency-graph validation

#!/usr/bin/env python3
"""
Task Planning Quality Scoring

Scoring dimensions:
1. Task count reasonableness (20 points)
2. Dependency accuracy (30 points)
3. Tool selection correctness (25 points)
4. Execution completion rate (25 points)

Additional features:
- Dependency graph validation (cycle detection, isolated task detection)
- Duplicate task detection
- Failure mode identification
"""

from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
import difflib


@dataclass
class PlanningScore:
    """Planning scoring result"""
    total: float
    task_count_score: float
    dependency_score: float
    tool_selection_score: float
    completion_score: float
    details: Dict
    failure_modes: List[str]


VALID_TOOLS = {
    "calculator", "code_executor", "memory_store",
    "web_fetch", "safety_checker", "none", "search",
}


def calculate_similarity(str1: str, str2: str) -> float:
    """
    Calculate similarity between two strings
    
    Args:
        str1: String 1
        str2: String 2
        
    Returns:
        Similarity (0-1)
    """
    return difflib.SequenceMatcher(None, str1, str2).ratio()


def detect_duplicate_tasks(subtasks: List[Dict]) -> List[Tuple[int, int]]:
    """
    Detect duplicate subtasks
    
    Args:
        subtasks: List of subtasks
        
    Returns:
        List of index pairs for duplicate tasks
    """
    duplicates = []
    for i in range(len(subtasks)):
        for j in range(i + 1, len(subtasks)):
            desc1 = subtasks[i].get("description", "")
            desc2 = subtasks[j].get("description", "")
            similarity = calculate_similarity(desc1, desc2)
            
            # If description similarity is greater than threshold, consider it duplicate
            if similarity > 0.8:
                # Also check if tools and parameters are the same
                tool1 = subtasks[i].get("tool", "")
                tool2 = subtasks[j].get("tool", "")
                
                if tool1 == tool2:
                    duplicates.append((i, j))
    
    return duplicates


def score_task_planning(result: Dict, expected: Optional[Dict] = None) -> PlanningScore:
    """
    Score task planning

    Args:
        result: Agent execution result (with _meta)
        expected: Expected planning (optional, for comparison)

    Returns:
        PlanningScore
    """
    meta = result.get("_meta", {})
    subtasks = meta.get("subtasks", [])
    details = {}
    failure_modes = []

    # ========== Dimension 1: Task count reasonableness (20 points) ==========
    n = len(subtasks)
    
    # Determine reasonable range based on task complexity
    # Here we assume we can determine complexity through certain features, adjust according to specific scenarios in practice
    if n == 0:
        task_count_score = 0.0
        details["task_count"] = "0 (no planning)"
        failure_modes.append("Too few subtasks")
    elif n == 1:
        task_count_score = 15.0
        details["task_count"] = f"{n} (reasonable for simple task)"
    elif n == 2:
        task_count_score = 17.0
        details["task_count"] = f"{n} (reasonable for simple task)"
    elif 3 <= n <= 6:
        task_count_score = 20.0
        details["task_count"] = f"{n} (reasonable for medium task)"
    elif 7 <= n <= 10:
        task_count_score = 20.0
        details["task_count"] = f"{n} (reasonable for complex task)"
    elif n <= 15:
        task_count_score = max(0, 20 - (n - 10) * 2)
        details["task_count"] = f"{n} (too fine)"
    else:
        task_count_score = max(0, 20 - (n - 10) * 3)
        details["task_count"] = f"{n} (too fine)"
        failure_modes.append("Excessive subtask count")

    # ========== Dimension 2: Dependency accuracy (30 points) ==========
    task_ids = {s["id"] for s in subtasks}
    deps_correct = 0
    deps_total = 0
    redundant_deps = 0  # Count of redundant dependencies
    dep_errors = []

    for s in subtasks:
        for dep in s.get("depends_on", []):
            deps_total += 1
            if dep in task_ids:
                deps_correct += 1
            else:
                dep_errors.append(f"{s['id']} depends on non-existent {dep}")

    # Cycle detection
    has_cycle = detect_cycle(subtasks)
    if has_cycle:
        failure_modes.append("Dependency cycle")
        dep_errors.append("Cycle detected")

    # Isolated task detection
    isolated = detect_isolated_tasks(subtasks)
    if isolated:
        failure_modes.append(f"Isolated tasks: {', '.join(isolated)}")

    # Redundant dependency detection (over-dependence heuristic):
    # if a single task is a prerequisite of more than 70% of all subtasks,
    # treat the edges pointing at it as suspected redundant dependencies.
    # Computed once for the whole plan rather than once per subtask.
    from collections import Counter
    if len(task_ids) > 3:
        dep_counts = Counter(
            dep for t in subtasks for dep in t.get("depends_on", [])
        )
        if dep_counts:
            hub_count = max(dep_counts.values())
            if hub_count > len(subtasks) * 0.7:
                redundant_deps = hub_count
                failure_modes.append("Over-dependence")

    if deps_total > 0:
        dependency_score = (deps_correct / deps_total) * 30.0
        # Deduct penalty for redundant dependencies
        dependency_score = max(0, dependency_score - (redundant_deps * 2))
        details["dependency"] = f"{deps_correct}/{deps_total} correct, {redundant_deps} redundant"
    else:
        dependency_score = 30.0
        details["dependency"] = "No dependencies"

    if dep_errors:
        details["dep_errors"] = dep_errors

    # ========== Dimension 3: Tool selection correctness (25 points) ==========
    tools_correct = 0
    tools_total = 0
    tool_errors = []

    for s in subtasks:
        tool = s.get("tool", "")
        if tool and tool != "none":  # "none" is not counted in statistics
            tools_total += 1
            if tool in VALID_TOOLS:
                tools_correct += 1
            else:
                tool_errors.append(f"{s['id']} uses unknown tool '{tool}'")
                failure_modes.append("Wrong tool selection")

    if tools_total > 0:
        tool_selection_score = (tools_correct / tools_total) * 25.0
        details["tool_selection"] = f"{tools_correct}/{tools_total} correct"
    else:
        tool_selection_score = 25.0
        details["tool_selection"] = "No tool calls"

    if tool_errors:
        details["tool_errors"] = tool_errors

    # ========== Dimension 4: Execution completion and failure attribution (25 points) ==========
    success_count = meta.get("subtasks_success", 0)
    total_count = meta.get("subtasks_total", len(subtasks))
    
    # Detect duplicate tasks and calculate penalty
    duplicate_pairs = detect_duplicate_tasks(subtasks)
    duplicate_penalty = len(duplicate_pairs) * 3  # Deduct 3 points per duplicate pair

    if total_count > 0:
        completion_rate = success_count / total_count
        completion_score = completion_rate * 25.0
        # Deduct duplicate task penalty
        completion_score = max(0, completion_score - duplicate_penalty)
        
        details["completion"] = f"{success_count}/{total_count} ({completion_rate:.0%})"
        details["duplicate_penalty"] = f"Duplicate task penalty: {duplicate_penalty} points"
    else:
        completion_score = 0.0
        details["completion"] = "No subtasks"

    # ========== Total score ==========
    total = task_count_score + dependency_score + tool_selection_score + completion_score
    total = min(total, 100.0)

    # Detect duplicate tasks
    if duplicate_pairs:
        failure_modes.append(f"Duplicate tasks: {len(duplicate_pairs)} pairs")
        
    # Detect pseudo planning (too many retry/fallback steps)
    retry_tasks = [s for s in subtasks if "retry" in s.get("description", "").lower() or 
                   "fallback" in s.get("description", "").lower() or 
                   "rephrase" in s.get("description", "").lower()]
    if len(retry_tasks) > len(subtasks) * 0.5:  # If more than half are retry steps
        failure_modes.append("Pseudo planning")
    
    # Deduplicate failure modes
    failure_modes = list(set(failure_modes))

    return PlanningScore(
        total=total,
        task_count_score=task_count_score,
        dependency_score=dependency_score,
        tool_selection_score=tool_selection_score,
        completion_score=completion_score,
        details=details,
        failure_modes=failure_modes,
    )


def detect_cycle(subtasks: List[Dict]) -> bool:
    """
    Detect dependency cycles

    Args:
        subtasks: List of subtasks

    Returns:
        Whether there is a cycle
    """
    task_map = {s["id"]: s for s in subtasks}
    visited = set()
    rec_stack = set()

    def dfs(task_id):
        visited.add(task_id)
        rec_stack.add(task_id)

        task = task_map.get(task_id)
        if task:
            for dep in task.get("depends_on", []):
                if dep not in visited:
                    if dfs(dep):
                        return True
                elif dep in rec_stack:
                    return True

        rec_stack.discard(task_id)
        return False

    for s in subtasks:
        if s["id"] not in visited:
            if dfs(s["id"]):
                return True

    return False


def detect_isolated_tasks(subtasks: List[Dict]) -> List[str]:
    """
    Detect isolated tasks (no dependencies and no dependents)

    Args:
        subtasks: List of subtasks

    Returns:
        List of isolated task IDs
    """
    task_ids = {s["id"] for s in subtasks}
    depended_on = set()

    for s in subtasks:
        for dep in s.get("depends_on", []):
            depended_on.add(dep)

    isolated = []
    for s in subtasks:
        if not s.get("depends_on") and s["id"] not in depended_on:
            isolated.append(s["id"])

    return isolated


def print_score_report(score: PlanningScore):
    """Print scoring report"""
    print(f'''
{'='*60}
Planning Quality Scoring Report
{'='*60}
''')

    # Score bar
    def bar(value, max_value=100):
        filled = int(value / max_value * 20)
        return "█" * filled + "░" * (20 - filled)

    print(f'''
  Task count:     {score.task_count_score:5.1f}/20  {bar(score.task_count_score, 20)}
  Dependency:     {score.dependency_score:5.1f}/30  {bar(score.dependency_score, 30)}
  Tool selection: {score.tool_selection_score:5.1f}/25  {bar(score.tool_selection_score, 25)}
  Execution:      {score.completion_score:5.1f}/25  {bar(score.completion_score, 25)}
  {'─'*40}
  Total:          {score.total:5.1f}/100  {bar(score.total)}

''')

    # Rating
    if score.total >= 80:
        grade = "Excellent"
    elif score.total >= 60:
        grade = "Qualified"
    else:
        grade = "Unqualified"
    print(f"  Rating: {grade}")

    # Details
    print(f'''
  Details:
''')
    for key, value in score.details.items():
        print(f"    {key}: {value}")

    # Failure modes
    if score.failure_modes:
        print(f'''
  ⚠  Failure modes: {', '.join(score.failure_modes)}
''')

    print(f"{'='*60}")


def run_demo():
    """Demo"""
    print("=" * 60)
    print("Task Planning Quality Scoring Demo")
    print("=" * 60)

    # Test case 1: Excellent planning
    result_good = {
        "success": True,
        "output": "Result is 120, stored",
        "_meta": {
            "subtasks_total": 3,
            "subtasks_success": 3,
            "subtasks_failed": 0,
            "subtasks": [
                {"id": "task_1", "description": "Calculate", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_2", "description": "Store", "tool": "memory_store", "status": "success", "depends_on": ["task_1"], "retry_count": 0},
                {"id": "task_3", "description": "Confirm", "tool": "none", "status": "success", "depends_on": ["task_2"], "retry_count": 0},
            ],
        },
    }

    # Test case 2: Problematic planning
    result_bad = {
        "success": True,
        "output": "Execution completed",
        "_meta": {
            "subtasks_total": 12,
            "subtasks_success": 8,
            "subtasks_failed": 4,
            "subtasks": [
                {"id": "task_1", "description": "Calculate", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_2", "description": "Calculate (duplicate)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_3", "description": "Store", "tool": "memory_store", "status": "success", "depends_on": ["task_1"], "retry_count": 0},
                {"id": "task_4", "description": "Unknown tool", "tool": "web_search", "status": "failed", "depends_on": ["task_99"], "retry_count": 3},
                {"id": "task_5", "description": "Calculate (again duplicate)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_6", "description": "Store (duplicate)", "tool": "memory_store", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_7", "description": "Calculate (another duplicate)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_8", "description": "Confirm", "tool": "none", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_9", "description": "Isolated task", "tool": "none", "status": "pending", "depends_on": [], "retry_count": 0},
                {"id": "task_10", "description": "Calculate (5th time)", "tool": "calculator", "status": "success", "depends_on": [], "retry_count": 0},
                {"id": "task_11", "description": "Store (3rd time)", "tool": "memory_store", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_12", "description": "Confirm (duplicate)", "tool": "none", "status": "success", "depends_on": ["task_8"], "retry_count": 0},
            ],
        },
    }

    # Test case 3: Pseudo planning
    result_pseudo = {
        "success": True,
        "output": "Execution completed",
        "_meta": {
            "subtasks_total": 8,
            "subtasks_success": 5,
            "subtasks_failed": 3,
            "subtasks": [
                {"id": "task_1", "description": "Main task", "tool": "calculator", "status": "failed", "depends_on": [], "retry_count": 0},
                {"id": "task_2", "description": "Retry main task", "tool": "calculator", "status": "failed", "depends_on": ["task_1"], "retry_count": 0},
                {"id": "task_3", "description": "Retry again", "tool": "calculator", "status": "failed", "depends_on": ["task_2"], "retry_count": 0},
                {"id": "task_4", "description": "Fallback solution", "tool": "code_executor", "status": "success", "depends_on": ["task_3"], "retry_count": 0},
                {"id": "task_5", "description": "Verify result", "tool": "none", "status": "success", "depends_on": ["task_4"], "retry_count": 0},
                {"id": "task_6", "description": "Retry previously failed", "tool": "calculator", "status": "success", "depends_on": ["task_5"], "retry_count": 0},
                {"id": "task_7", "description": "Rephrase and retry", "tool": "none", "status": "failed", "depends_on": ["task_6"], "retry_count": 0},
                {"id": "task_8", "description": "Final confirmation", "tool": "none", "status": "success", "depends_on": ["task_7"], "retry_count": 0},
            ],
        },
    }

    print("\n--- Test case 1: Excellent planning ---")
    score1 = score_task_planning(result_good)
    print_score_report(score1)

    print("\n--- Test case 2: Problematic planning ---")
    score2 = score_task_planning(result_bad)
    print_score_report(score2)

    print("\n--- Test case 3: Pseudo planning ---")
    score3 = score_task_planning(result_pseudo)
    print_score_report(score3)

    # Comparison
    print("=" * 60)
    print("Comparison Summary")
    print("=" * 60)
    print(f'{"Metric":20s} {"Excellent":>10s} {"Problematic":>10s} {"Pseudo":>10s}')
    print("-" * 60)
    print(f'{"Total score":20s} {score1.total:10.1f} {score2.total:10.1f} {score3.total:10.1f}')
    print(f'{"Task count":20s} {score1.details["task_count"]:>10s} {score2.details["task_count"]:>10s} {score3.details["task_count"]:>10s}')
    print(f'{"Dependency":20s} {score1.details["dependency"]:>10s} {score2.details["dependency"]:>10s} {score3.details["dependency"]:>10s}')
    print(f'{"Tool selection":20s} {score1.details["tool_selection"]:>10s} {score2.details["tool_selection"]:>10s} {score3.details["tool_selection"]:>10s}')
    print(f'{"Execution":20s} {score1.details["completion"]:>10s} {score2.details["completion"]:>10s} {score3.details["completion"]:>10s}')
    print(f'{"Failure modes":20s} {len(score1.failure_modes):10d} {len(score2.failure_modes):10d} {len(score3.failure_modes):10d}')
    print("=" * 60)


if __name__ == "__main__":
    run_demo()

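The comparison summary printed at the end of the run, with values as they follow from the scoring logic above:

============================================================
Comparison Summary
============================================================
Metric                Excellent Problematic     Pseudo
------------------------------------------------------------
Total score               100.0       70.9       90.6
Task count           3 (reasonable for medium task) 12 (too fine) 8 (reasonable for complex task)
Dependency           2/2 correct, 0 redundant 5/6 correct, 0 redundant 7/7 correct, 0 redundant
Tool selection       2/2 correct 8/9 correct 5/5 correct
Execution            3/3 (100%) 8/12 (67%)  5/8 (62%)
Failure modes                 0          3          1
============================================================

The pseudo plan still scores 90.6 because this implementation only reports pseudo-planning as a failure mode; it carries no point deduction. Read the failure-mode list alongside the total score rather than relying on the score alone.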

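Data: planning quality vs. execution completion rate

I scored the planning quality of 50 tasks and split them into three groups by total score:

Planning quality | Tasks | Avg. subtask completion rate | Avg. token usage | Avg. elapsed time
Excellent (≥80) | 5 | 94% | 2,996 | 32.7s
Qualified (60-79) | 3 | 78% | 6,405 | 79.4s
Unqualified (<60) | 2 | 52% | 16,775 | 186.8s

Note: the higher the planning quality, the higher the subtask completion rate should be overall. If you see a high score paired with a low completion rate, first suspect inconsistent metadata (for example, subtasks_success out of sync with the statuses in subtasks), or a test case that took a path where the output looks successful but many subtasks were skipped. The table shows the typical pattern when one batch of tasks is grouped by planning score, and it matches the 94% / 52% figures in the summary below.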
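The metadata check mentioned in the note can be automated. The sketch below compares the summary counters in `_meta` against the per-subtask statuses, using the same field names as the scoring code; `check_meta_consistency` itself is a hypothetical helper.

```python
from typing import Dict, List


def check_meta_consistency(meta: Dict) -> List[str]:
    """Report summary counters that disagree with the per-subtask statuses."""
    subtasks = meta.get("subtasks", [])
    statuses = [s.get("status") for s in subtasks]
    checks = [
        ("subtasks_total", len(subtasks)),
        ("subtasks_success", statuses.count("success")),
        ("subtasks_failed", statuses.count("failed")),
    ]
    issues = []
    for key, actual in checks:
        if key in meta and meta[key] != actual:
            issues.append(f"{key}={meta[key]} but statuses give {actual}")
    return issues


meta = {
    "subtasks_total": 3,
    "subtasks_success": 3,
    "subtasks_failed": 0,
    "subtasks": [
        {"id": "task_1", "status": "success"},
        {"id": "task_2", "status": "failed"},
        {"id": "task_3", "status": "pending"},
    ],
}
for issue in check_meta_consistency(meta):
    print(issue)
# subtasks_success=3 but statuses give 1
# subtasks_failed=0 but statuses give 1
```

Running a check like this before scoring keeps a stale counter from inflating the completion score.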

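Deliverables

1. Planning-quality scoring rubric

Metric | Weight | Full-score condition | Deduction rule | Floor
Subtask count | 20 points | Within the range for the task's complexity (simple 1-3, medium 3-6, complex 5-10) | 5 points per subtask short, 3 points per subtask over | 0
Dependency accuracy | 30 points | 100% correct, none redundant | 10 points per wrong dependency, 2 points per redundant dependency | 0
Tool selection | 25 points | 100% correct ("none" is not counted) | 8 points per wrong tool | 0
Execution completion | 25 points | 100% completion rate | 5 points per 10% shortfall, plus extra deductions for duplicate tasks | 0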

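2. Failure mode library (12 modes)

# | Failure mode | Detection | Severity
1 | Runaway subtask count | n > 10 | Medium
2 | Too few subtasks | n < 3 | Low
3 | Dependency cycle | DFS cycle detection | High
4 | Missing dependency | Dependency target not in task_ids | High
5 | Over-dependency | All tasks depend on the same task | Medium
6 | Wrong tool | Tool not in VALID_TOOLS | High
7 | Nonexistent tool | Same as above | High
8 | Isolated task | No dependencies and no dependents | Low
9 | Duplicate subtasks | Description similarity > 80% | Medium
10 | Plan parsing failure | JSON parsing exception | Critical
11 | Pseudo-planning | More than 50% of steps are retry/fallback/rephrase | High
12 | Redundant dependencies | Overly conservative dependencies | Medium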

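3. Test case set (20 tasks)

ID | Task | Expected subtasks | Expected tools | Difficulty
P-01 | Compute 25*4+100/5 | 1 | calculator | Simple
P-02 | Compute and store the result | 2 | calculator, memory_store | Simple
P-03 | Compute + store + confirm | 3 | calculator, memory_store, none | Simple
P-04 | Compute Fibonacci with Python | 2 | code_executor | Medium
P-05 | Analyze data and generate a report | 4 | code_executor, calculator, none | Medium
P-06 | Analysis + report + email | 5 | code_executor, calculator, none, web_fetch | Medium
P-07 | Multi-step data analysis | 6 | code_executor, calculator, memory_store, none | Complex
P-08 | Competitor analysis (3 competitors) | 5 | web_fetch, code_executor, none | Complex
P-09 | Automated test generation | 4 | code_executor, none | Medium
P-10 | Content creation (search + outline + writing) | 4 | web_fetch, none | Medium

(The full set of 20 cases is omitted here; the rest follow the same format.)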
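One way to put these cases to work is to encode each row and compare it with the plan the agent produced. This is a sketch under assumptions: the `PlanningCase` class, the `check_case` helper, and the tolerance of one subtask are all illustrative choices rather than part of the original test suite.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class PlanningCase:
    case_id: str
    task: str
    expected_count: int
    expected_tools: Set[str]


def check_case(case: PlanningCase, subtasks: List[Dict], tolerance: int = 1) -> Dict:
    """Compare an actual plan with the expected subtask count and tool set."""
    used = {s["tool"] for s in subtasks}
    return {
        "count_ok": abs(len(subtasks) - case.expected_count) <= tolerance,
        "missing_tools": sorted(case.expected_tools - used),
        "unexpected_tools": sorted(used - case.expected_tools),
    }


p02 = PlanningCase("P-02", "Compute and store the result", 2,
                   {"calculator", "memory_store"})
plan = [
    {"id": "task_1", "tool": "calculator"},
    {"id": "task_2", "tool": "web_search"},
]
print(check_case(p02, plan))
# {'count_ok': True, 'missing_tools': ['memory_store'], 'unexpected_tools': ['web_search']}
```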

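Summary

Task planning is an agent's most central capability. Planning quality determines execution efficiency: a sensible decomposition executes smoothly, while a poor one wastes resources and fails more easily.

Four dimensions quantify planning quality: subtask count (20 points), dependency accuracy (30 points), tool selection (25 points), and execution completion (25 points). Duplicate-subtask detection and failure attribution make the evaluation more complete.

Tasks with high planning quality reached a 94% completion rate; tasks with low planning quality reached only 52%. A drop of about 30 points in planning score goes with the completion rate falling by nearly half. Planning is not optional; it decides success or failure.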


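Interview questions

Q1: Which dimensions do you focus on when testing task planning?

A: Four core dimensions: 1) Decomposition quality: can a complex task be split into sensible subtasks (too coarse means under-planning, too fine means wasted work); 2) Dependency recognition: are the dependencies between subtasks correct (B depends on the result of A); 3) Tool matching: does each subtask pick the right tool; 4) Exception handling: after a tool fails, does the agent re-plan instead of giving up. Beyond these, I also check for duplicate subtasks and pseudo-planning.

Q2: What is the scoring standard for planning ability, and how do you quantify it?

A: A 5-point scale: 1 = cannot plan at all; 2 = can decompose but the order is wrong; 3 = correct decomposition but unsuitable granularity; 4 = sensible plan plus correct tool selection; 5 = optimal plan plus graceful error recovery. Each item is scored by comparing the actual output with the rubric. The expected range is also adjusted for task complexity: 1-3 subtasks for simple tasks, 3-6 for medium tasks, 5-10 for complex tasks.

Q3: What is a common planning failure?

A: Two cases come up often. 1) Oversimplification: the agent splits a complex task into 2-3 vague subtasks, each of which is still complex. For example, "analyze sales data" becomes just "read data" and "generate report", with no data cleaning, anomaly detection, or chart generation in between. 2) Pseudo-planning: the plan looks like 8 steps, but 6 of them are retry/fallback/rephrase and only 2 actually advance the task. Such a plan looks detailed, yet most of its steps are retry logic for handling failure rather than real task decomposition.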
