LLM Agents (49) | Building a Practical Sub-Agent Team Engineering System with Claude Code


wshzd · Published 2026-05-07 18:26:16

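🎯 Opening: Why Does Your AI Assistant Keep Going Off the Rails?

Have you noticed that when you ask an AI to write code, it often:

• Skips review and commits code it is "very satisfied" with
• Gets clever and over-engineers a simple requirement
• Saves effort on the small things, then drops the ball where it matters
• Never verifies: "the tests passed" turns out to be made up

The difference between a junior and a senior engineer has never been fluency with syntax. It is predictability, risk management, and discipline under pressure.

Anthropic's research suggests that agent collaboration can work very well, but without structure, adding more agents only adds more confusion and waste.

So what is the real solution?

Do what a real tech company does: have a Staff Engineer lead a team of specialized sub-agents through a rigorous process that runs from design to deployment.

This article walks you through building that system, step by step.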

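📐 1. Project Architecture: Build the Organization First, Then Write Code

At companies like Spotify, Shopify, and Stripe, the first thing a Staff Engineer does on a new project is never to write feature code. It is to set up the infrastructure.

The directory structure of our system mirrors the org chart of a real engineering team:

senior_staff_engineer/
├── agents/              # Team member definitions (agents and sub-agents)
├── commands/            # Custom CLI-style commands
├── hooks/               # Management layer: lifecycle hooks
├── skills/              # Employee handbook: core capability modules
│   ├── brainstorming/           # Brainstorming and design
│   ├── dispatching-parallel-agents/  # Multi-agent coordination
│   ├── executing-plans/         # Step-by-step plan execution
│   ├── finishing-a-development-branch/  # Wrapping up a branch
│   ├── receiving-code-review/   # Receiving code review
│   ├── requesting-code-review/  # Requesting code review
│   ├── subagent-driven-development/     # Sub-agent development
│   ├── systematic-debugging/    # Systematic debugging
│   ├── test-driven-development/ # Test-driven development
│   ├── using-git-worktrees/     # Git worktree isolation
│   ├── using-senior-staff-engineer/     # System configuration
│   ├── verification-before-completion/  # Verification before completion
│   ├── writing-plans/           # Writing plans
│   └── writing-skills/          # Writing skill documents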

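💡 Key insight: agents/ is the team roster, skills/ is the employee handbook, and hooks/ is the management layer. Only when the three work together can multiple agents collaborate like a real team.

🔧 2. Hooks: Management's "Morning Stand-Up"

Every company has a management layer you never see: the office opens, the coffee machine runs, calendars sync, Slack channels update. Nobody notices it until it breaks.

In our system, hooks/ is that management layer.

hooks.json: main configuration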

{
  "hooks": {
    "SessionStart": [
      {
        "matcher": "startup|clear|compact",
        "hooks": [
          {
            "type": "command",
            "command": "\"${PROJECT_ROOT_DIR}/hooks/run-hook.cmd\" session-start",
            "async": false
          }
        ]
      }
    ]
  }
}

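The matcher covers three scenarios:

• startup: a new session starts
• clear: the context is reset
• compact: the context is compacted

async: false ensures the agent does not begin responding until the hook has finished. Think of it as "no work starts until the morning stand-up is over."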
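To make the matcher concrete, here is a minimal Python sketch of how a pattern such as "startup|clear|compact" could be tested against a session source. It assumes the matcher is treated as a regular expression that must match the whole source name; `hook_applies` is a hypothetical helper, and `resume` is used only as an example of a source the matcher does not list.

```python
import re

def hook_applies(matcher: str, source: str) -> bool:
    """Return True when the session source fully matches the matcher pattern."""
    return re.fullmatch(matcher, source) is not None

MATCHER = "startup|clear|compact"  # copied from hooks.json above

for source in ("startup", "clear", "compact", "resume"):
    print(f"{source}: {hook_applies(MATCHER, source)}")
```

Under this assumption the hook runs for the three listed sources and is skipped for anything else.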

run-hook.cmd: cross-platform gateway

if "%~1"=="" (
  echo run-hook.cmd: missing script name >&2
  exit /b 1
)

set "HOOK_DIR=%~dp0"

REM Try Git for Windows bash in standard locations
if exist "C:\Program Files\Git\bin\bash.exe" (
  "C:\Program Files\Git\bin\bash.exe" "%HOOK_DIR%%~1" %2 %3 %4 %5 %6 %7 %8 %9
  exit /b %ERRORLEVEL%
)

On the Unix side:

# Unix: run the named script directly
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SCRIPT_NAME="$1"
shift
exec bash "${SCRIPT_DIR}/${SCRIPT_NAME}" "$@"

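How it works: the agent does not need to know which operating system it is running on, because the gateway script handles the routing. This follows the same design philosophy as Docker and Kubernetes: hide the host environment behind one uniform entry point.

session-start: the daily reading of the "employee handbook"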

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PLUGIN_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"

# Legacy config check
warning_message=""
legacy_skills_dir="${HOME}/.config/staff_engineer/skills"

if [ -d "$legacy_skills_dir" ]; then
  warning_message="\n\n<important-reminder>⚠️ **WARNING:**
  staff_engineer now uses Claude Code's skills system.
  Move custom skills to ~/.claude/skills instead.</important-reminder>"
fi

# Load core skill
using_staff_engineer_content=$(
  cat "${PLUGIN_ROOT}/skills/using-staff_engineer/SKILL.md" 2>&1 \
  || echo "Error reading using-staff_engineer skill"
)

The content is JSON-escaped and then injected:

session_context="<EXTREMELY_IMPORTANT>\nYou have staff_engineer.
\n\n**Below is the full content of your 'staff_engineer:using-
staff_engineer' skill - your introduction to using skills. For
all other skills, use the 'Skill' tool:**\n\n
${using_staff_engineer_escaped}\n\n${warning_escaped}\n
</EXTREMELY_IMPORTANT>"

The script supports output formats for three platforms: Claude Code, Cursor, and Copilot CLI.
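The JSON-escaping step matters because skill files contain quotes and newlines that would otherwise break the hook's JSON output. The following is a minimal Python sketch of the idea, not the script's actual implementation; `escape_for_json`, `build_payload`, and the `additional_context` key are hypothetical names chosen for illustration.

```python
import json

def escape_for_json(text: str) -> str:
    """Escape text so it can sit inside a JSON string literal."""
    # json.dumps wraps the result in double quotes; strip them to keep only the body
    return json.dumps(text, ensure_ascii=False)[1:-1]

def build_payload(skill_text: str, warning: str = "") -> str:
    """Wrap the skill content in the marker tags and embed it in a JSON object."""
    context = (
        "<EXTREMELY_IMPORTANT>\n"
        + skill_text + "\n\n" + warning
        + "\n</EXTREMELY_IMPORTANT>"
    )
    # "additional_context" is a hypothetical key name, not a documented field
    return '{"additional_context": "' + escape_for_json(context) + '"}'

payload = build_payload('Use "skills"\nalways')
print(json.loads(payload)["additional_context"])
```

Parsing the payload back with `json.loads` is a quick way to confirm that the escaping preserved the original text.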

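📖 3. The Employee Handbook: The 1% Rule and Culture

Every great company has a founding document. Amazon's Leadership Principles, Netflix's culture deck, Bridgewater's Principles: none of them describe what the company does. They describe how everyone should think.

The Core 1% Rule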

<EXTREMELY-IMPORTANT>
If you think there is even a 1% chance a skill might apply to
what you are doing, you ABSOLUTELY MUST invoke the skill.

IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE.
YOU MUST USE IT.
This is not negotiable. This is not optional. You cannot
rationalize your way out of this.
</EXTREMELY-IMPORTANT>

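In plain terms: if there is even a 1% chance that a skill applies to your task, you must invoke it. This is not a suggestion. It is an iron rule.

Without this rule, agents treat skills as suggestions. They skip brainstorming because the task "looks simple," skip TDD because "it's only a config change," and skip code review because "it's just two lines." Every senior engineer has watched a team slowly abandon its process through exactly this kind of rationalization, and the result is always the same: quality drops, bugs ship, and everyone is left wondering what happened.

The Sub-Agent "No Check-In" Clause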

<SUBAGENT-STOP>
If you were dispatched as a subagent to execute a specific task,
skip this skill.
</SUBAGENT-STOP>

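The main agent runs the full process while sub-agents stay focused. It is like telling a contractor, "You don't need to attend the all-hands meeting. Concentrate on the job you were hired to do."

Instruction Priority (Chain of Command)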

## Instruction Priority

1. **User's explicit instructions** (CLAUDE.md, direct requests)
   - highest priority

2. **senior_tech_engineers skills** - override default system
   behavior where they conflict

3. **Default system prompt** - lowest priority

If CLAUDE.md says "don't use TDD" and a skill says "always use
TDD," follow the user's instructions. The user is in control.

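The human always has the final say. That is what separates a process-driven culture from a bureaucracy: the former follows processes because they produce better results and can override them when necessary, while the latter follows processes blindly even when the results are bad.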
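The chain of command can be read as a simple conflict-resolution rule. The sketch below is one way to express it in Python, under the assumption that each instruction is tagged with its source and a topic; the `Instruction` class, the `PRIORITY` table, and `resolve` are hypothetical names, not part of the system described here.

```python
from dataclasses import dataclass

# Lower number wins: user instructions beat skills, skills beat system defaults
PRIORITY = {"user": 0, "skill": 1, "default": 2}

@dataclass(frozen=True)
class Instruction:
    source: str      # "user", "skill", or "default"
    topic: str       # what the instruction is about, e.g. "tdd"
    directive: str   # what to do

def resolve(instructions):
    """Keep, for each topic, the directive from the highest-priority source."""
    winners = {}
    for ins in instructions:
        current = winners.get(ins.topic)
        if current is None or PRIORITY[ins.source] < PRIORITY[current.source]:
            winners[ins.topic] = ins
    return {topic: ins.directive for topic, ins in winners.items()}

decisions = resolve([
    Instruction("skill", "tdd", "always use TDD"),
    Instruction("user", "tdd", "don't use TDD"),
    Instruction("default", "style", "be concise"),
])
print(decisions)
```

In this example the user's CLAUDE.md instruction overrides the skill on the TDD topic, while the default still applies where nothing else speaks.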

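Red-Flag Table

| Thought | Reality |
|---------|---------|
| "This is just a simple question" | Questions are tasks. Check for skills. |
| "I need more context first" | The skill check comes before clarifying questions. |
| "Let me explore the codebase first" | Skills tell you how to explore. Check first. |
| "This doesn't need a formal skill" | If a skill exists, use it. |
| "I remember this skill" | Skills evolve. Read the current version. |
| "This skill is overkill" | Simple things become complex. Use it. |
| "This feels productive" | Undisciplined action wastes time. |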

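🧠 4. Brainstorming and Design: Think First, Then Act

Amazon writes the press release before building the product. Basecamp writes a pitch document first. Google's design docs go through three levels of engineering review before the first function is written.

Hard Gate

<HARD-GATE>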
Do NOT invoke any implementation skill, write any code, scaffold
any project, or take any implementation action until you have
presented a design and the user has approved it. This applies to
EVERY project regardless of perceived simplicity.
</HARD-GATE>

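No code may be written until a design has been presented and a human has approved it. This is the Staff Engineer standing between the developer and the codebase and saying, "Show me the design doc first."

The "Too Simple to Need a Design" Trap

## Anti-Pattern: "This Is Too Simple To Need A Design"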

Every project goes through this process. A todo list, a
single-function utility, a config change — all of them. "Simple"
projects are where unexamined assumptions cause the most wasted
work. The design can be short (a few sentences for truly simple
projects), but you MUST present it and get approval.

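"Simple" projects are exactly the most dangerous ones, because nobody stops to check the assumptions. A config change ends up affecting three services, a "quick" utility function has to handle seven edge cases, and a "simple" todo app actually needs offline sync and conflict resolution.

Design Checklist (Nine Steps)

## Checklist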

1. **Explore project context** - check files, docs, recent commits
2. **Offer visual companion** (if topic will involve visual questions)
3. **Ask clarifying questions** - one at a time, understand purpose/constraints/success criteria
4. **Propose 2-3 approaches** - with trade-offs and your recommendation
5. **Present design** - in sections scaled to complexity, get user approval after each section
6. **Write design doc** - save to docs/senior-staff-engineer/specs/ and commit
7. **Spec self-review** - check for placeholders, contradictions, ambiguity
8. **User reviews written spec** - ask user to review before proceeding
9. **Transition to implementation** - invoke writing-plans skill

Ask Only One Question at a Time

- Ask questions one at a time to refine the idea
- Prefer multiple choice questions when possible
- Only one question per message
- Focus on understanding: purpose, constraints, success criteria

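A senior engineer does not ask, "What do you want?" They say, "Here are three options. I recommend B because of X and Y. What do you think?"

Scope Detection

Before asking detailed questions, assess scope: if the request
describes multiple independent subsystems (e.g., "build a platform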
with chat, file storage, billing, and analytics"), flag this
immediately.

Don't spend questions refining details of a project that needs to be decomposed first.

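A junior engineer hears "build a platform with chat, file storage, billing, and analytics" and starts asking about the chat message format. A Staff Engineer hears the same sentence and says, "These are four independent systems. Let's figure out which one to start with."

Design for Isolation

- Break the system into smaller units that each have one clear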
  purpose, communicate through well-defined interfaces, and can
  be understood and tested independently

- For each unit, you should be able to answer: what does it do,
  how do you use it, and what does it depend on?

- Can someone understand what a unit does without reading its
  internals? Can you change the internals without breaking
  consumers? If not, the boundaries need work.

After the Design Is Complete

- Write the validated design (spec) to
  `docs/senior-staff-engineer/specs/YYYY-MM-DD-<topic>-design.md`

- Commit the design document to git

Self-review checklist:

**Spec Self-Review:**

1. **Placeholder scan:** Any "TBD", "TODO", incomplete sections?
2. **Internal consistency:** Do any sections contradict each other?
3. **Scope check:** Focused enough for a single implementation plan?
4. **Ambiguity check:** Could any requirement be interpreted two
   different ways? If so, pick one and make it explicit.
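The placeholder scan is the one item in this checklist that is easy to automate. A minimal sketch, assuming the spec is plain Markdown text and that only the literal markers TBD and TODO are searched for; `scan_placeholders` is a hypothetical helper, and the other three checks still need a reader's judgment.

```python
import re

# Case-sensitive on purpose, so prose such as "a todo app" is not flagged
PLACEHOLDER = re.compile(r"\b(TBD|TODO)\b")

def scan_placeholders(spec_text: str):
    """Return (line_number, line) pairs for every line containing a placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(spec_text.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]

spec = """## Auth
Tokens expire after TBD minutes.
Sessions are stored server-side.
- TODO: decide on rate limits
"""
for number, line in scan_placeholders(spec):
    print(f"line {number}: {line}")
```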

Visual Companion

When the discussion involves UI layout, architecture diagrams, or visual comparisons, the system starts a lightweight local server:

# Start server with persistence
scripts/start-server.sh --project-dir /path/to/project

# Returns: {"type":"server-started","port":52341,
# "url":"http://localhost:52341",
# "screen_dir":"...content", "state_dir":"...state"}

The agent writes HTML fragments, the user views them in the browser and clicks an option, and each selection is recorded as JSON:

{"type":"click","choice":"a","text":"Option A","timestamp":1706000101}
{"type":"click","choice":"c","text":"Option C","timestamp":1706000108}
{"type":"click","choice":"b","text":"Option B","timestamp":1706000115}

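The click path reveals hesitation: the user tried A, jumped to C, and finally settled on B. A good engineer notices and asks, "You were drawn to A at first. What changed your mind?"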
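Reading hesitation out of the click log takes only a few lines. The sketch below assumes the events arrive as one JSON object per line, exactly as in the sample above; `summarize_clicks` is a hypothetical helper, not part of the visual companion itself.

```python
import json

def summarize_clicks(log_text: str) -> dict:
    """Reduce a JSON-lines click log to the path taken and the final choice."""
    events = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    path = [event["choice"] for event in events if event.get("type") == "click"]
    return {
        "path": path,
        "final": path[-1] if path else None,
        # More than one distinct choice means the user changed their mind
        "changed_mind": len(set(path)) > 1,
    }

log = """{"type":"click","choice":"a","text":"Option A","timestamp":1706000101}
{"type":"click","choice":"c","text":"Option C","timestamp":1706000108}
{"type":"click","choice":"b","text":"Option B","timestamp":1706000115}"""

print(summarize_clicks(log))
```

A `changed_mind` value of true is the cue for the follow-up question about what made the first option less attractive.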

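🔍 5. Spec Review: Finding the Logic Gaps

Every company has that one person who asks in a design review, "What happens if the database goes down and the user retries three times?" We use a separate sub-agent to play that role.

Task tool (general-purpose):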
  description: "Review spec document"
  prompt: |
    You are a spec document reviewer. Verify this spec is complete
    and ready for planning.
    **Spec to review:** [SPEC_FILE_PATH]

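Review dimensions:

• Completeness: TODOs, placeholders, unfinished sections
• Consistency: internal contradictions, conflicting requirements
• Clarity: whether a requirement is vague enough to get the wrong thing built
• Scope: whether it is focused enough for a single plan
• YAGNI: features nobody asked for, over-engineering

Key calibration:

Only flag issues that would cause real problems during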
implementation planning.

A missing section, a contradiction, or a requirement so ambiguous
it could be interpreted two different ways — those are issues.

Minor wording improvements, stylistic preferences, and "sections
less detailed than others" are not.

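📝 6. Writing Plans: Backlog Grooming

This is the agile team's "backlog grooming": the tech lead breaks the approved architecture into tickets clear enough that any developer can pick one up and start working without asking a single question.

Plan Philosophy

Write comprehensive implementation plans assuming the engineer has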
zero context for our codebase and questionable taste. Document
everything they need to know: which files to touch for each task,
code, testing, docs they might need to check, how to test it.

"Zero context and questionable taste": this is how a senior tech lead writes tickets.

The 2-5 Minute Rule

## Bite-Sized Task Granularity

**Each step is one action (2-5 minutes):**
- "Write the failing test" - step
- "Run it to make sure it fails" - step
- "Implement the minimal code to make the test pass" - step
- "Run the tests and make sure they pass" - step
- "Commit" - step

If a sub-agent spends more than 5 minutes on a task, the task is too broad, and the agent will start making assumptions or lose focus.

Task Format

### Task N: [Component Name]

**Files:**
- Create: `exact/path/to/file.py`
- Modify: `exact/path/to/existing.py:123-145`
- Test: `tests/exact/path/to/test.py`

- [ ] **Step 1: Write the failing test**

```python
def test_specific_behavior():
    result = function(input)
    assert result == expected
```

- [ ] **Step 2: Run test to verify it fails**

Run: pytest tests/path/test.py::test_name -v
Expected: FAIL with "function not defined"

Zero Tolerance for Placeholders

These are **plan failures** — never write them:

- "TBD", "TODO", "implement later", "fill in details"
- "Add appropriate error handling" / "add validation"
- "Write tests for the above" (without actual test code)
- "Similar to Task N" (repeat the code — the engineer may be
  reading tasks out of order)
- References to types, functions, or methods not defined in any task

Type Consistency Check

**3. Type consistency:** Do the types, method signatures, and
property names you used in later tasks match what you defined in
earlier tasks?

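🚀 7. Sub-Agent-Driven Development: The Art of Delegation

In an engineering organization, managers do not write all the code themselves. They hire specialists, give clear instructions, check the results, and manage the handoffs.

Core Principle

**Why subagents:** You delegate tasks to specialized agents with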
isolated context. By precisely crafting their instructions and
context, you ensure they stay focused and succeed at their task.
They should never inherit your session's context or history — you
construct exactly what they need.

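Context isolation. A human developer picking up a ticket does not need the whole company history, only the ticket, the relevant code, and the coding standards. The same applies here: a fresh agent, a clean head, a focused task.

Model Selection (Hiring at the Right Level)

## Model Selection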

**Mechanical implementation tasks** (isolated functions, clear specs,
1-2 files): use a fast, cheap model.

**Integration and judgment tasks** (multi-file coordination, pattern
matching, debugging): use a standard model.

**Architecture, design, and review tasks**: use the most capable
available model.

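It is like assigning tickets to junior, mid-level, or senior engineers. A well-managed plan means most tasks are "mechanical," and therefore cheap and fast.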
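The three tiers map naturally onto a small routing function. This is a sketch under stated assumptions: the tier labels "fast", "standard", and "most-capable" are placeholders rather than real model names, and the inputs (`files_touched`, `needs_judgment`, `is_design_or_review`) are a hypothetical way of describing a task.

```python
def choose_tier(files_touched: int, needs_judgment: bool, is_design_or_review: bool) -> str:
    """Pick a model tier following the three categories in the skill text."""
    if is_design_or_review:
        return "most-capable"   # architecture, design, and review tasks
    if needs_judgment or files_touched > 2:
        return "standard"       # integration, multi-file coordination, debugging
    return "fast"               # isolated, clearly specified, 1-2 files

print(choose_tier(files_touched=1, needs_judgment=False, is_design_or_review=False))
print(choose_tier(files_touched=5, needs_judgment=True, is_design_or_review=False))
print(choose_tier(files_touched=1, needs_judgment=False, is_design_or_review=True))
```

Checking the review flag first keeps a small but judgment-heavy review from being routed to the cheapest tier just because it touches one file.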

Handling Failure

Four return statuses:

**DONE:** Proceed to spec compliance review.

**DONE_WITH_CONCERNS:** The implementer completed the work but
flagged doubts. Read the concerns before proceeding.

**NEEDS_CONTEXT:** The implementer needs information that wasn't
provided. Provide the missing context and re-dispatch.

**BLOCKED:** The implementer cannot complete the task.

The iron rule:

**Never** ignore an escalation or force the same model to retry
without changes. If the implementer said it's stuck, something
needs to change.

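The worst managers answer "I'm stuck" with "try again." Our system explicitly forbids that. When an agent is stuck, the manager has three options: provide more context, upgrade the model, or split the task further.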
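The four statuses and the no-blind-retry rule can be sketched as a small state handler. The action names returned by `next_action` and the dictionary shape passed to `may_redispatch` are hypothetical; the point is only that a retry must differ from the previous attempt in context, model, or task scope.

```python
from enum import Enum

class Status(Enum):
    DONE = "DONE"
    DONE_WITH_CONCERNS = "DONE_WITH_CONCERNS"
    NEEDS_CONTEXT = "NEEDS_CONTEXT"
    BLOCKED = "BLOCKED"

def next_action(status: Status) -> str:
    """Map an implementer's status to what the manager does next."""
    if status is Status.DONE:
        return "spec_compliance_review"
    if status is Status.DONE_WITH_CONCERNS:
        return "read_concerns_first"
    if status is Status.NEEDS_CONTEXT:
        return "provide_context_and_redispatch"
    return "change_context_model_or_task"  # BLOCKED: something has to change

def may_redispatch(previous: dict, proposed: dict) -> bool:
    """A retry is allowed only if context, model, or task differs from last time."""
    return previous != proposed

first = {"model": "fast", "context": "ticket only", "task": "Task 3"}
retry = {"model": "standard", "context": "ticket only", "task": "Task 3"}
print(next_action(Status.BLOCKED), may_redispatch(first, retry))
```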

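👨‍💻 8. The Developer Job Description: The Implementer Prompt

When a company hires a developer, there are onboarding docs, coding standards, and a first task. This is the "job description" every sub-agent receives.

Permission to Ask Before Starting

## Before You Begin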

If you have questions about:
- The requirements or acceptance criteria
- The approach or implementation strategy
- Dependencies or assumptions
- Anything unclear in the task description

**Ask them now.** Raise any concerns before starting work.

The "Don't Guess" Rule

**While you work:** If you encounter something unexpected or
unclear, **ask questions**. It's always OK to pause and clarify.
Don't guess or make assumptions.

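The most expensive mistakes in an organization come from developers who assume they know what is needed. A sub-agent that stops to ask, "I'm not clear on null handling," costs 5 minutes. A sub-agent that guesses wrong costs an hour of rework.

Permission to Fail

## When You're in Over Your Head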

It is always OK to stop and say "this is too hard for me."
Bad work is worse than no work. You will not be penalized
for escalating.

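This is the "psychological safety" that great engineering cultures build.

✅ 9. Two-Stage Review: Spec Compliance + Code Quality

Once implementation is finished, the work passes through two independent review agents. The order matters: spec compliance first, then code quality. There is no point in polishing something that was built wrong.

Spec Compliance Reviewer

## CRITICAL: Do Not Trust the Report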

The implementer finished suspiciously quickly. Their report may be
incomplete, inaccurate, or optimistic. You MUST verify everything
independently.

**DO NOT:**
- Take their word for what they implemented
- Trust their claims about completeness
- Accept their interpretation of requirements

**DO:**
- Read the actual code they wrote
- Compare actual implementation to requirements line by line

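This is a compliance officer who does not accept self-audited reports. It checks three things: missing requirements, extra work, and misunderstood requirements. Every finding carries a file:line reference, so vague feedback is not possible.

Code Quality Reviewer

Triggered only after spec compliance passes:

- Does each file have one clear responsibility with a well-defined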
  interface?

- Are units decomposed so they can be understood and tested
  independently?

- Is the implementation following the file structure from the plan?

- Did this implementation create new files that are already large?

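The two-stage review mirrors the functional review ("Did we build the right thing?") and the technical review ("Did we build it right?") of real organizations. A feature that behaves correctly but is badly architected needs both perspectives.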
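The ordering constraint is easy to encode: the quality review is simply never invoked unless the spec review passes. A minimal sketch, assuming each reviewer is a callable that returns a dict with `passed` and `findings` keys; those names and `run_reviews` are hypothetical.

```python
def run_reviews(spec_review, quality_review) -> dict:
    """Run spec compliance first; run code quality only if the spec review passed."""
    spec = spec_review()
    if not spec["passed"]:
        # Stop here: there is no value in polishing work that misses the spec
        return {"stage": "spec", "passed": False, "findings": spec["findings"]}
    quality = quality_review()
    return {"stage": "quality", "passed": quality["passed"], "findings": quality["findings"]}

calls = []

def failing_spec():
    calls.append("spec")
    return {"passed": False, "findings": ["auth.py:42 missing token expiry"]}

def quality():
    calls.append("quality")
    return {"passed": True, "findings": []}

print(run_reviews(failing_spec, quality), calls)
```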

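🧪 10. Test-Driven Development: The Iron Law

Google, Microsoft, Stripe, Shopify: what separates professionals from hobbyists is not frameworks or algorithms but discipline under pressure.

Core Principle

**Core principle:** If you didn't watch the test fail, you don't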
know if it tests the right thing.

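If the code passes the moment you finish writing the test, what have you proven? Nothing. The test may be checking something entirely different.

The Red-Green-Refactor Cycle

RED: write one minimal failing test:

test('retries failed operations 3 times', async () => {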
  let attempts = 0;
  const operation = async () => {  // async so it matches fn: () => Promise<T>
    attempts++;
    if (attempts < 3) throw new Error('fail');
    return 'success';
  };

  const result = await retryOperation(operation);
  expect(result).toBe('success');
  expect(attempts).toBe(3);
});

GREEN: write the simplest code that passes:

async function retryOperation<T>(fn: () => Promise<T>): Promise<T> {
  for (let i = 0; i < 3; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === 2) throw e;
    }
  }
  throw new Error('unreachable');
}

The Delete Rule

Write code before the test? Delete it. Start over.

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

    
    
    
  | Excuse | Reality |
|--------|---------|
| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |
| "Keep as reference, write tests first" | You will adapt it. That's testing after. Delete means delete. |
| "TDD will slow me down" | TDD faster than debugging. Pragmatic = test-first. |

Why "Code First, Tests Later" Is Not the Same

Tests-after answer "What does this do?"
Tests-first answer "What should this do?"

Tests-after are biased by your implementation. You test what you
built, not what's required. You verify remembered edge cases, not
discovered ones.

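Write the code first, and your tests are shaped by your implementation. Write the tests first, and you are forced to think about behavior before implementing ("What should happen when the email is empty?"), which surfaces edge cases you would otherwise miss.

🐛 11. Systematic Debugging: Forensic Engineering

Every engineering organization has two kinds of debuggers:

• See the error → guess a fix → try it → it fails → guess again → burn hours
• The senior Staff Engineer: "Stop. Let me read the error message first."

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST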

If you haven't completed Phase 1, you cannot propose fixes.

Four-Phase Investigation

Phase 1: Root Cause Investigation (before touching anything)

1. **Read Error Messages Carefully**
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely

2. **Reproduce Consistently**
   - Can you trigger it reliably?
   - If not reproducible → gather more data, don't guess

3. **Check Recent Changes**
   - What changed that could cause this?
   - Git diff, recent commits

4. **Gather Evidence in Multi-Component Systems**

Multi-component systems need diagnostics added at every component boundary:

# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
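# Print only SET/UNSET; "${IDENTITY:-UNSET}" would echo the secret itself when it is set
echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"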

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains

Phase 3: Hypothesis and Testing

1. **Form Single Hypothesis**
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague

2. **Test Minimally**
   - Make the SMALLEST possible change to test hypothesis
   - One variable at a time
   - Don't fix multiple things at once

The Three-Fix Rule

- Count: How many fixes have you tried?
- If < 3: Return to Phase 1, re-analyze with new information
- **If ≥ 3: STOP and question the architecture**
- DON'T attempt Fix #4 without architectural discussion

**Pattern indicating architectural problem:**
- Each fix reveals new shared state/coupling/problem
- Fixes require "massive refactoring" to implement
- Each fix creates new symptoms elsewhere

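When one fix after another exposes a new problem, you are not dealing with a bug. You are dealing with a design flaw. A junior engineer keeps patching. A Staff Engineer stops and says, "We need to redesign this."

🔬 12. Root Cause Tracing and Defense in Depth

A Five-Layer Trace Example

1. `git init` runs in `process.cwd()` ← empty cwd parameter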
2. WorktreeManager called with empty projectDir
3. Session.create() passed empty string
4. Test accessed `context.tempDir` before beforeEach
5. setupCoreTest() returns `{ tempDir: '' }` initially

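The error surfaces at layer 1 (git runs in the wrong directory). A junior engineer would fix it there by adding a fallback directory. But the real bug is at layer 5: a variable is accessed before it is initialized. Fixing layer 1 hides the bug, and only fixing layer 5 removes it.

Defense in Depth

Single validation: "We fixed the bug"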
Multiple layers: "We made the bug impossible"

After finding the empty projectDir bug, the team added four layers of validation:

- Layer 1: `Project.create()` validates not empty/exists/writable
- Layer 2: `WorkspaceManager` validates projectDir not empty
- Layer 3: `WorktreeManager` refuses git init outside tmpdir in tests
- Layer 4: Stack trace logging before git init

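The environment guard at Layer 3:

import { normalize, resolve, sep } from 'path';
import { tmpdir } from 'os';

if (process.env.NODE_ENV === 'test') {
  const normalized = normalize(resolve(directory));
  const tmpDir = normalize(resolve(tmpdir()));

  // Require an exact match or a path-separator boundary, so that a
  // sibling such as "/tmp-other" is not mistaken for a path inside "/tmp".
  if (normalized !== tmpDir && !normalized.startsWith(tmpDir + sep)) {
    throw new Error(
      `Refusing git init outside temp dir during tests: ${directory}`
    );
  }
}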

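Even if layers 1 and 2 are bypassed, layer 3 makes it impossible to run git operations outside the temp directory during tests. The result: all 1847 tests pass, with zero pollution.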
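The same guard translates directly to other languages. Below is a Python sketch of the idea, assuming Python 3.9 or later for `Path.is_relative_to`; `assert_inside_tmp` is a hypothetical name, and reusing the `NODE_ENV` variable is only to mirror the example above.

```python
import os
import tempfile
from pathlib import Path

def assert_inside_tmp(directory: str) -> None:
    """During tests, refuse to operate on any directory outside the temp dir."""
    if os.environ.get("NODE_ENV") != "test":
        return  # the guard is active only in the test environment
    target = Path(directory).resolve()
    tmp_root = Path(tempfile.gettempdir()).resolve()
    # is_relative_to compares whole path components, so "/tmp-other" is rejected
    if not target.is_relative_to(tmp_root):
        raise RuntimeError(
            f"Refusing git init outside temp dir during tests: {directory}"
        )

os.environ["NODE_ENV"] = "test"
with tempfile.TemporaryDirectory() as workdir:
    assert_inside_tmp(workdir)  # allowed: the directory lives under the temp root
```

Comparing path components rather than raw string prefixes is the design choice that matters here, since a plain prefix check would accept a sibling directory whose name merely starts with the temp path.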

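✋ 13. Verification Before Completion: Let the Evidence Speak

Aviation, medicine, and finance share the concept of evidence-based verification. Boeing does not deliver an aircraft because an engineer says, "I'm sure the wing stress test passed." There are signed documents, timestamps, and engineer names.

The Iron Law

Claiming work is complete without verification is dishonesty,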
not efficiency.

**Core principle:** Evidence before claims, always.

NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE

If you haven't run the verification command in this message,
you cannot claim it passes.

"Fresh" is the key word. Not "I ran it earlier," not "it passed last time." It means now, at this moment, in this message.

The Verification Gate

BEFORE claiming any status or expressing satisfaction:

1. IDENTIFY: What command proves this claim?
2. RUN: Execute the FULL command (fresh, complete)
3. READ: Full output, check exit code, count failures
4. VERIFY: Does output confirm the claim?
   - If NO: State actual status with evidence
   - If YES: State claim WITH evidence
5. ONLY THEN: Make the claim

Skip any step = lying, not verifying
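The gate can be sketched as a function that refuses to produce a claim without running the command first. This is a minimal Python illustration, not the skill's implementation; `verify` and `claim` are hypothetical helpers, and the command shown simply runs the current Python interpreter so the example is self-contained.

```python
import subprocess
import sys

def verify(command: list) -> dict:
    """RUN the command fresh and READ its exit code and full output."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "exit_code": proc.returncode,
        "passed": proc.returncode == 0,
        "output": proc.stdout + proc.stderr,
    }

def claim(result: dict) -> str:
    """State the status together with the evidence that supports it."""
    status = "PASSED" if result["passed"] else "FAILED"
    return f"{status}: {' '.join(result['command'])} (exit code {result['exit_code']})"

# Stand-in for a real test command such as a project's test runner
result = verify([sys.executable, "-c", "print('2 tests ok')"])
print(claim(result))
```

Because `claim` takes the result of `verify` as its only input, there is no code path that states a status without fresh output behind it.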

Rationalization Prevention

| Excuse | Reality |
|--------|---------|
| "Should work now" | RUN the verification |
| "I'm confident" | Confidence ≠ evidence |
| "Just this once" | No exceptions |
| "Agent said success" | Verify independently |
| "I'm tired" | Exhaustion ≠ excuse |
| "Different words so rule doesn't apply" | Spirit over letter |

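The last row, "different words so the rule doesn't apply," catches agents that try to sidestep the verification gate by saying "the implementation looks correct" instead of "the tests pass."

👥 14. Code Review: The Peer Review Meeting

At Google, code review is mandatory, with no exceptions, even for the most senior Staff Engineers. Every pair of eyes catches a different class of bug.

The Reviewer (Senior Peer Reviewer)

Five review dimensions:

1. **Plan Alignment Analysis** - Compare implementation against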
   original planning document

2. **Code Quality Assessment** - Patterns, error handling, type
   safety, test coverage

3. **Architecture and Design Review** - SOLID principles,
   separation of concerns, coupling

4. **Documentation and Standards** - Comments, file headers,
   conventions

5. **Issue Identification** - Critical (must fix), Important
   (should fix), Suggestions (nice to have)

Requesting Review

**Mandatory:**
- After each task in subagent-driven development
- After completing major feature
- Before merge to main

**Optional but valuable:**
- When stuck (fresh perspective)
- Before refactoring (baseline check)
- After fixing complex bug

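The review agent receives only the git SHAs, the task description, and the plan requirements. No session history, no conversation context. Just the work product and the spec.

Receiving Review: Collaboration Etiquette

Forbidden responses:

**NEVER:**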
- "You are absolutely right!"
- "Great point!" / "Excellent feedback!"
- "Let me implement that now" (before verification)

**INSTEAD:**
- Restate the technical requirement
- Ask clarifying questions
- Push back with technical reasoning if wrong
- Just start working (actions > words)

Technical Verification (Good) vs. Performative Agreement (Bad)

**Performative Agreement (Bad):**
Reviewer: "Remove legacy code"
❌ "You're absolutely right! Let me remove that..."

**Technical Verification (Good):**
Reviewer: "Remove legacy code"
✅ "Checking... build target is 10.15+, this API needs 13+.
    Need legacy for backward compat. Current impl has wrong
    bundle ID - fix it or drop pre-13 support?"

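The second response shows real engineering spirit: no blind agreement and no defensive refusal. It checks the facts, finds the nuance (a backward-compatibility requirement), and offers options.

The YAGNI Check

IF reviewer suggests "implementing properly":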
  grep codebase for actual usage

IF unused: "This endpoint isn't called. Remove it (YAGNI)?"
IF used: Then implement properly

The Org-Chart Principle

"You and reviewer both report to me. If we don't need this
feature, don't add it."

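🌲 15. Git Worktrees: Biohazard Containment

Biology labs have the concept of containment levels. Researchers do not handle dangerous pathogens on the same bench where they eat lunch. They work in an isolation room.

Software engineering has the same problem. Git worktrees solve it by creating physically separate directories, so you can have main and feature/auth checked out as two active directories at the same time.

Directory Selection

### 1. Check Existing Directories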
ls -d .worktrees 2>/dev/null  # Preferred (hidden)
ls -d worktrees 2>/dev/null   # Alternative

**If found:** Use that directory. If both exist, `.worktrees/` wins.

### 2. Check CLAUDE.md
grep -i "worktree.*director" CLAUDE.md 2>/dev/null

**If preference specified:** Use it without asking.

### 3. Ask User

Safety Check: .gitignore Must Be Verified

**MUST verify directory is ignored before creating worktree:**

git check-ignore -q .worktrees 2>/dev/null || \
git check-ignore -q worktrees 2>/dev/null

**If NOT ignored:**
1. Add appropriate line to .gitignore
2. Commit the change
3. Proceed with worktree creation

**Why critical:** Prevents accidentally committing worktree
contents to repository.
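The check relies on the exit code of `git check-ignore -q`, which is 0 when the path is ignored. A Python sketch of the same logic follows; `is_ignored` and `worktree_dir_is_safe` are hypothetical helpers, and the `run` parameter exists only so the function can be exercised without a real repository.

```python
import subprocess
from types import SimpleNamespace

def is_ignored(path: str, cwd: str = ".", run=subprocess.run) -> bool:
    """Exit code 0 from `git check-ignore -q` means git ignores the path."""
    result = run(["git", "check-ignore", "-q", path], cwd=cwd)
    return result.returncode == 0

def worktree_dir_is_safe(cwd: str = ".", run=subprocess.run) -> bool:
    """Mirror the shell check: pass if either candidate directory is ignored."""
    return any(is_ignored(name, cwd, run) for name in (".worktrees", "worktrees"))

# A stand-in runner that pretends only ".worktrees" is ignored
def fake_run(command, cwd=None):
    return SimpleNamespace(returncode=0 if command[-1] == ".worktrees" else 1)

print(worktree_dir_is_safe(run=fake_run))
```

In real use the default `subprocess.run` would be left in place and the function called from inside the repository.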

Clean Baseline

### 4. Verify Clean Baseline

Run tests to ensure worktree starts clean:
npm test / cargo test / pytest / go test ./...

**If tests fail:** Report failures, ask whether to proceed
or investigate.

**If tests pass:** Report ready.

Project type is detected automatically:

# Node.js
if [ -f package.json ]; then npm install; fi

# Rust
if [ -f Cargo.toml ]; then cargo build; fi

# Python
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

# Go
if [ -f go.mod ]; then go mod download; fi
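The same marker-file detection can be expressed as a lookup table, which makes it easy to extend. A minimal Python sketch, using only the four marker files and commands from the shell snippet above; `setup_commands` is a hypothetical helper that returns the commands instead of running them.

```python
import tempfile
from pathlib import Path

# Marker file -> setup command, in the same order as the shell snippet
MARKERS = [
    ("package.json", "npm install"),
    ("Cargo.toml", "cargo build"),
    ("requirements.txt", "pip install -r requirements.txt"),
    ("go.mod", "go mod download"),
]

def setup_commands(project_dir: str) -> list:
    """Return the setup command for every marker file present in the project."""
    root = Path(project_dir)
    return [command for marker, command in MARKERS if (root / marker).is_file()]

with tempfile.TemporaryDirectory() as project:
    (Path(project) / "package.json").write_text("{}")
    (Path(project) / "go.mod").write_text("module example\n")
    print(setup_commands(project))
```

Returning a list rather than a single command keeps the behavior of the original, where a repository with several marker files runs every matching setup step.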

Branch Wrap-Up Workflow

**Core principle:** Verify tests → Present options → Execute
choice → Clean up.

Step 1: Verify the tests

### Step 1: Verify Tests

**Before presenting options, verify tests pass:**
npm test / cargo test / pytest / go test ./...

**If tests fail:**
Tests failing (<N> failures). Must fix before completing:

[Show failures]
Cannot proceed with merge/PR until tests pass.

Stop. Don't proceed to Step 2.

Step 2: Structured options

Implementation complete. What would you like to do?

1. Merge back to <base-branch> locally
2. Push and create a Pull Request
3. Keep the branch as-is (I will handle it later)
4. Discard this work

Which option?

Option 1 (merge locally):

# Switch to base branch
git checkout <base-branch>

# Pull latest
git pull

# Merge feature branch
git merge <feature-branch>

# Verify tests on merged result
<test command>

# If tests pass
git branch -d <feature-branch>

Option 2 (create a PR):

gh pr create --title "<title>" --body "$(cat <<'EOF'

## Summary
<2-3 bullets of what changed>

## Test Plan
- [ ] <verification steps>
EOF
)"

Option 4 (discard): requires typing an exact confirmation word

**Confirm first:**
This will permanently delete:
- Branch <name>
- All commits: <commit-list>
- Worktree at <path>

Type 'discard' to confirm.
Wait for exact confirmation.

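🎓 16. Writing Skills: The Academy Model

Google has its "Noogler" onboarding. Stripe's internal documentation standards let new engineers get up to speed on complex payment systems within weeks.

Great training departments do not just write documentation and hope people follow it. They test the documentation. They put a newcomer in front of the doc, watch where that person stumbles while trying to follow it, and then fix every stumbling point.

Skills as Test-Driven Development

**Writing skills IS Test-Driven Development applied to process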
documentation.**

You write test cases (pressure scenarios with subagents), watch
them fail (baseline behavior), write the skill (documentation),
watch tests pass (agents comply), and refactor (close loopholes).

**Core principle:** If you didn't watch an agent fail without the
skill, you don't know if the skill teaches the right thing.

Mapping TDD to Documentation

| TDD Concept      | Skill Creation                    |
|------------------|-----------------------------------|
| Test case        | Pressure scenario with subagent   |
| Production code  | Skill document (SKILL.md)         |
| Test fails (RED) | Agent violates rule without skill |
| Test passes      | Agent complies with skill present |
| Refactor         | Close loopholes                   |

Different Skill Types, Different Tests

### Discipline-Enforcing Skills (rules/requirements)
**Test with:**
- Academic questions: Do they understand the rules?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined: time + sunk cost + exhaustion

### Technique Skills (how-to guides)
**Test with:**
- Application scenarios: Can they apply the technique correctly?
- Missing information tests: Do instructions have gaps?

### Reference Skills (documentation/APIs)
**Test with:**
- Retrieval scenarios: Can they find the right information?
- Gap testing: Are common use cases covered?

Bulletproofing Against Rationalization

# Bad
Write code before test? Delete it.

# Good
Write code before test? Delete it. Start over.

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete

**Violating the letter of the rules is violating the spirit
of the rules.**

A Real Bug That Was Found

**CRITICAL: Description = When to Use, NOT What the Skill Does**

Testing revealed that when a description summarizes the skill's
workflow, Claude may follow the description instead of reading the
full skill content. A description saying "code review between tasks"
caused Claude to do ONE review, even though the skill's flowchart
clearly showed TWO reviews (spec compliance then code quality).

The fix: the description states only the triggering conditions and never summarizes the workflow.

# Bad: Summarizes workflow, agent may follow this shortcut
description: Use when executing plans - dispatches subagent per
  task with code review between tasks

# Good: Just triggering conditions, no workflow summary
description: Use when executing implementation plans with
  independent tasks in the current session

Token Efficiency

**Target word counts:**
- getting-started workflows: <150 words each
- Frequently-loaded skills: <200 words total
- Other skills: <500 words (still be concise)
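These budgets are simple enough to check mechanically. The sketch below counts whitespace-separated words and compares the total with the limit for the skill's category; the category keys and `within_budget` are hypothetical names, and a real check might count tokens rather than words.

```python
# Word limits taken from the target word counts above
LIMITS = {
    "getting-started": 150,
    "frequently-loaded": 200,
    "other": 500,
}

def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

def within_budget(text: str, category: str) -> bool:
    """True when the skill text stays strictly under its category's word limit."""
    return word_count(text) < LIMITS[category]

skill_text = "Use when executing implementation plans with independent tasks."
print(word_count(skill_text), within_budget(skill_text, "frequently-loaded"))
```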

👤 17. Agents and Commands: The Team Directory

The Code Reviewer Agent

agents/code-reviewer.md

"Senior Code Reviewer, expertise in software architecture, 
design patterns, and best practices."

Deprecated Commands (a Graceful Migration)

# brainstorm.md
---
description: "Deprecated - use the senior-staff-engineer:brainstorming
skill instead"
---

Tell your human partner that this command is deprecated and will
be removed in the next major release.

# write-plan.md
---
description: "Deprecated - use the senior-staff-engineer:writing-plans
skill instead"
---

Tell your human partner that this command is deprecated and will
be removed in the next major release.

# execute-plan.md
---
description: "Deprecated - use the senior-staff-engineer:executing-plans
skill instead"
---

Tell your human partner that this command is deprecated and will
be removed in the next major release.

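This is a pattern every long-lived software project goes through. Early on, you build shortcuts because they are fast and simple. As the system matures, you discover that those shortcuts bypass important process. So the team deprecates them: rather than deleting them, it keeps the old interface, has it explain the change, and gives people time to adapt.

🧪 18. Field Test

The author tested the whole system on a real project: building an interactive visual guide to skill relationships, based on YouTube transcripts.

Step 1: Explore

The agent did not jump straight into planning or architecture. It quietly worked to understand the existing code by analyzing the folder structure, checking how the transcripts were stored, and reading the documentation. This is the "explore project context" step, and it was done thoroughly.

Step 2: Offer the visual companion

The agent proactively suggested that some design questions would be better answered visually in the browser. The brainstorming skill teaches it to assess whether each question needs a visual aid, and it then made that call on its own.

Step 3: Summarize the exploration, then ask

Before asking its first clarifying question, the agent summarized what it had found: 56 raw transcripts, 88 wiki pages, and a cross-reference structure built on Obsidian wikilinks. It showed its homework first and asked for direction second.

Step 4: Socratic questioning

One question at a time:

1. "How should students primarily discover content?"
2. "When a student clicks a node, what should the detail view show?"
3. "When you publish a new video, how do you want to update this system?"

Each question built on the previous answer. No barrage of questions. One line of thought, followed to the end.

Step 5: Present the options visually

The agent wrote HTML to the visual companion, showing three different Knowledge Explorer approaches, each with its trade-offs, rendered in the browser so the user could see and click rather than read and imagine.

The result: a force-directed graph layout with nodes such as Claude Code, MCP, Token Management, RAG, and Skills, color-coded by category. Nodes can be dragged, zoomed, and clicked for details.

Up to this point, the entire session was pure thinking. No implementation. No scaffolding. No "let me set up the project structure." Just an agent that explored the codebase, asked focused questions, presented visual options, and waited for human approval.

This is what genuinely following a disciplined process looks like.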

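🚀 19. How to Improve It Further

1. Add cross-session memory: today every session starts fresh. A persistent memory layer that tracks past design decisions, known bugs, and architectural patterns would let the agent accumulate institutional knowledge like a real senior engineer.
2. Cost-aware model routing: today model selection depends on the manager's judgment. Add actual token cost tracking and automatically route simple tasks to cheaper models.
3. Inter-agent communication: today sub-agents report only to the manager. Add a messaging system between parallel agents so they can share findings.
4. Automated regression testing for skills: skills are pressure-tested by hand when they are created, but they can drift as the underlying model changes. A CI pipeline could run the pressure scenarios nightly and flag when agents start getting around a skill's defenses.
5. Human feedback loop integration: when a human overrides a decision or rejects a design, that signal should flow back into the skill system and reinforce the specific rule that was about to be violated, so the team learns from corrections.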

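📌 Summary

The core idea of this system is simple: make AI agents work like a real engineering team.

Not faster, but more predictable. Not more features, but fewer bugs. Not smarter, but more disciplined.

Key takeaways:

✅ The 1% rule: if there is even a 1% chance a skill applies, it must be invoked
✅ The hard gate: no code before the design is approved
✅ Context isolation: sub-agents do not inherit the main agent's context
✅ The TDD iron law: if you did not watch the test fail, it is not TDD
✅ Two-stage review: spec compliance, then code quality, in that order
✅ The verification gate: no completion claims without fresh evidence
✅ The three-fix rule: after three failed fixes, question the architecture
✅ Defense in depth: do not just fix the bug, make the bug impossible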