From Prompt Engineering to Harness Engineering: The Next Wave of Evolution for LLM Systems
Think back to early 2023. People wrote prompts like incantations: chain-of-thought, few-shot examples, role-playing a Stanford professor. And honestly? It worked. At least for a while.
But the industry is slowly waking up to an uncomfortable truth: prompt engineering, on its own, does not scale.
Prompts don't scale because the problem was never just what you say to the LLM. The problem is everything around that: what happens before the prompt is sent, what happens during generation, what happens after the model responds, what to do when the model fails, when users do something unexpected, when the data changes, when business requirements shift. A production-grade LLM system is not a prompt. It is a system, and systems require engineering.
This article is about the next step. I call it **harness engineering**, and I want to argue that the real work, and the real value, now lives here.
1. Why Prompt Engineering Is No Longer Enough
I don't want to speak in generalities, because empty criticism is cheap.
The reliability problem
Prompts are inherently brittle. A prompt carefully tuned on gpt-4-turbo starts misbehaving the moment the model updates. You tweak one sentence for one use case and quietly break another. There is no unit test that can answer "will this prompt still make sense in six months?"
Consider a concrete scenario: you build a customer support chatbot. You spend two weeks polishing the perfect system prompt. It handles the edge cases. It is polite. It knows not to promise refunds. You ship it.
Two months later, the product has a new feature. The refund policy has changed. A new model version has shipped. And you have no systematic way of knowing whether your prompt still works.
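You can catch this drift the same way you catch any regression: pin down a set of golden inputs and invariants, and re-run them on every prompt, policy, or model change. Here is a minimal sketch of the idea; the `call_model` stub and the specific invariants are illustrative placeholders, not a real provider integration:

```python
# Minimal prompt regression check. `call_model` is a stand-in for your real
# LLM call; the invariants below are illustrative business rules.
FORBIDDEN_PHRASES = ["guaranteed refund", "full refund immediately"]

def call_model(prompt: str) -> str:
    # Placeholder: in production this would hit your LLM provider.
    return "I can help with that, subject to our refund policy."

def check_invariants(reply: str) -> list[str]:
    """Return the list of violated invariants (empty means the reply passes)."""
    violations = []
    if len(reply) > 600:
        violations.append("reply too long")
    for phrase in FORBIDDEN_PHRASES:
        if phrase in reply.lower():
            violations.append(f"forbidden phrase: {phrase}")
    return violations

def run_regression(prompts: list[str]) -> dict:
    """Map each failing prompt to its violations; {} means a clean run."""
    failures = {}
    for p in prompts:
        v = check_invariants(call_model(p))
        if v:
            failures[p] = v
    return failures

if __name__ == "__main__":
    golden_prompts = ["Can I get a refund?", "My flight was cancelled."]
    print(run_regression(golden_prompts))  # → {} (all invariants hold)
```

The point is not these particular checks; it is that the check runs automatically on every change, so drift becomes a red build instead of a customer complaint.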
The evaluation vacuum
The way most teams evaluate LLM output is to eyeball it. Someone reads a batch of sample responses and says "hmm, looks fine." That is no different from clicking around your software in a browser for ten minutes and then shipping it.
Real software engineering has regression tests, integration tests, and performance benchmarks. Most companies' LLM systems have... a Slack channel where people paste screenshots of failures.
The observability gap
When a traditional API breaks, you get stack traces, logs, and error codes. When an LLM gives a subtly wrong answer, confidently, fluently, plausibly, you may never know anything went wrong at all. This failure mode is invisible.
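The first step toward making that failure mode visible is structured tracing of every model call: latency, outputs, errors, all tied to a trace ID. A minimal vendor-neutral sketch of the pattern (the record fields are just a starting point, not any particular SDK's schema):

```python
import time
import uuid

def traced_call(fn, *args, log_sink: list, **kwargs):
    """Wrap any model call; record a structured trace even when it raises."""
    record = {"trace_id": str(uuid.uuid4()), "fn": fn.__name__, "error": None}
    start = time.perf_counter()
    try:
        output = fn(*args, **kwargs)
        record["output_preview"] = str(output)[:200]
        return output
    except Exception as e:
        record["error"] = repr(e)
        raise
    finally:
        # In production: ship this to your tracing backend, not a list.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        log_sink.append(record)

if __name__ == "__main__":
    sink = []
    traced_call(lambda q: f"echo: {q}", "hello", log_sink=sink)
    print(sink[0]["output_preview"])  # → echo: hello
```

Once every call leaves a trace, "subtly wrong" answers become something you can sample, diff, and alert on instead of something you discover from users.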
The composition problem
Prompts don't compose. You can't import prompt_A and prompt_B and expect them to work together cleanly. Yet real applications need the LLM to do many things: retrieve context, reason over it, call tools, validate output, handle failures, retry. A single prompt cannot do all of that reliably, and duct-taping prompts together is not engineering.
2. Enter Harness Engineering
So what is harness engineering?
Think of it this way: the prompt tells the model what to do. The harness is everything that surrounds the model, the scaffolding that makes the system behave correctly, predictably, and safely no matter what the model does or fails to do.
The word "harness" is deliberate. In electrical engineering, a harness organizes and protects cables, making sure power is routed correctly through a complex system. In software testing, a test harness wraps your code in the infrastructure needed to verify it. In both cases, the harness does not replace the component; it governs it.
Harness engineering is the discipline of building infrastructure around the LLM: eval loops, guardrails, orchestration layers, observability pipelines, fallback mechanisms. It treats the LLM as one component inside a larger system, rather than treating the LLM itself as the system.
Let me break this down into three core pillars.
Pillar One: Eval Loops, or Testing LLMs Like an Adult
If you don't have evals, you don't have a production system. You have a demo.
An eval loop is a systematic, automated process for measuring whether your LLM does what you expect. It answers two questions: "how do I know my system still works?" and "how do I know whether my latest change made things better or worse?"
A minimal eval loop (a Python implementation)
Let's build one from scratch. Suppose you have a summarization system and want to know whether your prompt is producing good summaries.
```python
import anthropic
import json
from dataclasses import dataclass


@dataclass
class EvalCase:
    input_text: str
    expected_properties: dict  # what we expect, not exact strings


def summarize(client: anthropic.Anthropic, text: str) -> str:
    """The system under test."""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[
            {
                "role": "user",
                "content": f"Summarize the following text in 2-3 sentences:\n\n{text}"
            }
        ]
    )
    return message.content[0].text


def llm_judge(client: anthropic.Anthropic, output: str, criteria: dict) -> dict:
    """
    Use another LLM call to evaluate the output.
    This is the 'LLM as judge' pattern - powerful for subjective criteria.
    """
    judge_prompt = f"""
You are evaluating the quality of a text summary.

Summary to evaluate:
---
{output}
---

Evaluate against these criteria and respond ONLY with a JSON object:
{json.dumps(criteria, indent=2)}

For each criterion, provide a score from 0-10 and a brief reason.
Format: {{"criterion_name": {{"score": 8, "reason": "..."}}}}
"""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {}


def run_eval_suite(client: anthropic.Anthropic, cases: list[EvalCase]) -> dict:
    """Run the full evaluation suite and return aggregated results."""
    results = []
    for i, case in enumerate(cases):
        print(f"Running case {i + 1}/{len(cases)}...")
        output = summarize(client, case.input_text)
        scores = llm_judge(client, output, case.expected_properties)
        results.append({
            "input_preview": case.input_text[:100] + "...",
            "output": output,
            "scores": scores
        })

    # Aggregate
    all_scores = []
    for result in results:
        for criterion, data in result["scores"].items():
            if isinstance(data, dict) and "score" in data:
                all_scores.append(data["score"])

    return {
        "total_cases": len(cases),
        "average_score": sum(all_scores) / len(all_scores) if all_scores else 0,
        "details": results
    }


# --- Example Usage ---
if __name__ == "__main__":
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

    eval_cases = [
        EvalCase(
            input_text="""
            Quantum computing uses quantum mechanical phenomena like superposition and entanglement
            to process information in fundamentally different ways than classical computers.
            While classical computers use bits (0 or 1), quantum computers use qubits which can
            exist in multiple states simultaneously. This allows quantum computers to solve certain
            problems exponentially faster, particularly in cryptography, drug discovery, and
            optimization problems. However, qubits are extremely fragile and maintaining quantum
            coherence is one of the biggest engineering challenges in the field.
            """,
            expected_properties={
                "conciseness": "Is the summary 2-3 sentences and appropriately brief?",
                "accuracy": "Does it accurately capture the core concepts without introducing errors?",
                "clarity": "Is it understandable to a non-technical reader?"
            }
        ),
        EvalCase(
            input_text="""
            The French Revolution, beginning in 1789, fundamentally transformed France and
            influenced the entire world. Driven by Enlightenment ideas, financial crisis, and
            social inequality, French citizens overthrew the monarchy. The revolution produced
            the Declaration of the Rights of Man, abolished feudalism, and eventually led to
            the rise of Napoleon Bonaparte. Its ideals of liberty, equality, and fraternity
            became foundational to modern democratic thought.
            """,
            expected_properties={
                "conciseness": "Is the summary 2-3 sentences and appropriately brief?",
                "accuracy": "Does it accurately capture key events and their significance?",
                "clarity": "Is it understandable and well-organized?"
            }
        )
    ]

    report = run_eval_suite(client, eval_cases)
    print("\n=== EVAL REPORT ===")
    print(f"Cases run: {report['total_cases']}")
    print(f"Average score: {report['average_score']:.1f}/10")
```
This is the skeleton of a real evaluation system. In production you would store these results in a database, track them over time, and alert when scores drop below a threshold. Companies like Braintrust and LangSmith have built entire platforms around this pattern.
The key insight: **your eval suite is the contract between you and your LLM system.** Every change you make, whether a prompt update, a model swap, or a context modification, gets checked against that contract before it ships.
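A contract can be enforced mechanically. Here is a sketch of a CI gate that consumes the report shape produced by `run_eval_suite` above; the thresholds are illustrative assumptions, not recommendations:

```python
def gate_deployment(report: dict, min_average: float = 7.0,
                    min_case_score: float = 5.0) -> tuple[bool, list[str]]:
    """Return (ok, reasons). Block the deploy if the eval suite regressed."""
    reasons = []
    if report["average_score"] < min_average:
        reasons.append(
            f"average {report['average_score']:.1f} below {min_average}")
    # Also flag any individual criterion that scored badly, even if the
    # average looks healthy.
    for detail in report.get("details", []):
        for criterion, data in detail.get("scores", {}).items():
            if isinstance(data, dict) and data.get("score", 10) < min_case_score:
                reasons.append(f"criterion '{criterion}' scored {data['score']}")
    return (not reasons, reasons)

if __name__ == "__main__":
    report = {"total_cases": 1, "average_score": 8.2,
              "details": [{"scores": {"accuracy": {"score": 9, "reason": "ok"}}}]}
    ok, reasons = gate_deployment(report)
    print(ok, reasons)  # → True []
```

Wired into CI, this turns "the eval suite is a contract" from a slogan into a merge blocker.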
Pillar Two: Guardrails, Because the Model Will Surprise You
Guardrails are the validation, filtering, and safety layers around an LLM call. They answer the question: "even if the model does something unexpected, how do I keep it from becoming a user-visible failure or a business risk?"
There are two kinds of guardrails worth knowing:
- **Input guardrails**: validate or transform what goes into the model
- **Output guardrails**: validate or transform what comes out of the model
A real-world case: the Air Canada chatbot incident
In 2024, Air Canada lost a legal dispute because its chatbot gave a customer incorrect information about bereavement fares, and the tribunal held Air Canada responsible for what its chatbot said. That was an expensive output-guardrail problem: the model hallucinated a policy, and no validation layer caught it before it reached the customer.
Here is how you might build guardrails against this class of problem:
```python
import json
import re
import anthropic
from dataclasses import dataclass
from enum import Enum


class GuardrailResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    MODIFIED = "modified"


@dataclass
class GuardrailOutcome:
    status: GuardrailResult
    content: str
    reason: str = ""


class InputGuardrail:
    """Validates and sanitizes user input before it reaches the LLM."""

    def __init__(self, max_length: int = 2000, blocked_patterns: list[str] = None):
        self.max_length = max_length
        self.blocked_patterns = blocked_patterns or []

    def run(self, user_input: str) -> GuardrailOutcome:
        # Length check
        if len(user_input) > self.max_length:
            return GuardrailOutcome(
                status=GuardrailResult.MODIFIED,
                content=user_input[:self.max_length],
                reason=f"Input truncated from {len(user_input)} to {self.max_length} chars"
            )
        # Pattern blocking (e.g., prompt injection attempts)
        for pattern in self.blocked_patterns:
            if re.search(pattern, user_input, re.IGNORECASE):
                return GuardrailOutcome(
                    status=GuardrailResult.FAIL,
                    content="",
                    reason=f"Input blocked by pattern: {pattern}"
                )
        return GuardrailOutcome(status=GuardrailResult.PASS, content=user_input)


class OutputGuardrail:
    """
    Validates LLM output against business rules.
    Uses a secondary LLM call for semantic validation.
    """

    def __init__(self, client: anthropic.Anthropic, business_rules: list[str]):
        self.client = client
        self.business_rules = business_rules

    def run(self, llm_output: str) -> GuardrailOutcome:
        rules_text = "\n".join(f"- {rule}" for rule in self.business_rules)
        validation_prompt = f"""
You are a strict policy compliance checker. Analyze the following response
and check if it violates any business rules.

Business Rules:
{rules_text}

Response to check:
---
{llm_output}
---

Respond ONLY with a JSON object:
{{
  "compliant": true/false,
  "violations": ["list of violated rules if any"],
  "safe_to_show": true/false
}}
"""
        result = self.client.messages.create(
            model="claude-opus-4-5",
            max_tokens=256,
            messages=[{"role": "user", "content": validation_prompt}]
        )
        try:
            verdict = json.loads(result.content[0].text)
        except json.JSONDecodeError:
            # If we can't parse the verdict, fail safe
            return GuardrailOutcome(
                status=GuardrailResult.FAIL,
                content=llm_output,
                reason="Could not parse compliance verdict - failing safe"
            )
        if not verdict.get("safe_to_show", True):
            return GuardrailOutcome(
                status=GuardrailResult.FAIL,
                content="I'm sorry, I can't provide that information. Please contact support.",
                reason=f"Violations: {verdict.get('violations', [])}"
            )
        return GuardrailOutcome(status=GuardrailResult.PASS, content=llm_output)


class GuardedLLMPipeline:
    """Wraps an LLM call with input and output guardrails."""

    def __init__(
        self,
        client: anthropic.Anthropic,
        system_prompt: str,
        input_guardrail: InputGuardrail,
        output_guardrail: OutputGuardrail
    ):
        self.client = client
        self.system_prompt = system_prompt
        self.input_guardrail = input_guardrail
        self.output_guardrail = output_guardrail

    def run(self, user_input: str) -> dict:
        # Step 1: Input guardrail
        input_check = self.input_guardrail.run(user_input)
        if input_check.status == GuardrailResult.FAIL:
            return {
                "response": "I can't process that request.",
                "blocked_at": "input",
                "reason": input_check.reason
            }
        safe_input = input_check.content

        # Step 2: LLM call
        message = self.client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=self.system_prompt,
            messages=[{"role": "user", "content": safe_input}]
        )
        raw_output = message.content[0].text

        # Step 3: Output guardrail
        output_check = self.output_guardrail.run(raw_output)
        return {
            "response": output_check.content,
            "passed_output_guardrail": output_check.status == GuardrailResult.PASS,
            "reason": output_check.reason
        }


# --- Example: Airline Customer Support ---
if __name__ == "__main__":
    client = anthropic.Anthropic()
    pipeline = GuardedLLMPipeline(
        client=client,
        system_prompt="""You are a customer support agent for FlyHigh Airlines.
Be helpful, empathetic, and accurate. Only discuss policies you know for certain.""",
        input_guardrail=InputGuardrail(
            max_length=500,
            blocked_patterns=[r"ignore previous instructions", r"jailbreak", r"DAN"]
        ),
        output_guardrail=OutputGuardrail(
            client=client,
            business_rules=[
                "Never promise refunds without explicitly saying 'subject to our refund policy'",
                "Never state specific compensation amounts unless they are $0, $50, or $200",
                "Never make claims about competitor airlines",
                "Always recommend contacting support@flyhigh.com for complex issues"
            ]
        )
    )
    result = pipeline.run("Can I get a full refund if my flight is cancelled?")
    print(result["response"])
```
The patterns here, failing safe, semantic validation, and using a second model call as a judge, are exactly what separates toy LLM apps from production systems.
Pillar Three: Tool Orchestration, or Teaching LLMs to Act, Not Just Talk
The most powerful shift in LLM systems over the past year is not better prompts but **tool use**: giving the model the ability to take actions rather than just generate text. And orchestrating multiple tools reliably is a genuine engineering problem in its own right.
Most people do this wrong: hand the model a dozen tools and hope it figures things out. Real harness engineering means explicit orchestration logic, routing, retrying, validating, and sequencing tool calls deterministically wherever possible, and deferring to the model's probabilistic judgment only where necessary.
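To make "deterministic where possible" concrete, here is a tiny pre-router sketch. Requests the harness can classify with plain rules never reach the model at all; the tool names are hypothetical:

```python
import re

def route(user_input: str) -> str:
    """Deterministic pre-routing: only ambiguous requests reach the LLM planner."""
    text = user_input.strip()
    # Pure arithmetic never needs a model call.
    if re.fullmatch(r"[\d+\-*/(). ]+", text):
        return "calculator"
    # A bare URL should go straight to the fetcher.
    if text.lower().startswith(("http://", "https://")):
        return "url_reader"
    # Everything else is genuinely ambiguous: let the model decide.
    return "llm_planner"

if __name__ == "__main__":
    print(route("2 + 2 * 3"))                 # → calculator
    print(route("https://example.com/post"))  # → url_reader
    print(route("compare quantum vendors"))   # → llm_planner
```

Every request that takes the deterministic path is one that cannot hallucinate, costs nothing, and returns in microseconds; the model handles only the residue.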
A realistic tool orchestration example
Let's build a research assistant that can search the web, read URLs, and synthesize findings, with proper error handling and fallbacks.
```python
import anthropic
import json
import httpx
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolResult:
    tool_name: str
    success: bool
    data: Any
    error: str = ""


class ToolRegistry:
    """Central registry for all tools available to the orchestrator."""

    def __init__(self):
        self._tools = {}

    def register(self, name: str, handler: callable, schema: dict):
        self._tools[name] = {"handler": handler, "schema": schema}

    def get_schemas(self) -> list[dict]:
        return [tool["schema"] for tool in self._tools.values()]

    def execute(self, name: str, inputs: dict) -> ToolResult:
        if name not in self._tools:
            return ToolResult(tool_name=name, success=False, data=None,
                              error=f"Tool '{name}' not found in registry")
        try:
            result = self._tools[name]["handler"](**inputs)
            return ToolResult(tool_name=name, success=True, data=result)
        except Exception as e:
            return ToolResult(tool_name=name, success=False, data=None, error=str(e))


def mock_web_search(query: str, max_results: int = 3) -> list[dict]:
    """
    In production, this would call a real search API (Brave, Serper, etc.)
    We mock it here to keep the example self-contained.
    """
    return [
        {"title": f"Result 1 for '{query}'", "url": "https://example.com/1",
         "snippet": f"Detailed info about {query} from source 1..."},
        {"title": f"Result 2 for '{query}'", "url": "https://example.com/2",
         "snippet": f"Another perspective on {query} from source 2..."},
    ]


def fetch_url_content(url: str) -> str:
    """Fetch and return text content from a URL."""
    try:
        # In production, add proper headers, timeouts, and HTML parsing
        response = httpx.get(url, timeout=10.0, follow_redirects=True)
        response.raise_for_status()
        return response.text[:3000]  # Limit content length
    except httpx.HTTPError as e:
        raise ValueError(f"Failed to fetch {url}: {e}")


def calculate(expression: str) -> float:
    """Safely evaluate a mathematical expression."""
    # In production, use a proper math parser - never raw eval()
    allowed_chars = set("0123456789+-*/()., ")
    if not all(c in allowed_chars for c in expression):
        raise ValueError(f"Unsafe expression: {expression}")
    return eval(expression)  # noqa: S307 - safe after allowlist check


class AgenticOrchestrator:
    """
    Runs an agentic loop: LLM decides what tools to call,
    orchestrator executes them, feeds results back, repeats.
    """

    def __init__(self, client: anthropic.Anthropic, registry: ToolRegistry,
                 max_iterations: int = 5):
        self.client = client
        self.registry = registry
        self.max_iterations = max_iterations

    def run(self, task: str, system_prompt: str = "") -> str:
        messages = [{"role": "user", "content": task}]
        iteration = 0
        print(f"\n🔧 Starting agentic loop for: {task[:80]}...")

        while iteration < self.max_iterations:
            iteration += 1
            print(f"\n  Iteration {iteration}/{self.max_iterations}")

            response = self.client.messages.create(
                model="claude-opus-4-5",
                max_tokens=2048,
                system=system_prompt,
                tools=self.registry.get_schemas(),
                messages=messages
            )

            # Check if model is done
            if response.stop_reason == "end_turn":
                final_text = next(
                    (block.text for block in response.content if hasattr(block, "text")),
                    "Task complete."
                )
                print(f"  ✅ Agent finished after {iteration} iterations")
                return final_text

            # Process tool calls
            if response.stop_reason == "tool_use":
                # Add assistant's response to history
                messages.append({"role": "assistant", "content": response.content})
                tool_results = []
                for block in response.content:
                    if block.type != "tool_use":
                        continue
                    print(f"  🔨 Tool call: {block.name}({json.dumps(block.input)[:60]}...)")
                    result = self.registry.execute(block.name, block.input)
                    if not result.success:
                        print(f"  ❌ Tool failed: {result.error}")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result.data) if result.success else f"ERROR: {result.error}",
                        "is_error": not result.success
                    })
                messages.append({"role": "user", "content": tool_results})

        return "Max iterations reached. Partial results may be incomplete."


# --- Wire it all together ---
if __name__ == "__main__":
    client = anthropic.Anthropic()
    registry = ToolRegistry()

    registry.register(
        name="web_search",
        handler=mock_web_search,
        schema={
            "name": "web_search",
            "description": "Search the web for information on a topic",
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "max_results": {"type": "integer", "description": "Max results", "default": 3}
                },
                "required": ["query"]
            }
        }
    )
    registry.register(
        name="calculate",
        handler=calculate,
        schema={
            "name": "calculate",
            "description": "Evaluate a mathematical expression",
            "input_schema": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression to evaluate"}
                },
                "required": ["expression"]
            }
        }
    )

    orchestrator = AgenticOrchestrator(
        client=client,
        registry=registry,
        max_iterations=6
    )
    result = orchestrator.run(
        task="Research the current state of quantum computing and estimate how many years until RSA-2048 encryption might be at risk. Show your calculation.",
        system_prompt="You are a thorough research assistant. Always verify claims with searches and show your reasoning."
    )
    print("\n=== FINAL ANSWER ===")
    print(result)
```
Notice what is happening in the orchestrator: we are not just prompting the model. We are building a feedback loop: the model calls a tool, the tool result comes back, the model decides what to do next. This is the fundamental pattern behind Claude's computer use, ChatGPT's code interpreter, and every serious production LLM agent today.
3. Putting It All Together: The Harness Stack
Here is what a complete harness looks like in the real world. Think of it as a stack of layers:
```
┌─────────────────────────────────────────┐
│  User / External System                 │
├─────────────────────────────────────────┤
│  Input Guardrails Layer                 │ ← Validate, sanitize, route
├─────────────────────────────────────────┤
│  Context Management Layer               │ ← RAG, memory, history trimming
├─────────────────────────────────────────┤
│  LLM Call(s)                            │ ← The actual model
├─────────────────────────────────────────┤
│  Output Guardrails Layer                │ ← Validate, filter, format
├─────────────────────────────────────────┤
│  Observability Layer                    │ ← Log, trace, alert
├─────────────────────────────────────────┤
│  Eval Loop (Async)                      │ ← Score, compare, regress
└─────────────────────────────────────────┘
```
Real teams building this stack at scale include:
- Notion AI: uses eval pipelines to continuously test its writing-assistance features across dozens of document types
- Klarna: built guardrails around its customer-service LLM, reportedly handling millions of conversations with a policy compliance check on every response
- GitHub Copilot: the entire system is a harness: code-context retrieval, intent classification, security-vulnerability filtering of responses, latency optimization. The prompt is almost the least interesting part.