11｜结构化输出：为什么 JSON 能让系统更稳定

qqxhb

932人浏览 · 2026-03-18 09:59:49

qqxhb · 2026-03-18 09:59:49 发布

本篇目标：让 AI 从“陪聊”进化为“API 接口”。学会如何让它吐出机器能读懂的 JSON 数据，而不是一堆废话。

一、自然语言 vs 结构化数据

自然语言（Natural Language）：

用户：帮我提取一下张三的信息。
AI：好的！张三今年 25 岁，是个程序员，住在北京。
问题：你的代码怎么解析这句话？用正则表达式？太脆弱了。如果 AI 下次换个说法（“这人叫张三…”），你的正则就挂了。
结构化数据（Structured Data）：

用户：帮我提取张三的信息，格式为 JSON。
AI：
```
{
  "name": "张三",
  "age": 25,
  "job": "programmer",
  "city": "Beijing"
}
```
优势：你的代码可以直接用 json.loads() 读取，稳定可靠，甚至可以直接存进数据库。

结论：如果你想把 AI 集成到你的程序里（做自动化），必须强制它输出结构化数据。

二、如何让 AI 稳定输出 JSON？

很多人试过让 AI 输出 JSON，但经常翻车：

它会在 JSON 外面加废话（“好的，这是你要的 JSON…”）。
它会用 Markdown 包裹（` ```json … ````），导致解析失败。
它偶尔会漏掉括号，或者字段名写错。

1. 基础版：System Prompt 约束

在 System Prompt 里明确规定：

System: You are a data extraction assistant. You MUST return ONLY valid JSON. Do not include any explanation or markdown formatting.

2. 进阶版：提供 Schema（模具）

你不能只说“给我 JSON”，你得给它一个模具（Schema）。
这就好比你想做月饼，不能只给面粉，得给个模具，压出来的才一样。

Prompt 示例：

请从下面的文本中提取用户信息。
输出必须符合以下 JSON Schema：
{
  "name": "string (姓名)",
  "age": "integer (年龄)",
  "skills": ["string (技能列表)"],
  "is_employed": "boolean (是否在职)"
}
文本：...

3. 终极版：使用 Pydantic（Python 神器）

如果你写 Python，Pydantic 是配合 AI 的绝配。它能把 Python 类自动变成 Schema 给 AI 看。

from pydantic import BaseModel
from typing import List

class UserInfo(BaseModel):
    name: str
    age: int
    skills: List[str]

# 把 UserInfo.model_json_schema() 发给 AI
# AI 就会乖乖按这个格式填空

三、实战：做一个“简历解析器”

假设你有一堆 PDF 简历（转成了文本），你想把它们存进 Excel。

❌ 失败的尝试

Prompt: “帮我看看这个简历，把名字、电话和毕业学校找出来。”
AI: “这份简历的主人叫李四，电话是 139xxxx，毕业于清华大学。”
（你需要人工复制粘贴到 Excel，累死。）

✅ 成功的尝试（JSON Mode）

Prompt:

# Role
You are a resume parser.

# Task
Extract structured data from the resume text below.

# Output Format
Return a JSON object with the following keys:
- `candidate_name`: string
- `phone_number`: string (format: 11 digits)
- `university`: string (highest degree)
- `years_of_experience`: integer

# Constraint
- If a field is missing, use null.
- Do NOT output anything other than the JSON.

Result:

{
  "candidate_name": "李四",
  "phone_number": "13900000000",
  "university": "清华大学",
  "years_of_experience": 3
}

你的 Python 脚本拿到这个 JSON，一行代码就能把它追加到 Excel 里。这就叫自动化。

四、常见坑点与修复

即便有了 JSON，AI 有时还是会犯错。

1. 尾部逗号错误

AI 经常在 JSON 列表的最后一项加逗号（在标准 JSON 里是非法的）。

对策：使用支持 loose parsing 的库（如 Python 的 json5），或者在 Prompt 里强调“Strict JSON syntax”。

2. Markdown 包裹

AI 喜欢输出：
在 Markdown 中，反引号用 ` 表示

```json
{ ... }
`` `

导致 json.loads() 报错。

对策：写个小函数清洗一下：

def clean_json_string(s):
    return s.replace("```json", "").replace("```", "").strip()

3. 字段类型错误

你要的是 age: 25（数字），它给你 age: "25"（字符串）。

对策：Schema 里必须标明类型（Integer vs String）。Pydantic 会自动帮你校验并转换。

本篇产出：通用 JSON 输出规范模板

把这个加到你的 Prompt 库里，专门对付需要数据提取的任务。

# Output Specification
1.  **Format**: The output MUST be a valid JSON object.
2.  **No Chatter**: Do not include any introductory text (e.g., "Here is the JSON") or concluding remarks.
3.  **Schema**:
    {
      "key1": "type (description)",
      "key2": ["type (description)"]
    }
4.  **Handling Missing Data**: If a field cannot be found, use `null` (do not make up data).
5.  **Escape Characters**: Ensure all strings are properly escaped (e.g., quotes inside strings).

下一步：我们学会了如何让 AI 输出结构化数据，这为通过代码调用它打下了基础。
但在 AI 眼里，文本到底是什么？为什么它能算出“猫”和“狗”很像，但和“桌子”不像？
下一章我们将深入Embedding（向量）——这是 AI 理解世界、做知识库检索（RAG）的核心基石。