大模型无法稳定输出 JSON？几个小技巧解决！

json

适用于现代 C++ 的 JSON。

项目地址：https://gitcode.com/gh_mirrors/js/json

免费下载资源

2401_85963303

1707人浏览 · 2024-07-22 14:36:35

2401_85963303 · 2024-07-22 14:36:35 发布

引子

在几乎除了聊天以外所有的程序调用场景中，我们都希望 LLM 通过某种结构化的方式来输出，便于后续程序处理。在本文中，我们采用一个推书的例子，通过几种方式由简单到复杂地让 LLM 结构化地输出结果。我采用的方法尽量不依赖某个平台或模型的特有功能，而是一些通用的方式来实现。

这个例子很简单，向 LLM 提供一个主题，然后让它推荐几本相关的书，列出其名称、作者、推荐原因以及发表年份。对应的 prompt 可以这么写：

I want you to recommend some books about {topic}.

一般来说，LLM 会给出一大段话，然后用子弹列表的形式列举（当然这个 prompt 太过简单，不一定我想要的四个字段都有）。

对于「纯文本」，程序显然是无法「稳定」解析的。我们需要让它以某种结构化的方式进行输出，例如 JSON 或者 XML。本文中，我们选择 JSON 作为「结构化」输出，我们希望 LLM 输出以下格式的内容。

{
  "items": [
    {
      "name": "1984",
      "author": "George Orwell",
      "reason": "Another classic dystopian novel that explores themes of surveillance, totalitarianism, and individuality in a future society.",
      "year_of_publish": 1949
    },
    {
      "name": "Dune",
      "author": "Frank Herbert",
      "reason": "A sprawling epic set on the desert planet of Arrakis, dealing with politics, religion, and the struggle for control of the planet's valuable spice.",
      "year_of_publish": 1965
    }
  ]
}

注：尽管列表也是一个标准的 JSON，但 OpenAI 的 JSON mode 只支持 JSON object，因此套多一层 items。

起手式：输出示例

先说最简单、通常有效的方式：在 system 中以示例的方式要求 LLM 输出对应格式。

I want you to recommend some books about {topic}.
Do NOT include anything other than a json object in your output.

Your output should look like this:
{
  "items": [
    {
      "name": "1984",
      "author": "George Orwell",
      "reason": "Another classic dystopian novel that explores themes of surveillance, totalitarianism, and individuality in a future society.",
      "year_of_publish": 1949
    },
    {
      "name": "Dune",
      "author": "Frank Herbert",
      "reason": "A sprawling epic set on the desert planet of Arrakis, dealing with politics, religion, and the struggle for control of the planet's valuable spice.",
      "year_of_publish": 1965
    }
  ]
}

划重点，Your output should look like this: 让模型以指定的格式输出。

这种方式的好处是非常通用，对任意模型都可以用，而且消耗的 token 数相对比较少（你甚至可以把长文本直接替换成 xxx）。坏处是，当结构比较复杂（例如同时存在多种类型）或者逻辑比较复杂时，或者模型抽风，就容易生成出多余的东西，无法解析到有效的 JSON。

进阶：JSON mode

针对上面模型抽风输出了无效 JSON 的场景，OpenAI 和 Claude都有 JSON mode，其中 Claude 还支持 XML。在指定输出格式后，模型会「尽力保证」输出合法的 JSON object（是的，还是有可能抽风）。

需要注意，OpenAI 的模型需要在 prompt 中包含「JSON」字样才能启用 JSON mode，否则会生成失败。我们只需稍作修改：

I want you to recommend some books about {topic}.
Do NOT include anything other than a json object in your output.

Your output should be in JSON format. For example:
(...省略示例...)

使用 JSON mode 之后，稳定性会有所提升。

组合拳： few-shot

few-shot（又称少样本提示）是指给模型提供一点示例，从而引导模型实现更好的性能。其实我们的起手式就算是一种 few-shot，但是仅使用了 system 消息。通过增加 user 和 assistant 消息，可能会让效果更好。

--- system ---
I want you to recommend some books about the given topic.
Do NOT include anything other than a json object in your output.

--- user ---
{topic}

--- assistant ---
{
  "items": [
    {
      "name": "1984",
      "author": "George Orwell",
      "reason": "Another classic dystopian novel that explores themes of surveillance, totalitarianism, and individuality in a future society.",
      "year_of_publish": 1949
    },
    {
      "name": "Dune",
      "author": "Frank Herbert",
      "reason": "A sprawling epic set on the desert planet of Arrakis, dealing with politics, religion, and the struggle for control of the planet's valuable spice.",
      "year_of_publish": 1965
    }
  ]
}

这种方式一般会比起手式更稳定，但是也可能会消耗更多的 token。

终结技： JSON schema

如果我们需要给 JSON 引入更加复杂的结构，或者要使用枚举等等，用之前的方式不一定能获得稳定的结构化输出。而 JSON 是有 schema 的，通过指定 JSON schema，我们可以实现更加复杂的结构以及使用枚举等功能。

这里我们增加一个 genre 的枚举字段用来演示。

I want you to recommend some books about {topic}.

Your output should follow the JSON schema below:
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "author": {
            "type": "string"
          },
          "reason": {
            "type": "string"
          },
          "year_of_publish": {
            "type": "number"
          },
          "genre": {
            "type": "string"
            "enum": ["SCI-FI", "NON-SCI-FI"]
          }
        },
        "required": [
          "name",
          "author",
          "reason",
          "year_of_publish",
          "genre"
        ]
      }
    }
  },
  "required": [
    "items"
  ]
}

注意，使用 JSON schema 最好同时打开 JSON mode。通过这种方式，我们不需要给出例子（如果例子不恰当，可能会带偏 LLM，出现抽风），也不需要在 prompt 中再指定某个字段的取值，另外也很方便强类型语言进行后续处理。这种方式消耗的 token 数会更多，但是稳定性更佳。

在实践中，也有人使用 TypeScript 的结构体等方式来实现类似的效果，大体的思路是一样的。

后手：修复 JSON

当生成的 JSON 真的不合法时，可以通过一些方式尝试恢复成合法的 JSON。目前有一些现成的工具，例如以下几个。基本的原理是通过BNF来解析 JSON，通过给数组或对象添加未闭合的括号、给字符串添加引号、调整空白或换行等启发式规则，尝试修复 JSON。

实战经验

可以先从最简单的方式入手，如果发现输出不稳定，再辅以其他手段
适当降低 Temperature 也有助于生成稳定的结构化输出
代码层面需要做好兼容，解析失败时可以采取重试等方法

GitHub 加速计划 / js / json

41.72 K

6.61 K

下载

适用于现代 C++ 的 JSON。

最近提交(Master分支：1 个月前 )

960b763e 4 个月前

8c391e04 7 个月前

GitCode 开源社区

旨在为数千万中国开发者提供一个无缝且高效的云端环境，以支持学习、使用和贡献开源项目。

更多推荐

[转载]在Windows环境下安装GNU Radio

转自：在Windows环境下安装GNURadio_恐弱智_新浪博客GNU Radio是用Python开发的，大部分开源的工程能够在Linux环境下运行良好，而Windows下却运行的很勉强，而且安装配置都很复杂。GNU Radio算是个例外了，不光提供了Windows的二进制安装，还有比较详细的说明。我是Python小白，所以折腾了好久才弄好，特意记录下来，免得以后再装还折腾。GNU Radio的

GitCode 开源社区

centOS 8 使用dnf安装Docker

DNF是什么？CentOS 8使用YUM软件包管理器版本v4.0.4。现在，该版本使用DNF(已删除YUM)。DNF是软件包管理器。它会在Linux发行版上安装，执行更新并删除软件包。使用DNF安装Docker跳过具有损坏依赖性的程序包一个有效的解决方案是使您的CentOS 8系统使用以下--nobest命令安装最符合条件的版本：sudo dnf install docker...

GitCode 开源社区

定时同步数据库表(mysql+linux+crontab)

sync.sh里面的参数需要改变，ip/username/password/database/tablesync.sh#!/bin/sh# Please change the IP and password of the data source db.# Then change the table name.filename=/home/nington/db/$(date +%Y-%m