Deploying MiniCPM5-1B with OpenVINO™: An On-Device 1B Reasoning Model and Hybrid Reasoning in Practice


Published by Intel Developer Zone · 2026-05-26 17:55:20

Author: Yang Yicheng (杨亦诚)

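1. Introduction: Bringing a Reasoning Model to the AI PC

OpenBMB has released MiniCPM5-1B, the first model in the MiniCPM5 series. It is a dense 1B model aimed at on-device deployment, agent tool calling, code generation, and hard reasoning. Its signature feature is Hybrid Reasoning: a single set of weights switches between a "quick answer" mode and a "think first, then answer" mode through the enable_thinking field of the chat template, so there is no need to maintain separate checkpoints for different scenarios.

This article walks through a reference deployment path: environment setup → INT4 IR export with a single optimum-cli command → inference with the openvino-genai LLMPipeline → Hybrid Reasoning in both modes. All commands and code were run on an Intel platform with Windows and Python 3.12, and output from that machine is included. The complete example is in this pull request: https://github.com/openvinotoolkit/openvino_notebooks/pull/3468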

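2. MiniCPM5-1B Highlights

2.1 Architecture: dense 1B / 24 layers / 131K context

Type: causal language model with the standard LlamaForCausalLM structure (dense, not MoE);
Parameters: 1,080,632,832 (about 1B), of which 679M are non-embedding parameters;
Layers: 24, with GQA attention (16 Q heads / 2 KV heads);
Context length: 131,072 tokens; training precision: BF16;
License: Apache-2.0, commercial use permitted.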

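2.2 Open-Source SOTA at Its Size

According to the model card, MiniCPM5-1B reaches open-source SOTA in the 1B class when compared with LFM2.5-1.2B-Thinking, Qwen3-0.6B/think, and Qwen3.5-0.8B/think. Its largest leads are in tool calling, code generation, and hard reasoning.

Post-training with RL + OPD (On-Policy Distillation) yields:

Average score across math / code / instruction following: +16 points
Share of overlong outputs that hit the max-tokens limit: −29 percentage points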

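2.3 Hybrid Reasoning: One Set of Weights, Two Thinking Modes

The MiniCPM5-1B chat template has a built-in <think> block, controlled by the enable_thinking argument passed to apply_chat_template. When it is enabled, the model reasons explicitly inside a <think>...</think> segment before answering; when it is disabled, the model skips the reasoning and answers directly. Officially recommended sampling parameters: temperature=0.9, top_p=0.95 for Think mode; temperature=0.7, top_p=0.95 for No-Think mode.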
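One way to keep these two recommendations from drifting apart in application code is a small lookup keyed by the mode flag. The sketch below is only an illustration: the names SAMPLING_PRESETS and sampling_for are hypothetical and not part of any library; only the temperature and top_p values come from the recommendations quoted above.

```python
# Sampling presets taken from the recommendations quoted above.
# The dict layout and the helper name are illustrative, not a library API.
SAMPLING_PRESETS = {
    True: {"temperature": 0.9, "top_p": 0.95},   # Think mode
    False: {"temperature": 0.7, "top_p": 0.95},  # No-Think mode
}


def sampling_for(enable_thinking: bool) -> dict:
    """Return a copy of the recommended sampling parameters for the mode."""
    return dict(SAMPLING_PRESETS[enable_thinking])


if __name__ == "__main__":
    for flag in (False, True):
        print(f"enable_thinking={flag}: {sampling_for(flag)}")
```

Returning a copy means a caller can adjust one request's parameters without silently changing the shared presets.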

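3. Why OpenVINO™ + Optimum Intel + GenAI

Optimum Intel: a single optimum-cli export openvino --task text-generation-with-past command converts a HF model to OpenVINO IR, with built-in NNCF INT4 / INT8 weight compression and an FP16 option.
openvino-genai: LLMPipeline comes with KV-cache management, a streaming streamer, and ChatHistory. It mirrors the chat_sample.py sample, so a few lines of code give you a production-grade chat experience.
All backends: the same IR model runs on Intel CPUs, Core Ultra iGPUs, and Arc discrete GPUs; only the device string changes.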

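4. End-to-End Deployment of MiniCPM5-1B

4.1 Environment Setup

Python 3.10 or later is recommended. Create and activate a dedicated venv first, then install the dependencies for this tutorial: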

python -m venv minicpm5-venv
# Windows: minicpm5-venv\Scripts\activate
# Linux / macOS: source minicpm5-venv/bin/activate

pip install -U \
    "openvino-genai" \
    "git+https://github.com/huggingface/optimum-intel.git" \
    "nncf>=3.0" \
    "transformers>=5.6" \
    "torch" \
    "accelerate" \
    "huggingface_hub"

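MiniCPM5 is a newly released model, and some of its features are fully supported only on the optimum-intel main branch, so install from git for now; you can switch back once a new PyPI release is out.

If access to HuggingFace is slow from mainland China, set the hf-mirror endpoint temporarily before running the export command: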

# Linux / macOS / Git Bash
export HF_ENDPOINT=https://hf-mirror.com

# Windows PowerShell
$env:HF_ENDPOINT = "https://hf-mirror.com"
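When the download is triggered from a Python session or notebook rather than a shell, the same variable can be set from Python. This is a minimal sketch under one assumption: it runs before huggingface_hub (or any package that imports it) is loaded, since the endpoint is read when that package is imported. The helper name use_hf_mirror is hypothetical.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Set HF_ENDPOINT unless the user already chose one; return the value in effect."""
    # setdefault keeps an endpoint that was already exported in the shell.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)


if __name__ == "__main__":
    # Call this first, before importing huggingface_hub / transformers / optimum.
    print("HF_ENDPOINT =", use_hf_mirror())
```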

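4.2 Exporting the INT4-Quantized IR in One Step

INT4 weight compression is recommended: the quantized MiniCPM5-1B is about 800 MB, small enough for a 1B dense model to run interactively on an AI PC: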

optimum-cli export openvino \
    --model openbmb/MiniCPM5-1B \
    --task text-generation-with-past \
    --weight-format int4 \
    --group-size 128 \
    --ratio 0.8 \
    MiniCPM5-1B-ov/INT4

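--task text-generation-with-past tells optimum-intel that this is a causal language model with a KV cache; --ratio 0.8 compresses 80% of the weights to INT4 and keeps the remaining 20% at higher precision, a trade-off the community has found to be reliable for LLMs; --group-size 128 sets the group size for NNCF group-wise quantization.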
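To see how --ratio and --group-size translate into file size, here is a back-of-the-envelope estimate. It is a rough sketch, not how NNCF accounts for storage: it assumes the 20% higher-precision share is stored at 8 bits, that embedding-type weights are also kept at 8 bits, and that each INT4 group carries one 16-bit scale. Zero-points and other metadata are ignored. The parameter counts are the ones quoted in section 2.1.

```python
TOTAL_PARAMS = 1_080_632_832      # total parameter count quoted in section 2.1
NON_EMBED_PARAMS = 679_000_000    # "679M" non-embedding parameters (rounded)
EMBED_PARAMS = TOTAL_PARAMS - NON_EMBED_PARAMS


def int4_bits_per_weight(group_size: int, scale_bits: int = 16) -> float:
    """4-bit value plus one scale shared by every `group_size` weights."""
    return 4 + scale_bits / group_size


def estimate_bytes(ratio: float, group_size: int) -> float:
    """Rough weight-file size under the assumptions stated in the lead-in."""
    # Assumption: the (1 - ratio) share of non-embedding weights uses 8 bits.
    mixed_bits = ratio * int4_bits_per_weight(group_size) + (1 - ratio) * 8
    # Assumption: embedding-type weights are stored at 8 bits (1 byte each).
    return EMBED_PARAMS * 1 + NON_EMBED_PARAMS * mixed_bits / 8


if __name__ == "__main__":
    est = estimate_bytes(ratio=0.8, group_size=128)
    print(f"estimated INT4 export size: ~{est / 1e6:.0f} MB")
```

Under these assumptions the estimate comes out a little under 820 MB, in the same range as the ~822 MB openvino_model.bin reported in this article. A smaller group size adds more scales per weight, and a lower ratio shifts more weights to 8 bits, so both increase the file size.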

After the export finishes, the directory looks like this:

MiniCPM5-1B-ov/INT4/
├── chat_template.jinja
├── config.json
├── generation_config.json
├── openvino_config.json
├── openvino_detokenizer.bin / .xml
├── openvino_model.bin (~822 MB) / .xml
├── openvino_tokenizer.bin / .xml
├── tokenizer.json
└── tokenizer_config.json
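Before loading the pipeline, it can help to confirm that the export actually produced these files, since an interrupted export tends to surface later as a less obvious load error. The check below is a small sketch: the file names are copied from the listing above, and the helper name missing_files is hypothetical.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "openvino_model.xml",
    "openvino_model.bin",
    "openvino_tokenizer.xml",
    "openvino_tokenizer.bin",
    "openvino_detokenizer.xml",
    "openvino_detokenizer.bin",
    "config.json",
    "generation_config.json",
]


def missing_files(model_dir: str) -> list[str]:
    """Return the expected export files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("MiniCPM5-1B-ov/INT4")
    print("export looks complete" if not missing else f"missing: {missing}")
```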

For higher accuracy at the cost of a larger model, export INT8 or FP16 instead:

# INT8
optimum-cli export openvino --model openbmb/MiniCPM5-1B \
    --task text-generation-with-past --weight-format int8 \
    MiniCPM5-1B-ov/INT8

# FP16
optimum-cli export openvino --model openbmb/MiniCPM5-1B \
    --task text-generation-with-past --weight-format fp16 \
    MiniCPM5-1B-ov/FP16

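4.3 Reproducing chat_sample.py: A Minimal Runnable Example

The code below is a minimal version of the official OpenVINO GenAI sample chat_sample.py, using the same prompt as the transformers section of the HuggingFace model card: "Who are you? Please briefly introduce yourself."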

import openvino_genai

MODEL_DIR = "MiniCPM5-1B-ov/INT4"
DEVICE = "CPU"  # "GPU" or "AUTO" also work

pipe = openvino_genai.LLMPipeline(MODEL_DIR, DEVICE)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128

def streamer(subword: str) -> openvino_genai.StreamingStatus:
    # Print each decoded chunk as it arrives and keep generating
    print(subword, end="", flush=True)
    return openvino_genai.StreamingStatus.RUNNING

# Same as chat_sample.py: ChatHistory accumulates the multi-turn context
history = openvino_genai.ChatHistory()
history.append({
    "role": "user",
    "content": "Who are you? Please briefly introduce yourself.",
})

result = pipe.generate(history, config, streamer)
print("\n----------")

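When run, the model should stream its self-introduction, and the full answer is available as result.texts[0]. Output excerpt from the test machine (INT4 + Intel CPU):

<think>Hmm, the user is asking about my identity and wants a brief introduction....</think>
I'm a model from the MiniCPM series, developed by ModelBest Inc. and the Open...

The default chat template enables thinking, which is why the output contains a <think>...</think> block. Section 5 uses enable_thinking to control this behavior explicitly.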

4.4 Multi-Turn Chat Loop (Same Structure as the Official Sample)

Wrapping 4.3 in an interactive loop gives the standard form of chat_sample.py:

import openvino_genai

pipe = openvino_genai.LLMPipeline("MiniCPM5-1B-ov/INT4", "CPU")
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 256

def streamer(subword):
    print(subword, end="", flush=True)
    return openvino_genai.StreamingStatus.RUNNING

history = openvino_genai.ChatHistory()
while True:
    try:
        prompt = input("\nquestion:\n")
    except EOFError:
        break
    history.append({"role": "user", "content": prompt})
    result = pipe.generate(history, config, streamer)
    history.append({"role": "assistant", "content": result.texts[0]})
    print("\n----------")

5. Hybrid Reasoning: Think vs. No-Think in Practice

The openvino-genai Tokenizer.apply_chat_template method accepts an extra_context parameter, which passes enable_thinking through to the model's own chat template and so switches precisely between the two modes:

import openvino_genai

pipe = openvino_genai.LLMPipeline("MiniCPM5-1B-ov/INT4", "CPU")
tokenizer = pipe.get_tokenizer()

messages = [
    {"role": "user",
     "content": "Who are you? Please briefly introduce yourself."},
]

def streamer(subword):
    print(subword, end="", flush=True)
    return openvino_genai.StreamingStatus.RUNNING

for mode, enable_thinking, T in [("NO-THINK", False, 0.7),
                                 ("THINK",    True,  0.9)]:
    print(f"=== {mode} ===")
    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        extra_context={"enable_thinking": enable_thinking},
    )

    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 256
    config.do_sample = True
    config.temperature = T
    config.top_p = 0.95
    # The prompt is already templated, so turn off the pipeline's own
    # chat templating for string prompts to avoid applying it twice.
    config.apply_chat_template = False

    pipe.generate(prompt, config, streamer)
    print()

On the test machine, the prompts produced by apply_chat_template end differently in the two modes, and that difference determines the model's behavior:

# enable_thinking=False -> an empty think block is prefilled in the assistant turn
'... <|im_start|>assistant\n<think>\n\n</think>\n\n'
# enable_thinking=True  -> the think block is left open for the model to fill in
'... <|im_start|>assistant\n<think>\n'

Output excerpt (INT4 + CPU + the sampling parameters above):

=== NO-THINK ===
<think>
</think>
I am a MiniCPM series model, developed by ModelBest and the OpenBMB community.
For more information about the project, visit https://github.com/OpenBMB/.
=== THINK ===
<think>...</think>
I am a MiniCPM series model, developed by ModelBest and the OpenBMB open-source community. ...
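Applications usually want to show only the final answer and log or hide the reasoning. The helper below is a minimal sketch of one way to separate the two parts of a generated string. The name split_thinking is hypothetical, and it assumes the output follows the <think>...</think> layout shown in the excerpts above. It also accepts text where the opening <think> tag is missing, which can happen when that tag was already part of the prompt.

```python
import re

# Optional opening tag, lazily captured reasoning, then the closing tag.
_THINK_RE = re.compile(r"\s*(?:<think>)?(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    If no closing </think> tag is found (for example, generation stopped at
    max_new_tokens), the text is returned unchanged as the answer.
    """
    match = _THINK_RE.match(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


if __name__ == "__main__":
    sample = "<think>\n\n</think>\n\nI am a MiniCPM series model."
    reasoning, answer = split_thinking(sample)
    print(f"reasoning={reasoning!r}")
    print(f"answer={answer!r}")
```

In the chat loop of section 4.4, the same function could be applied to result.texts[0] before storing the assistant turn, if you prefer not to carry the reasoning text into later turns.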

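When to use No-Think: casual Q&A, customer-service FAQs, and pure execution tasks. Latency comes first and outputs are shorter.
When to use Think: math, code, and complex planning. Quality comes first; consider raising max_new_tokens to 1024 or more and enabling streaming output so the wait feels shorter.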

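6. Performance and Accuracy Tips

INT4 weights nominally take a quarter of the bits of FP16; the exported MiniCPM5-1B INT4 file is ~820 MB, and a CPU is enough for smooth inference;
If Think mode produces reasoning that is too short or empty for some prompts, raise temperature and max_new_tokens moderately; for quantization-sensitive tasks, fall back to INT8 / FP16 and retest;
For long-context applications, remember to bound KV-cache memory usage through GenAI's scheduler_config, which is set separately from GenerationConfig.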
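To get a feel for why the KV cache matters at long context, the sketch below estimates its size for a single sequence from the architecture figures in section 2.1 (24 layers, 2 KV heads). Two inputs are assumptions not stated in this article: a head dimension of 128 and 16-bit cache entries. Treat the absolute numbers as illustrative; the 8x saving from GQA compared with 16 KV heads holds regardless of the head dimension.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 24,     # from section 2.1
    num_kv_heads: int = 2,    # from section 2.1 (GQA)
    head_dim: int = 128,      # ASSUMPTION: not stated in the article
    bytes_per_elem: int = 2,  # ASSUMPTION: 16-bit cache entries
) -> int:
    """KV-cache size for one sequence: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    full_context = 131_072
    gqa = kv_cache_bytes(full_context)
    mha = kv_cache_bytes(full_context, num_kv_heads=16)  # hypothetical non-GQA variant
    print(f"GQA (2 KV heads):  {gqa / 2**30:.1f} GiB")
    print(f"MHA (16 KV heads): {mha / 2**30:.1f} GiB")
    print(f"reduction: {mha // gqa}x")
```

Even with GQA, a full-length context under these assumptions needs a cache several times larger than the INT4 weights themselves, which is why capping the cache size is worth doing on memory-constrained machines.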

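7. Summary and Resource Links

MiniCPM5-1B brings "on-device 1B + Hybrid Reasoning" to open-source SOTA. With the OpenVINO + Optimum Intel + GenAI stack, one optimum-cli command and a few dozen lines of Python are enough to run production-grade multi-turn chat on Intel CPUs, iGPUs, and discrete GPUs, and to switch between the two thinking modes at will.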

MiniCPM5-1B model card: https://huggingface.co/openbmb/MiniCPM5-1B
OpenVINO notebook PR: https://github.com/openvinotoolkit/openvino_notebooks/pull/3468
OpenBMB/MiniCPM repository: https://github.com/OpenBMB/MiniCPM
openvino.genai chat_sample.py: https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/text_generation/chat_sample.py
Optimum Intel: https://github.com/huggingface/optimum-intel
OpenVINO: https://github.com/openvinotoolkit/openvino
OpenVINO GenAI: https://github.com/openvinotoolkit/openvino.genai
NNCF: https://github.com/openvinotoolkit/nncf

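The workflow in this article was fully verified on Windows + Python 3.12 + an Intel CPU with openvino-genai 2026.1, the optimum-intel main branch, nncf 3.1, and transformers 5.9. The exported INT4 IR is ~822 MB, and single-turn prompt inference, ChatHistory inference, and both Hybrid Reasoning modes all ran successfully.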
