Ascend CANN cann-recipes-harmony-infer: A Complete Guide to On-Device Inference Deployment on HarmonyOS

雨季666 · Published 2026-05-22 11:54:19

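Running AI models on HarmonyOS devices such as phones, tablets, and watches is a different world from running them on data-center servers. cann-recipes-harmony-infer is the CANN community's recipe repository for on-device inference on HarmonyOS: it compresses large models to a size a phone can run and keeps accuracy usable within the limited NPU compute and memory.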

The Fundamental Differences Between On-Device and Cloud Inference

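Dimension	Cloud NPU (Atlas 900)	On-device NPU (phone SoC)
Compute	256+ TFLOPS (FP16)	4-8 TFLOPS (FP16)
Memory	64-128 GB HBM	4-8 GB LPDDR
Power	300W+	<5W
Batch size	8-64	1 (real-time requirement)
Model size	Unlimited (sharded across cards)	<500MB (app package limit)
Latency	Not sensitive	<100ms (user experience)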

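The core tension of on-device inference: a larger model gives better accuracy, but a device has far less compute and memory, so a large model will not run. cann-recipes-harmony-infer exists to resolve this tension.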

The HarmonyOS On-Device Inference Pipeline

┌────────────────────────────────────────────────────────┐
│ Cloud training (PyTorch / MindSpore)                   │
│ Large model → high-precision weights                   │
└──────────────────┬─────────────────────────────────────┘
                   ↓
┌────────────────────────────────────────────────────────┐
│ Model compression (amct + CANN toolchain)              │
│ Quantization (INT4/INT8) + pruning + distillation      │
│ 7B model → 500MB                                       │
└──────────────────┬─────────────────────────────────────┘
                   ↓
┌────────────────────────────────────────────────────────┐
│ Offline conversion (MindSpore Lite converter → .ms)    │
│ .onnx / .mindir → .ms (HarmonyOS on-device format)     │
└──────────────────┬─────────────────────────────────────┘
                   ↓
┌────────────────────────────────────────────────────────┐
│ On-device inference on HarmonyOS (HiAI Engine)         │
│ Load model → preprocess → NPU inference → postprocess  │
└────────────────────────────────────────────────────────┘

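The model format on HarmonyOS devices is .ms (MindSpore Lite), not the cloud-side .om (Offline Model). The .om file is produced by the CANN ATC compiler, while the .ms file is produced by the MindSpore Lite conversion tool. According to the author, .ms carries extra on-device optimizations: more aggressive graph fusion, leaner operator implementations, and a more compact memory layout.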

INT4 Quantization: Fitting a 7B Model into a Phone

INT8 quantization is usually enough for cloud inference. On-device inference needs INT4, because model size is a hard constraint.

// cann-recipes-harmony-infer/quantization/int4_quant.cpp

#include <algorithm>
#include <cmath>
#include <cstdint>

// INT4 quantization: each weight is stored in 4 bits (16 levels)
// vs FP32 (32 bit): 8x compression
// vs INT8 (8 bit): 2x compression

void QuantizeToInt4(
    const float* weights,      // FP32 weights, [rows, cols] row-major
    int8_t* quant_weights,     // INT4 weights, two packed per byte
    float* scales,             // one scale per group
    int8_t* zero_points,       // one zero_point per group
    int rows, int cols,        // cols must be even and divisible by group_size
    int group_size             // group size (e.g. 32)
) {
    int num_groups = cols / group_size;

    for (int r = 0; r < rows; r++) {
        for (int g = 0; g < num_groups; g++) {
            const float* group = weights + r * cols + g * group_size;

            // Find the min/max of the current group. Starting from 0 keeps 0.0
            // inside the range, so the zero_point always fits in [-8, 7].
            float w_min = 0.0f, w_max = 0.0f;
            for (int c = 0; c < group_size; c++) {
                w_min = std::min(w_min, group[c]);
                w_max = std::max(w_max, group[c]);
            }

            // Signed INT4 range: [-8, 7], i.e. 15 steps between min and max
            float scale = (w_max - w_min) / 15.0f;
            if (scale == 0.0f) scale = 1.0f;   // all-zero group: avoid dividing by zero
            int zp = std::clamp((int)std::lround(-w_min / scale) - 8, -8, 7);
            scales[r * num_groups + g] = scale;
            zero_points[r * num_groups + g] = (int8_t)zp;

            // Quantize with the same zero_point that is stored
            for (int c = 0; c < group_size; c++) {
                int q = std::clamp((int)std::lround(group[c] / scale) + zp, -8, 7);
                uint8_t nibble = (uint8_t)q & 0x0F;

                // Pack two INT4 values into one byte
                int idx = r * cols + g * group_size + c;
                if (idx % 2 == 0) {
                    // High 4 bits (this also initializes the byte)
                    quant_weights[idx / 2] = (int8_t)(nibble << 4);
                } else {
                    // Low 4 bits
                    quant_weights[idx / 2] |= (int8_t)nibble;
                }
            }
        }
    }
}
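As a sanity check on the scheme above, here is a minimal sketch that quantizes a single group and dequantizes it again with (q - zero_point) * scale. QuantGroup, QuantizeGroup, DequantizeValue, and MaxRoundTripError are hypothetical helper names for illustration, not part of the repository. The sketch assumes that 0.0 is kept inside the quantization range so that the zero point stays in [-8, 7]; under that assumption the round-trip error of every weight should stay within about half a quantization step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one quantized group (illustration only).
struct QuantGroup {
    float scale;
    int zero_point;            // in [-8, 7]
    std::vector<int8_t> q;     // one signed INT4 value per weight, left unpacked
};

// Asymmetric INT4 quantization of one group; 0.0 is kept inside the range.
QuantGroup QuantizeGroup(const std::vector<float>& w) {
    assert(!w.empty());
    float w_min = 0.0f, w_max = 0.0f;
    for (float v : w) {
        w_min = std::min(w_min, v);
        w_max = std::max(w_max, v);
    }
    QuantGroup out;
    out.scale = (w_max - w_min) / 15.0f;
    if (out.scale == 0.0f) out.scale = 1.0f;   // all-zero group
    out.zero_point = std::clamp(
        static_cast<int>(std::lround(-w_min / out.scale)) - 8, -8, 7);
    for (float v : w) {
        int q = static_cast<int>(std::lround(v / out.scale)) + out.zero_point;
        out.q.push_back(static_cast<int8_t>(std::clamp(q, -8, 7)));
    }
    return out;
}

// Dequantize element i of the group.
float DequantizeValue(const QuantGroup& g, std::size_t i) {
    return static_cast<float>(g.q[i] - g.zero_point) * g.scale;
}

// Largest absolute round-trip error over the group.
float MaxRoundTripError(const std::vector<float>& w) {
    QuantGroup g = QuantizeGroup(w);
    float err = 0.0f;
    for (std::size_t i = 0; i < w.size(); i++) {
        err = std::max(err, std::fabs(w[i] - DequantizeValue(g, i)));
    }
    return err;
}
```

Comparing MaxRoundTripError against half of the group's scale for a few representative weight groups is a cheap way to catch a mismatch between the zero point used for quantization and the one that gets stored.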

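The key parameter of INT4 quantization is group_size: a smaller group gives higher accuracy (finer-grained scales) but a larger scale array (extra storage). group_size=32 is a common empirical choice, for which the author reports an accuracy loss below 1%. With an FP32 scale and an INT8 zero_point per group, each group of 32 weights carries 5 bytes of metadata: about 4% of the original 128 bytes of FP32 weights, but about 31% on top of the 16 bytes of packed INT4 payload.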
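The storage cost of group_size can be worked out directly. The sketch below is only arithmetic, under the assumption that each group stores one scale and one zero point of the given byte widths; OverheadVsInt4 and OverheadVsFp32 are hypothetical helper names. It shows how the same metadata looks small against the original FP32 weights and much larger against the packed INT4 payload.

```cpp
#include <cassert>

// Per-group metadata relative to the packed INT4 payload.
// scale_bytes and zp_bytes are assumptions about the storage types.
double OverheadVsInt4(int group_size, int scale_bytes, int zp_bytes) {
    assert(group_size > 0);
    double payload_bytes = group_size * 0.5;     // 4 bits per weight
    return (scale_bytes + zp_bytes) / payload_bytes;
}

// The same metadata relative to the original FP32 weights.
double OverheadVsFp32(int group_size, int scale_bytes, int zp_bytes) {
    assert(group_size > 0);
    double original_bytes = group_size * 4.0;    // 32 bits per weight
    return (scale_bytes + zp_bytes) / original_bytes;
}
```

For group_size=32 with a 4-byte scale and a 1-byte zero point, this gives 5/16 relative to the INT4 payload and 5/128 relative to FP32. Storing the scale in FP16 or enlarging the group shrinks both ratios, at the accuracy cost described above.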

Matrix Multiplication on the On-Device NPU: Dedicated INT4 Acceleration

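The Cube unit of the on-device NPU has hardware acceleration for INT4 matrix multiplication: unpacking two INT4 weights and computing with them happens within a single multiply-accumulate operation: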

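// cann-recipes-harmony-infer/kernels/int4_matmul.cpp

// Scalar reference for the math of the INT4 matmul. Element access on
// LocalTensor is written as plain indexing for readability.
__aicore__ void Int4MatMul(
    LocalTensor<float>& output,        // [M, N]
    LocalTensor<int8_t>& weight_int4,  // [K, N/2] (packed INT4)
    LocalTensor<float>& input,         // [M, K]
    LocalTensor<float>& scales,        // [K/group_size, N] (per-group scale)
    LocalTensor<int8_t>& zps,          // [K/group_size, N] (per-group zero_point)
    int M, int K, int N, int group_size
) {
    // INT4 matmul flow on hardware:
    // 1. Unpack INT4 → INT8 (done by hardware)
    // 2. Dequantize INT8 → INT16 (subtract zero_point, multiply by scale)
    // 3. INT16 matmul (Cube unit)
    // 4. Accumulate into INT32
    // 5. Convert back to FP32 for output
    // The loop below is a scalar reference that dequantizes and accumulates
    // in FP32, so nothing is lost to integer truncation.
    // Note: groups run along K for each output column n here, so the weights
    // must have been quantized in that layout.

    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;

            for (int k = 0; k < K; k++) {
                // Unpack: extract one of the two INT4 values from the byte
                int8_t packed = weight_int4[k * (N/2) + n/2];
                int q;
                if (n % 2 == 0) {
                    q = (packed >> 4) & 0x0F;     // high 4 bits
                } else {
                    q = packed & 0x0F;            // low 4 bits
                }
                if (q > 7) q -= 16;               // sign-extend the 4-bit value

                // Dequantize
                int g = k / group_size;
                float scale_val = scales[g * N + n];
                int8_t zp_val = zps[g * N + n];
                float dequant = (float(q) - float(zp_val)) * scale_val;

                // Accumulate in FP32
                acc += dequant * input[m * K + k];
            }

            output[m * N + n] = acc;
        }
    }
}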

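INT4 acceleration on the on-device Cube unit: in one clock cycle it can process twice as many elements as with INT8 (4 bit vs 8 bit), so theoretical throughput doubles, provided the operator unpacks the INT4 bit layout correctly.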
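Because correctness depends on the bit layout, the packing convention is worth testing in isolation. The following is a minimal sketch of the "high nibble first" convention used in this article; PackInt4, UnpackHigh, UnpackLow, and SignExtend4 are hypothetical helper names, not repository APIs.

```cpp
#include <cassert>
#include <cstdint>

// Pack two signed INT4 values (each in [-8, 7]) into one byte:
// the first value goes to the high nibble, the second to the low nibble.
uint8_t PackInt4(int hi, int lo) {
    assert(hi >= -8 && hi <= 7 && lo >= -8 && lo <= 7);
    return static_cast<uint8_t>(((hi & 0x0F) << 4) | (lo & 0x0F));
}

// Sign-extend a 4-bit two's-complement nibble (0..15) to an int in [-8, 7].
int SignExtend4(int nibble) {
    return nibble > 7 ? nibble - 16 : nibble;
}

// Recover the value stored in the high nibble.
int UnpackHigh(uint8_t packed) {
    return SignExtend4((packed >> 4) & 0x0F);
}

// Recover the value stored in the low nibble.
int UnpackLow(uint8_t packed) {
    return SignExtend4(packed & 0x0F);
}
```

Holding the packed byte as uint8_t avoids any dependence on how a negative int8_t is right-shifted. A round-trip test over all 256 (hi, lo) pairs covers the whole value space.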

An On-Device Optimization: More Aggressive Graph Fusion

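Kernel launch overhead is higher on HarmonyOS devices than in the cloud: the on-device NPU runs at a lower clock frequency, and every kernel launch goes through operating-system scheduling. The on-device graph fusion strategy is therefore more aggressive than the cloud one: fuse everything that can be fused.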

// Cloud: LayerNorm, BiasAdd, and GELU each run as a separate kernel
// Device: LayerNorm + BiasAdd + GELU fused into one kernel

// Cloud graph:
// input → LayerNorm → Add(Bias) → GELU → output
// (3 kernel launches)

// Device graph:
// input → FusedNormBiasGELU → output
// (1 kernel launch)

__aicore__ void FusedNormBiasGELU(
    LocalTensor<float>& output,
    LocalTensor<float>& input,
    LocalTensor<float>& bias,
    LocalTensor<float>& gamma,   // LN scale
    LocalTensor<float>& beta,    // LN shift
    int size
) {
    // LayerNorm step 1: compute the mean
    float sum = 0.0f;
    for (int i = 0; i < size; i++) sum += input[i];
    float mean = sum / size;

    // LayerNorm step 2: compute the variance
    float var_sum = 0.0f;
    for (int i = 0; i < size; i++) {
        float diff = input[i] - mean;
        var_sum += diff * diff;
    }
    float inv_std = 1.0f / sqrtf(var_sum / size + 1e-5f);

    // LayerNorm + BiasAdd + GELU in a single pass
    for (int i = 0; i < size; i++) {
        float normed = (input[i] - mean) * inv_std;
        float scaled = normed * gamma[i] + beta[i];
        float biased = scaled + bias[i];
        // tanh approximation: GELU(x) ≈ 0.5 × x × (1 + tanh(0.797884 × (x + 0.044715 × x³)))
        float x3 = biased * biased * biased;
        output[i] = 0.5f * biased * (1.0f + tanhf(0.797884f * (biased + 0.044715f * x3)));
    }
}

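One kernel replaces three. On a device, cutting three kernel launches down to one may matter more than optimizing the computation itself (according to the author, launch overhead accounts for 30-40% of on-device latency).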
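Fusion is only valid if the fused kernel computes the same values as the separate passes. The sketch below is a host-side check on plain std::vector buffers rather than NPU code; all function names in it are hypothetical. It runs LayerNorm, BiasAdd, and GELU as three passes with intermediate buffers, runs a fused single pass, and measures the difference.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

struct Stats { float mean; float inv_std; };

// Mean and 1/sqrt(var + eps) over the whole vector.
Stats ComputeStats(const Vec& x) {
    assert(!x.empty());
    const float n = static_cast<float>(x.size());
    float sum = 0.0f;
    for (float v : x) sum += v;
    const float mean = sum / n;
    float var_sum = 0.0f;
    for (float v : x) var_sum += (v - mean) * (v - mean);
    return {mean, 1.0f / std::sqrt(var_sum / n + 1e-5f)};
}

// tanh approximation of GELU, same constants as in the article.
float GeluTanh(float x) {
    return 0.5f * x * (1.0f + std::tanh(0.797884f * (x + 0.044715f * x * x * x)));
}

// Unfused reference: three passes, each producing an intermediate buffer.
Vec LayerNormPass(const Vec& x, const Vec& gamma, const Vec& beta) {
    const Stats s = ComputeStats(x);
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        out[i] = (x[i] - s.mean) * s.inv_std * gamma[i] + beta[i];
    return out;
}

Vec BiasAddPass(const Vec& x, const Vec& bias) {
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); i++) out[i] = x[i] + bias[i];
    return out;
}

Vec GeluPass(const Vec& x) {
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); i++) out[i] = GeluTanh(x[i]);
    return out;
}

// Fused version: same statistics, then one pass with no intermediate buffers.
Vec FusedNormBiasGelu(const Vec& x, const Vec& bias,
                      const Vec& gamma, const Vec& beta) {
    const Stats s = ComputeStats(x);
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); i++) {
        const float v = (x[i] - s.mean) * s.inv_std * gamma[i] + beta[i] + bias[i];
        out[i] = GeluTanh(v);
    }
    return out;
}

// Largest element-wise absolute difference between two equally sized vectors.
float MaxAbsDiff(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); i++)
        d = std::max(d, std::fabs(a[i] - b[i]));
    return d;
}
```

A small tolerance is used instead of exact equality because a compiler may contract multiply-add sequences differently in the two versions.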

Pitfall 1: Accuracy Collapse After INT4 Quantization of a Large Model

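INT8 quantization of a large model usually costs less than 1% accuracy. INT4 quantization can cost 5-10%, because some layers are especially sensitive to quantization.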

The wrong strategy: uniform INT4 quantization across the whole model.

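// All linear layers uniformly quantized to INT4
// qkv_proj (attention input projection): 0.3% accuracy loss (acceptable)
// o_proj (attention output projection): 0.5% accuracy loss (acceptable)
// mlp.up_proj + down_proj: 1.2% accuracy loss (barely acceptable)
// mlp.gate_proj (gating): 8.5% accuracy loss (unacceptable!)
//
// The gate_proj output passes through the activation and multiplicatively
// gates the up_proj output, element by element
// The 16 levels of INT4 cannot resolve fine differences in the gate values
// → many activations are wrongly suppressed or passed → generation quality collapses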

The right strategy: mixed-precision quantization, keeping sensitive layers in INT8 or FP16.

QuantConfig config;
config.default_dtype = "int4";
config.keep_fp16_layers = {"gate_proj", "lm_head"};
config.keep_int8_layers = {"q_proj", "v_proj"};  // attention projections kept at INT8 as a safety margin

// Model size comparison (weights only):
// All INT8:        7B × 1 byte = 7 GB (does not fit in a phone)
// All INT4:        7B × 0.5 byte = 3.5 GB (fits, but accuracy is poor)
// Mixed precision: 6B × 0.5 + 1B × 1 = 4 GB (fits, accuracy is good)
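The size comparison above is a weighted sum of parameter counts by bytes per parameter. A minimal sketch of that arithmetic follows; LayerGroup and WeightBytes are hypothetical names. It counts weights only and ignores group scales, zero points, and activation memory, so real files will be somewhat larger.

```cpp
#include <cassert>
#include <vector>

// A set of layers that share one storage precision (illustrative type).
struct LayerGroup {
    double params;            // number of parameters
    double bytes_per_param;   // 0.5 for INT4, 1 for INT8, 2 for FP16
};

// Total weight size in bytes for a mixed-precision plan.
double WeightBytes(const std::vector<LayerGroup>& groups) {
    double total = 0.0;
    for (const LayerGroup& g : groups) {
        assert(g.params >= 0.0 && g.bytes_per_param > 0.0);
        total += g.params * g.bytes_per_param;
    }
    return total;
}
```

With 6B parameters at 0.5 bytes and 1B at 1 byte, this reproduces the 4 GB figure. Moving part of that 1B to FP16 (2 bytes per parameter) raises the total accordingly, which is worth checking before choosing which layers to keep at higher precision.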

Pitfall 2: The Memory Layout of .ms Models Is Incompatible with .om

The same PyTorch model, compiled separately into .om (cloud) and .ms (device), ends up with different weight memory layouts:

.om: weights in row-major order, aligned to 32 bytes
.ms: weights in NC/1HWC or NCHW order (depending on the operator type), aligned to 16 bytes

Wrong: copying the cloud weight file to the device and loading it directly.

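// Cloud inference works correctly
// The device loads the same weights → the output is garbage
// Root cause: the weight data was not rearranged
// Linear weight in .om: [out_features, in_features], row-major
// Linear weight in .ms: [in_features, out_features], i.e. transposed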

Right: compile and deploy separately for each target.

# Cloud: PyTorch → .onnx → .om (writes model.om; set <soc_version> to the target chip)
atc --model=model.onnx --framework=5 --output=model --soc_version=<soc_version>

# Device: PyTorch → .onnx → .ms (writes model.ms)
converter_lite --fmk=ONNX \
    --modelFile=model.onnx \
    --outputFile=model \
    --optimize=ascend_oriented
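To see why a raw copy of the weights produces garbage, consider the Linear layout difference described above. The sketch below is an illustration on plain buffers; TransposeWeight, MatVecOutIn, and MatVecInOut are hypothetical names. The real conversion is done by the tools, not by hand. It shows that a buffer written as [out_features, in_features] yields the right result under the other convention only after an explicit transpose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Convert a Linear weight from [out, in] row-major to [in, out] row-major.
Vec TransposeWeight(const Vec& w, int out_features, int in_features) {
    assert(w.size() == static_cast<std::size_t>(out_features) * in_features);
    Vec t(w.size());
    for (int o = 0; o < out_features; o++)
        for (int i = 0; i < in_features; i++)
            t[i * out_features + o] = w[o * in_features + i];
    return t;
}

// y = W x, reading W as [out, in] row-major.
Vec MatVecOutIn(const Vec& w, const Vec& x, int out_features, int in_features) {
    Vec y(out_features, 0.0f);
    for (int o = 0; o < out_features; o++)
        for (int i = 0; i < in_features; i++)
            y[o] += w[o * in_features + i] * x[i];
    return y;
}

// y = W x, reading W as [in, out] row-major.
Vec MatVecInOut(const Vec& w, const Vec& x, int out_features, int in_features) {
    Vec y(out_features, 0.0f);
    for (int i = 0; i < in_features; i++)
        for (int o = 0; o < out_features; o++)
            y[o] += w[i * out_features + o] * x[i];
    return y;
}
```

Feeding the untransposed buffer to MatVecInOut does not crash; it silently computes a different linear map, which matches the "garbage output" symptom.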

Pitfall 3: First-Inference Latency on Device (Cold Start)

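The first time a phone loads a model, the model has to be read from disk into memory, the graph structure parsed, and the NPU initialized, so cold-start latency can exceed 3 seconds. A user who opens an "AI assistant" app and waits 3 seconds for a response has a poor experience.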

Optimization: model preloading plus operator warm-up.

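// Model preloading API on HarmonyOS devices
// Load the model in the background at app startup (do not wait for the user to tap the inference button)

#include <string>
#include "hiai_ir_build.h"

// Stage 1: at app startup (call from a background thread)
void AppInit() {
    // Read the .ms model from disk into memory
    // NPU initialization completes in the background
    hiai::ModelManager::PreloadModel("assistant_model.ms");
}

// Stage 2: when the user taps "Send"
void OnUserSend(const std::string& prompt) {
    // The model is already loaded (the cold start is avoided)
    // The first inference may still be slow (NPU cache misses)
    // input_tensor / output_tensor are assumed to be prepared from prompt elsewhere
    auto* model = hiai::ModelManager::GetModel("assistant_model.ms");
    model->Infer(input_tensor, output_tensor);
}

// Stage 3: operator warm-up (optional)
// Run one inference with dummy data during app initialization
void WarmupModel() {
    Tensor dummy_in = Tensor::Zeros({1, 512});
    Tensor dummy_out;   // separate output buffer, so input and output do not alias
    hiai::ModelManager::GetModel("assistant_model.ms")->Infer(dummy_in, dummy_out);
    // After warm-up, one-time setup costs are paid and the NPU caches are warm
    // Later real inferences have stable latency
}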