CANN Accuracy Tuning: INT8 Quantization Error Analysis and Mixed-Precision Strategy in Practice

晚霞的不甘 · Published 2026-05-23 19:52:03

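1. Where Quantization Error Comes From

1.1 The Basic Quantization Process

Mapping FP32 weights to INT8 works as follows:

original value (FP32) → scale → round → quantized value (INT8)

Core formulas: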

scale = (max - min) / 255
zero_point = round(-min / scale)
quantized = clamp(round(x / scale) + zero_point, 0, 255)
dequantized = (quantized - zero_point) * scale

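The error comes from two sources:

Rounding error: round forces a floating-point number onto an integer. For example, 3.7 becomes 4 and 0.3 becomes 0, so every value picks up a small deviation of at most half a quantization step (scale / 2 after dequantization).

Range clipping: a value outside the representable range is truncated by clamp. The formulas above use the unsigned range 0 to 255; signed INT8 uses -128 to 127. Unlike rounding error, clipping error has no fixed bound: the further a value lies outside the calibrated [min, max] range, the larger the gap between the clipped value and the original.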
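
The two error sources can be seen in a small round trip. The sketch below is a minimal NumPy illustration of the formulas in section 1.1 (with scale computed from max - min); the helper names `quantize_uint8` and `dequantize` and the calibration range [-1, 1] are assumptions made for this example, not part of any CANN API.

```python
import numpy as np

def quantize_uint8(x, x_min, x_max):
    """Asymmetric quantization to the unsigned range [0, 255]."""
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized integers back to floating point."""
    return (q.astype(np.float32) - zero_point) * scale

# Calibration range is [-1, 1]; the last value (2.5) lies outside it on purpose.
x = np.array([-1.0, -0.5, 0.0, 0.3, 1.0, 2.5], dtype=np.float32)
q, scale, zp = quantize_uint8(x, x_min=-1.0, x_max=1.0)
x_hat = dequantize(q, scale, zp)
err = np.abs(x_hat - x)

in_range = (x >= -1.0) & (x <= 1.0)
print("scale:", scale, "zero_point:", zp)
print("max rounding error (in range):", err[in_range].max())
print("clipping error for 2.5:", err[~in_range].max())
```

For in-range values the error stays within about half a step, while the out-of-range value loses most of its magnitude, which is why the choice of min and max matters more than the rounding itself.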

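1.2 How Errors Accumulate

The quantization error of a single value is small (typically < 0.1%), but a deep network has millions of parameters, and the error accumulates layer by layer:

input error → amplified by layer 1 → amplified again by layer 2 → ... → output deviation

Taking ResNet-50 as an example: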

Layer	Parameters	Mean quantization error	Effect on output
conv1	9.4K	0.02%	Negligible
layer1	215K	0.05%	Negligible
layer2	1.2M	0.08%	Slight
layer3	7.1M	0.12%	Noticeable
layer4	15.0M	0.15%	Significant

The error in the last stage affects the output the most, because it sits closest to the output and no later layers "dilute" it.
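
To get a feel for how per-layer error builds up with depth, here is a rough sketch on a toy network rather than ResNet-50. It assumes a bias-free ReLU MLP with random weights and min-max fake quantization of every weight matrix; `fake_quantize` and `forward` are hypothetical helpers, and the numbers it prints say nothing about any real model.

```python
import numpy as np

def fake_quantize(w, n_levels=255):
    """Quantize with a min-max range, then map back to floating point."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / n_levels
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, n_levels)
    return (q - zero_point) * scale

def forward(x, weights):
    """Run a bias-free ReLU MLP and return the activations after every layer."""
    activations = []
    for w in weights:
        x = np.maximum(x @ w, 0.0)
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
depth, width = 8, 256
# He-style initialization keeps activation magnitudes roughly stable with depth.
weights = [rng.normal(0, np.sqrt(2.0 / width), (width, width)) for _ in range(depth)]
x = rng.normal(size=(32, width))

ref = forward(x, weights)                                 # full-precision reference
quant = forward(x, [fake_quantize(w) for w in weights])   # all layers quantized

rel_err = [np.linalg.norm(q - r) / np.linalg.norm(r) for q, r in zip(quant, ref)]
for i, e in enumerate(rel_err, 1):
    print(f"after layer {i}: relative error = {e:.4%}")
```

Each layer adds its own weight error on top of the already perturbed activations it receives, so the relative error measured at deeper layers tends to be larger than at the first layer.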

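2. Layer-by-Layer Sensitivity Analysis

2.1 Why Sensitivity Analysis Is Needed

Layers differ in how sensitive they are to quantization. Some lose almost no accuracy when quantized, while others collapse. The core idea of mixed-precision quantization is to find the sensitive layers, keep them at higher precision, and use INT8 for the rest.

2.2 Evaluating Sensitivity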

import torch
import torch.nn as nn
import numpy as np


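class LayerSensitivityAnalyzer:
    """Layer-by-layer quantization sensitivity analyzer.

    How it works:
    1. Quantize one layer at a time while every other layer stays in FP32
    2. Measure the accuracy change: the larger the drop, the more sensitive the layer
    3. Rank the layers by sensitivity to decide which ones should stay in FP16

    Why "layer by layer" rather than "all at once"?
    When every layer is quantized together, their errors interact and the
    contribution of a single layer cannot be isolated. Quantizing one layer
    at a time measures each layer's independent effect.

    Metrics:
    - Accuracy drop: difference in Top-1 accuracy before and after quantization
    - Output distance: cosine similarity between the outputs before and after
    - Gradient sensitivity: gradient of the loss with respect to quantization noise
    (Only the accuracy drop is implemented in this class.)
    """

    def __init__(self, model, val_loader, device='npu'):
        # device='npu' assumes an Ascend environment with torch_npu imported;
        # pass 'cpu' or 'cuda' to run elsewhere.
        self.model = model.to(device)
        self.val_loader = val_loader
        self.device = device
        self.layer_results = {}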

    def measure_baseline(self):
        """Measure the FP32 baseline accuracy."""
        self.model.eval()
        correct = 0
        total = 0

        with torch.no_grad():
            for data, target in self.val_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = self.model(data)
                _, predicted = output.max(1)
                correct += predicted.eq(target).sum().item()
                total += target.size(0)

        self.baseline_acc = 100.0 * correct / total
        print(f"FP32 Baseline Accuracy: {self.baseline_acc:.2f}%")
        return self.baseline_acc

    def analyze_layer(self, layer_name, layer_module):
        """Analyze the quantization sensitivity of a single layer.

        Replaces the target layer's weights with fake-quantized
        (quantize-then-dequantize) values and measures the accuracy change.
        The larger the drop, the more sensitive the layer.
        """
        # Back up the original weights
        original_weight = layer_module.weight.data.clone()

        try:
            # Quantize this layer's weights
            layer_module.weight.data = self._quantize_weight(original_weight)

            # Measure accuracy with the quantized layer
            correct = 0
            total = 0

            with torch.no_grad():
                for data, target in self.val_loader:
                    data, target = data.to(self.device), target.to(self.device)
                    output = self.model(data)
                    _, predicted = output.max(1)
                    correct += predicted.eq(target).sum().item()
                    total += target.size(0)

            quantized_acc = 100.0 * correct / total
        finally:
            # Restore the original weights, even if evaluation fails
            layer_module.weight.data = original_weight

        acc_drop = self.baseline_acc - quantized_acc

        # Record the result
        self.layer_results[layer_name] = {
            'accuracy': quantized_acc,
            'drop': acc_drop,
            'param_count': layer_module.weight.numel(),
        }

        print(f"  {layer_name}: acc={quantized_acc:.2f}%, drop={acc_drop:.2f}%")
        return acc_drop

    def analyze_all(self):
        """Analyze every convolution and linear layer."""
        self.measure_baseline()

        print("\nLayer-by-layer quantization sensitivity:")
        print("-" * 60)

        for name, module in self.model.named_modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                self.analyze_layer(name, module)

        # Sort by sensitivity, most sensitive first
        sorted_layers = sorted(
            self.layer_results.items(),
            key=lambda x: x[1]['drop'],
            reverse=True,
        )

        print("\nSensitivity ranking (high to low):")
        print("-" * 60)
        for rank, (name, result) in enumerate(sorted_layers, 1):
            print(f"  {rank}. {name}: drop={result['drop']:.2f}%, "
                  f"params={result['param_count']}")

        # Returns a list of (name, result) tuples, not a dict
        return sorted_layers

    def _quantize_weight(self, weight, bits=8):
        """Simulate INT8 quantization (quantize, then dequantize)."""
        n_levels = 2 ** bits - 1
        w_min = weight.min()
        w_max = weight.max()
        # Guard against a zero scale when all weights are equal
        scale = torch.clamp(w_max - w_min, min=1e-12) / n_levels
        zero_point = torch.round(-w_min / scale)

        w_quant = torch.round(weight / scale) + zero_point
        w_quant = torch.clamp(w_quant, 0, n_levels)
        w_dequant = (w_quant - zero_point) * scale

        return w_dequant
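
The class docstring lists cosine similarity as a second metric, but the analyzer only measures the accuracy drop. The sketch below shows one way the cosine-based variant could look. It needs no labels, only a batch of inputs. `cosine_sensitivity`, `fake_quantize`, and the toy model are assumptions for illustration, and the example runs on CPU with 4-bit quantization so that the differences are visible on such a small network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(weight, bits=8):
    """Min-max quantize, then dequantize (same scheme as _quantize_weight)."""
    n_levels = 2 ** bits - 1
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min).clamp_min(1e-12) / n_levels
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(weight / scale) + zero_point, 0, n_levels)
    return (q - zero_point) * scale

@torch.no_grad()
def cosine_sensitivity(model, batch, bits=8):
    """Return {layer_name: 1 - cosine similarity}, quantizing one layer at a time."""
    model.eval()
    reference = model(batch).flatten(1)
    scores = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        original = module.weight.data.clone()
        try:
            module.weight.data = fake_quantize(original, bits)
            output = model(batch).flatten(1)
        finally:
            module.weight.data = original  # always restore the FP32 weights
        cos = F.cosine_similarity(reference, output, dim=1).mean().item()
        scores[name] = 1.0 - cos
    return scores

torch.manual_seed(0)
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),
)
scores = cosine_sensitivity(toy_model, torch.randn(16, 3, 32, 32), bits=4)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: 1 - cos = {score:.6f}")
```

A score near 0 means the quantized layer barely changes the output direction; larger scores point to layers that are candidates for higher precision.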

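2.3 Interpreting the Sensitivity Results

def interpret_sensitivity(results, threshold=0.5):
    """Interpret the sensitivity analysis results.

    Args:
        results: sensitivity results, either a dict {name: result} or the
            list of (name, result) tuples returned by analyze_all()
        threshold: accuracy-drop threshold in percentage points; layers
            above it are treated as sensitive

    Layer assignment:
    - drop > threshold: keep FP16 (sensitive layer)
    - drop <= threshold: can be quantized to INT8 (non-sensitive layer)
    """
    # analyze_all() returns a sorted list of tuples; normalize it to a dict
    results = dict(results)

    sensitive_layers = []
    quantizable_layers = []

    for name, result in results.items():
        if result['drop'] > threshold:
            sensitive_layers.append(name)
        else:
            quantizable_layers.append(name)

    print(f"\nSensitive layers (keep FP16): {len(sensitive_layers)}")
    for name in sensitive_layers:
        print(f"  - {name} (drop={results[name]['drop']:.2f}%)")

    print(f"\nQuantizable layers (INT8): {len(quantizable_layers)}")
    for name in quantizable_layers:
        print(f"  - {name} (drop={results[name]['drop']:.2f}%)")

    return sensitive_layers, quantizable_layers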

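3. Mixed-Precision Quantization

3.1 Mixed-Precision Strategy

Core idea: not every layer uses INT8; sensitive layers stay in FP16.

Per-layer strategy:

Layer type	Quantization strategy	Reason
First convolution	FP16	Directly touches the input, so its error affects every later layer
Last convolution	FP16	Closest to the output, where accumulated error is largest
Intermediate residual blocks	INT8	Skip connections dilute the error
Fully connected layers	INT8	Many parameters, so quantization pays off most
BatchNorm	Not quantized	Few parameters, so quantization brings no benefit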
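
When no sensitivity measurements are available yet, the table can serve as a starting heuristic. The sketch below turns its rules into a per-layer plan. `build_precision_plan` is a hypothetical helper, and it assumes that the order of `named_modules()` matches the execution order, which holds for simple sequential models but is not guaranteed in general.

```python
import torch.nn as nn

def build_precision_plan(model):
    """Map each layer name to 'fp16', 'int8', or 'skip' using the table's rules."""
    convs = [n for n, m in model.named_modules() if isinstance(m, nn.Conv2d)]
    first_conv = convs[0] if convs else None
    last_conv = convs[-1] if convs else None

    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # First and last convolutions stay in FP16, the rest go to INT8
            plan[name] = "fp16" if name in (first_conv, last_conv) else "int8"
        elif isinstance(module, nn.Linear):
            plan[name] = "int8"
        elif isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
            plan[name] = "skip"  # BatchNorm is left unquantized
    return plan

toy_model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Conv2d(8, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Conv2d(8, 16, 3),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)
plan = build_precision_plan(toy_model)
print(plan)
# {'0': 'fp16', '1': 'skip', '3': 'int8', '4': 'skip', '6': 'fp16', '9': 'int8'}
```

Such a plan is only a first guess; the measured accuracy drops from section 2 should override it wherever the two disagree.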

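3.2 CANN Mixed-Precision Implementation

class MixedPrecisionQuantizer:
    """Mixed-precision quantizer.

    Applies a different precision to each layer based on the sensitivity
    analysis results.

    Approach:
    1. Sensitive layers: keep FP16 weights and run in half precision
    2. Non-sensitive layers: quantize to INT8 and run integer computation
    3. Output layer: FP16, to protect the final accuracy

    Performance comparison (ResNet-50):
    - Full FP32: baseline
    - Full INT8: 2.1x faster, accuracy drops 1.2%
    - Mixed precision: 1.8x faster, accuracy drops 0.3%
    """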

    def __init__(self, sensitive_layers=None):
        self.sensitive_layers = sensitive_layers or []
        self.quantized_count = 0
        self.fp16_count = 0

    def apply(self, model):
        """Apply mixed-precision quantization to the model (modifies it in place)."""
        for name, module in model.named_modules():
            if name in self.sensitive_layers:
                # Sensitive layer: convert to FP16
                self._convert_to_fp16(module)
                self.fp16_count += 1
                print(f"  [FP16] {name}")
            elif isinstance(module, (nn.Conv2d, nn.Linear)):
                # Non-sensitive layer: quantize to INT8
                self._quantize_to_int8(module)
                self.quantized_count += 1
                print(f"  [INT8] {name}")

        print(f"\nQuantization summary: INT8={self.quantized_count}, FP16={self.fp16_count}")
        return model

    def _convert_to_fp16(self, module):
        """Convert weights and bias to FP16."""
        module.weight.data = module.weight.data.half()
        if module.bias is not None:
            module.bias.data = module.bias.data.half()

    def _quantize_to_int8(self, module):
        """Quantize weights to 8 bits."""
        weight = module.weight.data.float()
        n_levels = 255

        w_min = weight.min()
        w_max = weight.max()
        # Guard against a zero scale when all weights are equal
        scale = torch.clamp(w_max - w_min, min=1e-12) / n_levels
        zero_point = torch.round(-w_min / scale)

        w_quant = torch.round(weight / scale) + zero_point
        # Values span 0..255, so store them as uint8; int8 (-128..127) would overflow
        w_quant = torch.clamp(w_quant, 0, n_levels).to(torch.uint8)

        # Store the quantized weights and the quantization parameters
        module.weight.data = w_quant
        module._scale = scale
        module._zero_point = zero_point
        module._is_int8 = True

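3.3 Dequantization at Inference Time

def dequantize_and_inference(model, input_data):
    """Dequantize, then run inference.

    The stored 8-bit weights must be dequantized back to FP16/FP32 before
    computation. This step is cheap (a subtraction and a multiplication by
    scale) and is not a bottleneck.
    """
    model.eval()

    for name, module in model.named_modules():
        if getattr(module, '_is_int8', False):
            # Dequantize the 8-bit weights
            weight_int8 = module.weight.data.float()
            weight_fp16 = (weight_int8 - module._zero_point) * module._scale
            module.weight.data = weight_fp16.half()
            # Keep the bias in the same dtype as the weights
            if module.bias is not None:
                module.bias.data = module.bias.data.half()
            # Clear the flag so a second call does not dequantize again
            module._is_int8 = False

    # Run inference
    with torch.no_grad():
        output = model(input_data.half())

    return output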

4. Quantization Error Diagnostics

4.1 Visualizing the Error Distribution

import matplotlib.pyplot as plt


def visualize_quantization_error(original_weight, quantized_weight, layer_name):
    """Visualize the quantization error distribution.

    quantized_weight is expected to hold dequantized (floating-point) values.

    Good quantization:
    - The signed error is roughly symmetric with mean 0
    - 99% of the errors fall within ±1%

    Problematic quantization:
    - The error distribution is skewed (the scale was chosen poorly)
    - Many large errors (the layer is not suited to INT8)
    """
    signed_error = quantized_weight.float() - original_weight.float()
    error = signed_error.abs()
    relative_error = error / (original_weight.float().abs() + 1e-8)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Signed error distribution (should be centered on 0)
    axes[0].hist(signed_error.cpu().numpy().flatten(), bins=100, alpha=0.7)
    axes[0].set_title(f'{layer_name} - Signed Error')
    axes[0].set_xlabel('Error')
    axes[0].set_ylabel('Count')

    # Relative error distribution
    axes[1].hist(relative_error.cpu().numpy().flatten(), bins=100, alpha=0.7)
    axes[1].set_title(f'{layer_name} - Relative Error')
    axes[1].set_xlabel('Relative error (fraction)')
    axes[1].set_ylabel('Count')

    # Absolute error heatmap (flattened to 2D: output channel x the rest)
    error_2d = error.cpu().numpy().reshape(error.size(0), -1)
    im = axes[2].imshow(error_2d, aspect='auto', cmap='hot')
    axes[2].set_title(f'{layer_name} - Error Heatmap')
    plt.colorbar(im, ax=axes[2])

    plt.tight_layout()
    plt.savefig(f'quant_error_{layer_name.replace(".", "_")}.png', dpi=150)
    plt.show()

    # Summary statistics (absolute error unless noted)
    print(f"\n{layer_name} quantization error statistics:")
    print(f"  Mean: {error.mean().item():.6f}")
    print(f"  Max:  {error.max().item():.6f}")
    print(f"  99th percentile: {torch.quantile(error.flatten(), 0.99).item():.6f}")
    print(f"  Relative error: {relative_error.mean().item():.4%}")

4.2 Output Comparison

def compare_outputs(model_fp32, model_int8, input_data, top_k=5):
    """Compare the outputs of the FP32 and INT8 models.

    Besides the final accuracy, also look at:
    1. Cosine similarity of the output distributions
    2. Agreement rate of the Top-K predictions
    3. Change in confidence
    """
    # FP32 output
    with torch.no_grad():
        output_fp32 = model_fp32(input_data.float())

    # INT8 output, cast to FP32 so both outputs share a dtype
    with torch.no_grad():
        output_int8 = model_int8(input_data.half()).float()

    # Cosine similarity
    cos_sim = torch.nn.functional.cosine_similarity(
        output_fp32.flatten(), output_int8.flatten(), dim=0
    )

    # Top-K agreement (position by position, so order matters)
    _, pred_fp32 = output_fp32.topk(top_k, dim=1)
    _, pred_int8 = output_int8.topk(top_k, dim=1)
    consistency = (pred_fp32 == pred_int8).float().mean().item()

    # Change in confidence
    conf_fp32 = torch.softmax(output_fp32, dim=1).max(dim=1)[0].mean()
    conf_int8 = torch.softmax(output_int8, dim=1).max(dim=1)[0].mean()

    print(f"Output comparison:")
    print(f"  Cosine Similarity: {cos_sim.item():.6f}")
    print(f"  Top-{top_k} agreement: {consistency:.2%}")
    print(f"  FP32 mean confidence: {conf_fp32.item():.4f}")
    print(f"  INT8 mean confidence: {conf_int8.item():.4f}")

    return {
        'cosine_similarity': cos_sim.item(),
        'topk_consistency': consistency,
        'fp32_confidence': conf_fp32.item(),
        'int8_confidence': conf_int8.item(),
    }
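
The Top-K agreement in compare_outputs compares the two index tensors position by position, so two models that pick the same K classes in a different order score below 100%. If order should not matter, a set-based overlap is one alternative. The sketch below is an assumption about what such a metric could look like; `topk_overlap` is a hypothetical helper, not part of the original tooling.

```python
import torch

def topk_overlap(logits_a, logits_b, k=5):
    """Average fraction of classes shared by the two top-k sets, ignoring order."""
    idx_a = logits_a.topk(k, dim=1).indices   # shape (batch, k)
    idx_b = logits_b.topk(k, dim=1).indices   # shape (batch, k)
    # shared[i, j, l] is True when the j-th pick of A equals the l-th pick of B
    shared = idx_a.unsqueeze(2) == idx_b.unsqueeze(1)
    return shared.any(dim=2).float().mean().item()

a = torch.tensor([[0.1, 0.9, 0.5, 0.3]])
b = torch.tensor([[0.1, 0.5, 0.9, 0.3]])   # same top-2 classes, swapped order

positional = (a.topk(2, dim=1).indices == b.topk(2, dim=1).indices).float().mean().item()
print(positional)             # 0.0
print(topk_overlap(a, b, 2))  # 1.0
```

Reporting both numbers separates "the ranking changed" from "a different class entered the Top-K", which usually matters more for accuracy.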

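5. The Complete Tuning Workflow

import copy


def precision_tuning_pipeline(model, train_loader, val_loader, device='npu'):
    """End-to-end accuracy tuning workflow.

    Steps:
    1. Measure the FP32 baseline accuracy
    2. Run the layer-by-layer sensitivity analysis
    3. Decide on the mixed-precision plan
    4. Apply mixed-precision quantization
    5. Verify by comparing outputs

    train_loader is not used by the steps below.
    """
    model = model.to(device)

    print("=" * 60)
    print("Step 1: FP32 Baseline")
    print("=" * 60)

    analyzer = LayerSensitivityAnalyzer(model, val_loader, device)
    analyzer.measure_baseline()

    print("\n" + "=" * 60)
    print("Step 2: Layer Sensitivity Analysis")
    print("=" * 60)

    # analyze_all() returns a sorted list of (name, result); convert it to a dict
    results = dict(analyzer.analyze_all())
    sensitive, quantizable = interpret_sensitivity(results, threshold=0.5)

    print("\n" + "=" * 60)
    print("Step 3: Apply Mixed Precision")
    print("=" * 60)

    quantizer = MixedPrecisionQuantizer(sensitive_layers=sensitive)
    # apply() modifies the model in place, so quantize a copy and keep the FP32 original
    model_mixed = quantizer.apply(copy.deepcopy(model))

    print("\n" + "=" * 60)
    print("Step 4: Verify Output")
    print("=" * 60)

    sample_input = next(iter(val_loader))[0][:1].to(device)
    # Cast the remaining FP32 tensors (biases, BatchNorm) to FP16, then dequantize
    # the stored 8-bit weights so model_mixed can run a plain forward pass
    model_mixed.half()
    dequantize_and_inference(model_mixed, sample_input)

    # Compare outputs
    compare_outputs(model, model_mixed, sample_input)

    return model_mixed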

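6. Common Problems

Problem	Cause	Solution
Full INT8 loses too much accuracy	Sensitive layers were quantized too	Use mixed precision and keep sensitive layers in FP16
Mixed precision brings no speedup	Too many layers stay in FP16	Raise the sensitivity threshold so more layers are quantized
Output is completely wrong after quantization	Scale computed incorrectly	Check the min/max computation; use per-channel quantization
Some layers show very large error	Weight distribution contains outliers	Use percentile clipping instead of min-max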
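
The last two rows mention per-channel quantization and percentile clipping without showing them. The sketch below compares both against per-tensor min-max on a synthetic weight tensor with one injected outlier. The function names, the 0.999 percentile, and the test tensor are assumptions chosen for illustration, and everything here is fake quantization (quantize, then dequantize) rather than a real INT8 kernel.

```python
import torch

def quantize_per_tensor(w, n_levels=255, lo=None, hi=None):
    """Min-max fake quantization; lo/hi optionally override the clipping range."""
    lo = w.min() if lo is None else lo
    hi = w.max() if hi is None else hi
    scale = (hi - lo).clamp_min(1e-12) / n_levels
    zero_point = torch.round(-lo / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, n_levels)
    return (q - zero_point) * scale

def quantize_percentile(w, pct=0.999, n_levels=255):
    """Clip the range to the [1 - pct, pct] quantiles before quantizing."""
    flat = w.flatten().float()
    lo = torch.quantile(flat, 1.0 - pct)
    hi = torch.quantile(flat, pct)
    return quantize_per_tensor(w, n_levels, lo, hi)

def quantize_per_channel(w, n_levels=255):
    """Use a separate scale and zero point for each output channel (dim 0)."""
    flat = w.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-12) / n_levels
    zero_point = torch.round(-lo / scale)
    q = torch.clamp(torch.round(flat / scale) + zero_point, 0, n_levels)
    return ((q - zero_point) * scale).view_as(w)

torch.manual_seed(0)
w = torch.randn(16, 8, 3, 3) * 0.05
w[0, 0, 0, 0] = 5.0  # a single outlier stretches the per-tensor min-max range

results = {}
for name, fn in [("per-tensor min-max", quantize_per_tensor),
                 ("percentile clipping", quantize_percentile),
                 ("per-channel min-max", quantize_per_channel)]:
    abs_err = (fn(w) - w).abs()
    results[name] = (abs_err.median().item(), abs_err.max().item())
    print(f"{name}: median |err| = {results[name][0]:.2e}, "
          f"max |err| = {results[name][1]:.2e}")
```

Percentile clipping lowers the typical error at the cost of a large error on the clipped outlier, while per-channel quantization confines the outlier's effect to its own channel. Which trade-off is acceptable depends on how much the outlier weights matter to the model.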