[Deep Learning Mastery] Chapter 25 | Model Compression and Acceleration: Pruning, Quantization, and Knowledge Distillation
Environment
- Python: 3.10+
- PyTorch: 2.0+
- Recommended hardware: NVIDIA GPU (CUDA 11.8+)
- Tooling: PyCharm / VS Code / Jupyter Notebook
- OS: Windows / macOS / Linux
Abstract
This chapter takes a deep dive into model compression and acceleration for deep learning. It covers the classic techniques of pruning, quantization, knowledge distillation, and low-rank factorization, as well as AWQ, GPTQ, SmoothQuant, and other state-of-the-art (as of 2025) quantization methods for large language models. Complete PyTorch implementations help readers build hands-on skills for deploying large neural networks in resource-constrained environments.
Learning Objectives
- Understand the motivation for model compression and the main families of techniques
- Implement structured and unstructured pruning
- Understand INT8/INT4 quantization and the PTQ/QAT approaches
- Design a teacher-student knowledge distillation framework
- Get familiar with low-rank and matrix factorization techniques
- Learn recent large-model quantization techniques (AWQ, GPTQ, SmoothQuant)
- Deploy to edge devices with TensorRT and ONNX Runtime
1. Overview of Model Compression
1.1 Why Model Compression
Deep learning models are growing exponentially. GPT-3 has 175 billion parameters, and GPT-4 is estimated to exceed 1 trillion. This scale brings three core problems:
Storage: a 1-billion-parameter model in FP32 needs roughly 4 GB of storage; quantized to INT8 it needs only about 1 GB.
Inference latency: large models demand heavy memory bandwidth and compute, making them nearly impossible to run on mobile devices.
Energy consumption: large-model inference draws significant power, which hampers edge deployment.
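The storage figures above follow from a simple byte count. A quick sketch (the helper name is illustrative; it uses binary GB, so a 1-billion-parameter FP32 model lands near the 4 GB figure quoted above):

```python
def model_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate storage needed for a model's parameters, in (binary) GB."""
    return num_params * bytes_per_param / (1024 ** 3)

# A 1-billion-parameter model at different precisions
params = 1e9
fp32 = model_size_gb(params, 4)  # FP32: 4 bytes per parameter
int8 = model_size_gb(params, 1)  # INT8: 1 byte per parameter
print(f"FP32: {fp32:.2f} GB, INT8: {int8:.2f} GB")
```

INT8 cuts storage by exactly 4x relative to FP32, which is why it is the most common deployment precision.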
1.2 A Taxonomy of Compression Techniques
| Technique | Target | Typical methods | Compression ratio |
|---|---|---|---|
| Pruning | Weights/neurons | Magnitude pruning, iterative pruning, structured pruning | 2x-10x |
| Quantization | Numeric precision | INT8, INT4, binarization, FP16 | 2x-32x |
| Knowledge distillation | Model architecture | Teacher-student framework, feature distillation | Smaller model |
| Low-rank factorization | Weight matrices | SVD, Tensor-Train, CP decomposition | 2x-5x |
| NAS-based compression | Network structure | Efficient block search, channel pruning | Architecture-dependent |
2. Model Pruning
2.1 Basic Concepts
Pruning shrinks a model by removing unimportant weights or neurons. A common analogy is the human brain: an infant's neural connections are dense, and "pruning" during development yields an efficient adult network.
Unstructured pruning removes individual weights, producing a sparse matrix. It achieves high compression ratios but needs dedicated sparse-computation support to realize speedups.
Structured pruning removes whole filters or channels, keeping a regular structure. It is easy to deploy but typically achieves lower compression ratios.
2.2 Magnitude Pruning
Magnitude pruning is the simplest approach: remove the weights with the smallest absolute values.
import torch
import torch.nn as nn
import numpy as np
class MagnitudePruner:
    """Magnitude-based pruning"""
    def __init__(self, model, pruning_ratio=0.5):
        self.model = model
        self.pruning_ratio = pruning_ratio
        self.masks = {}
    def compute_masks(self):
        """Compute the pruning masks"""
        for name, param in self.model.named_parameters():
            if 'weight' in name and len(param.shape) > 1:
                # Absolute value of each weight
                weight_abs = torch.abs(param.data)
                # Threshold = k-th smallest magnitude
                flat_weight = weight_abs.flatten()
                k = int(self.pruning_ratio * flat_weight.numel())
                if k > 0:
                    threshold = torch.topk(flat_weight, k, largest=False)[0][-1]
                    # Mask: keep weights strictly above the threshold
                    mask = (weight_abs > threshold).float()
                else:
                    mask = torch.ones_like(param.data)
                self.masks[name] = mask
    def apply_masks(self):
        """Apply the pruning masks"""
        for name, param in self.model.named_parameters():
            if name in self.masks:
                param.data *= self.masks[name]
    def get_sparsity(self):
        """Compute the model's sparsity"""
        total_params = 0
        zero_params = 0
        for name, param in self.model.named_parameters():
            if name in self.masks:
                total_params += param.numel()
                zero_params += (self.masks[name] == 0).sum().item()
        return zero_params / total_params if total_params > 0 else 0
# Example: magnitude pruning on a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Usage example
model = SimpleCNN()
pruner = MagnitudePruner(model, pruning_ratio=0.5)
pruner.compute_masks()
pruner.apply_masks()
sparsity = pruner.get_sparsity()
print(f"Model sparsity: {sparsity:.2%}")
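For comparison, PyTorch ships the same idea in `torch.nn.utils.prune`: `l1_unstructured` applies magnitude pruning through a weight mask, much like the hand-rolled class above. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(16, 8)
# Prune the 50% of weights with the smallest L1 magnitude
prune.l1_unstructured(layer, name='weight', amount=0.5)

# layer.weight is now computed as weight_orig * weight_mask
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2%}")  # 50.00%

# Fold the mask into the parameter permanently
prune.remove(layer, 'weight')
```

`prune.remove` bakes the zeros into the parameter, so the layer can be saved and loaded without the pruning machinery.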
2.3 Iterative Pruning with Fine-Tuning
One-shot pruning can severely hurt accuracy. Iterative pruning alternates small pruning steps with fine-tuning, gradually reaching the target sparsity.
class IterativePruner:
    """Iterative pruning"""
    def __init__(self, model, target_sparsity=0.9, num_iterations=10):
        self.model = model
        self.target_sparsity = target_sparsity
        self.num_iterations = num_iterations
        self.current_sparsity = 0
    def prune_step(self, train_loader, optimizer, criterion, device):
        """A single prune-then-finetune step"""
        # Sparsity target for this step
        sparsity_per_step = self.target_sparsity / self.num_iterations
        self.current_sparsity = min(
            self.current_sparsity + sparsity_per_step,
            self.target_sparsity
        )
        # Prune
        pruner = MagnitudePruner(self.model, self.current_sparsity)
        pruner.compute_masks()
        pruner.apply_masks()
        # Fine-tune
        self.model.train()
        for epoch in range(3):  # fine-tune for 3 epochs after each pruning round
            for batch_idx, (data, target) in enumerate(train_loader):
                data, target = data.to(device), target.to(device)
                optimizer.zero_grad()
                output = self.model(data)
                loss = criterion(output, target)
                loss.backward()
                # Mask the gradients so pruned weights stay at zero
                for name, param in self.model.named_parameters():
                    if name in pruner.masks:
                        param.grad *= pruner.masks[name]
                optimizer.step()
            # Re-apply the masks (weight updates could break sparsity)
            pruner.apply_masks()
        return pruner.get_sparsity()
    def prune(self, train_loader, optimizer, criterion, device):
        """Full iterative pruning loop"""
        for iteration in range(self.num_iterations):
            sparsity = self.prune_step(train_loader, optimizer, criterion, device)
            print(f"Iteration {iteration + 1}/{self.num_iterations}, sparsity: {sparsity:.2%}")
2.4 Structured Pruning
Structured pruning removes entire filters or channels, so no specialized sparse-computation library is needed.
class StructuredPruner:
    """Structured channel pruning"""
    def __init__(self, model):
        self.model = model
        self.pruned_channels = {}
    def compute_channel_importance(self, conv_layer):
        """Channel importance for a conv layer (L1 norm)"""
        weight = conv_layer.weight.data  # [out_channels, in_channels, H, W]
        # L1 norm of each output channel
        importance = torch.sum(torch.abs(weight), dim=[1, 2, 3])
        return importance
    def prune_conv_layer(self, conv_layer, bn_layer, pruning_ratio):
        """Prune a conv layer and its matching BN layer"""
        importance = self.compute_channel_importance(conv_layer)
        num_channels = len(importance)
        num_keep = int(num_channels * (1 - pruning_ratio))
        # Keep the most important channels
        _, keep_indices = torch.topk(importance, num_keep, largest=True)
        keep_indices = keep_indices.sort()[0]
        # Build the new conv layer
        new_conv = nn.Conv2d(
            in_channels=conv_layer.in_channels,
            out_channels=num_keep,
            kernel_size=conv_layer.kernel_size,
            stride=conv_layer.stride,
            padding=conv_layer.padding,
            bias=conv_layer.bias is not None
        )
        # Copy the kept weights
        new_conv.weight.data = conv_layer.weight.data[keep_indices]
        if conv_layer.bias is not None:
            new_conv.bias.data = conv_layer.bias.data[keep_indices]
        # Prune the BN layer
        if bn_layer is not None:
            new_bn = nn.BatchNorm2d(num_keep)
            new_bn.weight.data = bn_layer.weight.data[keep_indices]
            new_bn.bias.data = bn_layer.bias.data[keep_indices]
            new_bn.running_mean = bn_layer.running_mean[keep_indices].clone()
            new_bn.running_var = bn_layer.running_var[keep_indices].clone()
            return new_conv, new_bn, keep_indices
        return new_conv, None, keep_indices
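The core of the channel-pruning step can be checked in isolation. The standalone sketch below (variable names are illustrative) keeps the 8 filters with the largest L1 norm out of 16 and verifies the pruned conv + BN pair still produces a well-shaped output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, 3, padding=1)
bn = nn.BatchNorm2d(16)

# Per-output-channel L1 norms, then the indices of the 8 largest
importance = conv.weight.data.abs().sum(dim=[1, 2, 3])
keep = torch.topk(importance, 8).indices.sort().values

# Rebuild the layers with only the kept channels
new_conv = nn.Conv2d(3, 8, 3, padding=1, bias=conv.bias is not None)
new_conv.weight.data = conv.weight.data[keep]
if conv.bias is not None:
    new_conv.bias.data = conv.bias.data[keep]

new_bn = nn.BatchNorm2d(8)
new_bn.weight.data = bn.weight.data[keep]
new_bn.bias.data = bn.bias.data[keep]
new_bn.running_mean = bn.running_mean[keep].clone()
new_bn.running_var = bn.running_var[keep].clone()

x = torch.randn(1, 3, 28, 28)
out = new_bn(new_conv(x))
print(out.shape)  # torch.Size([1, 8, 28, 28])
```

Note that pruning one conv layer's outputs also changes the input channel count of the *next* layer, which is why whole-model structured pruning needs to propagate `keep` indices forward.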
3. Model Quantization
3.1 Basic Concepts
Quantization converts model weights and activations from floating point (FP32) to low-precision integers (INT8/INT4), dramatically cutting storage and compute.
Symmetric quantization: the zero point is 0 and a single scale factor is used.
Asymmetric quantization: the zero point can shift; both a scale and a zero point are used.
class Quantizer:
    """Basic quantizer"""
    @staticmethod
    def symmetric_quantize(tensor, num_bits=8):
        """Symmetric quantization"""
        qmin = -(2 ** (num_bits - 1))
        qmax = 2 ** (num_bits - 1) - 1
        # Scale factor
        max_val = torch.max(torch.abs(tensor))
        scale = max_val / qmax if max_val != 0 else 1.0
        # Quantize
        quantized = torch.clamp(torch.round(tensor / scale), qmin, qmax)
        return quantized, scale
    @staticmethod
    def asymmetric_quantize(tensor, num_bits=8):
        """Asymmetric quantization"""
        qmin = 0
        qmax = 2 ** num_bits - 1
        min_val = torch.min(tensor)
        max_val = torch.max(tensor)
        # Scale factor and zero point
        scale = (max_val - min_val) / (qmax - qmin) if max_val != min_val else 1.0
        zero_point = qmin - torch.round(min_val / scale)
        zero_point = torch.clamp(zero_point, qmin, qmax)
        # Quantize
        quantized = torch.clamp(
            torch.round(tensor / scale + zero_point),
            qmin, qmax
        )
        return quantized, scale, zero_point
    @staticmethod
    def dequantize(quantized, scale, zero_point=None):
        """Dequantize"""
        if zero_point is None:
            return quantized * scale
        else:
            return (quantized - zero_point) * scale
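A quantize-dequantize round trip loses at most half a quantization step per value. The standalone sketch below mirrors the symmetric path above and checks that bound:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1000)
qmax = 127  # symmetric INT8 range
scale = x.abs().max() / qmax
q = torch.clamp(torch.round(x / scale), -128, 127)
x_hat = q * scale  # dequantize

max_err = (x - x_hat).abs().max().item()
print(f"max round-trip error: {max_err:.6f}, step/2 = {scale.item() / 2:.6f}")
```

Because the scale is set from the tensor's maximum, every value stays in range and the worst-case error is the rounding error `scale / 2`; outliers inflate `scale` and therefore the error for all other values, which motivates the outlier-aware methods in Section 7.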
3.2 Post-Training Quantization (PTQ)
PTQ quantizes a model after training has finished. It requires no retraining and is fast, but may cost some accuracy.
class PostTrainingQuantizer:
    """Post-training quantization"""
    def __init__(self, model, num_bits=8):
        self.model = model
        self.num_bits = num_bits
        self.scales = {}
        self.zero_points = {}
    def calibrate(self, dataloader, num_batches=100):
        """Collect activation statistics from calibration data"""
        self.model.eval()
        activation_stats = {}
        # Forward hooks to capture activations
        handles = []
        def get_activation(name):
            def hook(module, input, output):
                if name not in activation_stats:
                    activation_stats[name] = []
                activation_stats[name].append(output.detach().cpu())
            return hook
        # Register hooks on every conv and linear layer
        for name, module in self.model.named_modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                handle = module.register_forward_hook(get_activation(name))
                handles.append(handle)
        # Run forward passes to gather statistics
        with torch.no_grad():
            for batch_idx, (data, _) in enumerate(dataloader):
                if batch_idx >= num_batches:
                    break
                self.model(data)
        # Remove the hooks
        for handle in handles:
            handle.remove()
        # Compute per-layer quantization parameters
        for name, activations in activation_stats.items():
            all_activations = torch.cat([a.flatten() for a in activations])
            _, scale, zero_point = Quantizer.asymmetric_quantize(
                all_activations, self.num_bits
            )
            self.scales[name] = scale
            self.zero_points[name] = zero_point
    def quantize_model(self):
        """Quantize the model weights"""
        for name, module in self.model.named_modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                # Quantize the weights
                quantized_weight, scale, zero_point = Quantizer.asymmetric_quantize(
                    module.weight.data, self.num_bits
                )
                # Store dequantized weights (a real deployment would keep INT8)
                module.weight.data = Quantizer.dequantize(
                    quantized_weight, scale, zero_point
                )
3.3 Quantization-Aware Training (QAT)
QAT simulates quantization during training so the model learns to tolerate low-precision representations.
class QuantizationAwareTraining:
    """Quantization-aware training via fake quantization"""
    def __init__(self, model, num_bits=8):
        self.model = model
        self.num_bits = num_bits
    def fake_quantize(self, tensor):
        """Fake quantization: quantize, then immediately dequantize.
        torch.round has zero gradient almost everywhere, so the
        straight-through estimator below lets gradients pass through."""
        qmax = 2 ** (self.num_bits - 1) - 1
        scale = tensor.detach().abs().max() / qmax
        if scale == 0:
            return tensor
        q = torch.clamp(torch.round(tensor / scale), -qmax - 1, qmax) * scale
        # Forward value is q; backward gradient is that of the identity
        return (q - tensor).detach() + tensor
    def enable_qat(self):
        """Enable QAT on every conv/linear layer: fake-quantize weights via
        parametrization and fake-quantize inputs via a pre-forward hook."""
        from torch.nn.utils import parametrize
        qat = self
        class FakeQuantWeight(nn.Module):
            def forward(self, w):
                return qat.fake_quantize(w)
        for module in list(self.model.modules()):
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                # Weight is fake-quantized on every access
                parametrize.register_parametrization(
                    module, 'weight', FakeQuantWeight()
                )
                # Input activations are fake-quantized before the layer runs
                module.register_forward_pre_hook(
                    lambda mod, inputs: (qat.fake_quantize(inputs[0]),) + tuple(inputs[1:])
                )
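Because `torch.round` has zero gradient almost everywhere, practical QAT relies on the straight-through estimator (STE): quantize in the forward pass, but let gradients flow through as if the op were the identity. A minimal standalone sketch:

```python
import torch

def fake_quantize_ste(x: torch.Tensor, scale: float, num_bits: int = 8) -> torch.Tensor:
    """Fake-quantize with a straight-through estimator."""
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), qmin, qmax) * scale
    # (q - x).detach() + x equals q in the forward pass, but its gradient
    # w.r.t. x is 1, so training signal flows through the rounding.
    return (q - x).detach() + x

x = torch.randn(10, requires_grad=True)
y = fake_quantize_ste(x, scale=0.1)
y.sum().backward()
print(x.grad)  # all ones: gradients pass straight through
```

The `(q - x).detach() + x` trick is the standard way to express STE in PyTorch without writing a custom autograd Function.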
4. Knowledge Distillation
4.1 How Knowledge Distillation Works
Knowledge distillation transfers knowledge from a large model (the teacher) to a small one (the student). The core idea is to have the student learn the teacher's "soft labels" rather than the hard labels alone.
Soft labels: the teacher's output probability distribution, which encodes inter-class similarity.
Temperature: controls how smooth the soft labels are; a higher temperature flattens the distribution and conveys more knowledge.
4.2 A Teacher-Student Framework
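The effect of temperature is easy to see on a single set of logits (values below are illustrative): dividing by a larger T flattens the softmax, exposing the relative similarity of the non-target classes.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([6.0, 2.0, 1.0])
for T in (1.0, 4.0):
    p = F.softmax(logits / T, dim=0)
    # Higher T -> flatter distribution, more mass on non-target classes
    print(f"T={T}: {[round(v, 3) for v in p.tolist()]}")
```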
import torch.nn.functional as F
class KnowledgeDistillation:
    """Knowledge distillation"""
    def __init__(self, teacher_model, student_model, temperature=4.0, alpha=0.7):
        self.teacher_model = teacher_model
        self.student_model = student_model
        self.temperature = temperature
        self.alpha = alpha  # weight of the distillation loss
    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Combined distillation loss"""
        # Soft-target loss (KL divergence)
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_prob = F.log_softmax(student_logits / self.temperature, dim=1)
        distillation_loss = F.kl_div(
            soft_prob, soft_targets, reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard-target loss (cross entropy)
        hard_loss = F.cross_entropy(student_logits, labels)
        # Weighted combination
        loss = self.alpha * distillation_loss + (1 - self.alpha) * hard_loss
        return loss
    def train_step(self, data, labels, optimizer):
        """One training step"""
        self.teacher_model.eval()
        self.student_model.train()
        # Teacher forward pass (no gradients)
        with torch.no_grad():
            teacher_logits = self.teacher_model(data)
        # Student forward pass
        student_logits = self.student_model(data)
        # Loss
        loss = self.distillation_loss(student_logits, teacher_logits, labels)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
# Feature distillation
class FeatureDistillation:
    """Feature-level distillation"""
    def __init__(self, teacher_model, student_model, feature_layers):
        """
        feature_layers: pairs of layer names to distill between,
        e.g. [('teacher.layer1', 'student.layer1'), ...]
        """
        self.teacher_model = teacher_model
        self.student_model = student_model
        self.feature_layers = feature_layers
        self.teacher_features = {}
        self.student_features = {}
        # Register feature-extraction hooks
        self._register_hooks()
    def _register_hooks(self):
        """Register feature-extraction hooks"""
        def get_feature(storage, name):
            def hook(module, input, output):
                storage[name] = output
            return hook
        for t_name, s_name in self.feature_layers:
            # Look up the layers by name
            t_layer = dict(self.teacher_model.named_modules())[t_name]
            s_layer = dict(self.student_model.named_modules())[s_name]
            # Register the hooks
            t_layer.register_forward_hook(get_feature(self.teacher_features, t_name))
            s_layer.register_forward_hook(get_feature(self.student_features, s_name))
    def feature_loss(self):
        """Feature-distillation loss"""
        loss = 0
        for t_name, s_name in self.feature_layers:
            t_feat = self.teacher_features[t_name]
            s_feat = self.student_features[s_name]
            # If shapes differ, adapt the student features
            if t_feat.shape != s_feat.shape:
                # Match spatial dimensions with adaptive average pooling
                if t_feat.dim() == 4:  # conv features [B, C, H, W]
                    s_feat = F.adaptive_avg_pool2d(s_feat, t_feat.shape[2:])
                # Match channel dimensions with a 1x1 conv
                # (for brevity this creates a fresh, untrained conv on each
                # call; in practice the adapter should be created once and
                # trained jointly with the student)
                if t_feat.shape[1] != s_feat.shape[1]:
                    adapt_conv = nn.Conv2d(
                        s_feat.shape[1], t_feat.shape[1], 1
                    ).to(s_feat.device)
                    s_feat = adapt_conv(s_feat)
            # L2 loss
            loss += F.mse_loss(s_feat, t_feat.detach())
        return loss
4.3 A Complete Distillation Training Loop
def train_with_distillation(teacher, student, train_loader, test_loader,
                            epochs=50, lr=0.001, device='cuda'):
    """Full knowledge-distillation training loop"""
    teacher = teacher.to(device)
    student = student.to(device)
    # Freeze the teacher
    for param in teacher.parameters():
        param.requires_grad = False
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)
    kd = KnowledgeDistillation(teacher, student, temperature=4.0, alpha=0.7)
    best_acc = 0
    for epoch in range(epochs):
        student.train()
        total_loss = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            loss = kd.train_step(data, target, optimizer)
            total_loss += loss
        scheduler.step()
        # Validate
        student.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = student(data)
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()
        acc = 100 * correct / total
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}, Acc: {acc:.2f}%")
        if acc > best_acc:
            best_acc = acc
            torch.save(student.state_dict(), 'best_student.pth')
    print(f"Best accuracy: {best_acc:.2f}%")
    return student
5. Low-Rank and Matrix Factorization
5.1 Singular Value Decomposition (SVD)
SVD factors a weight matrix into the product of three matrices; truncating the singular values yields compression.
class SVDFactorization:
    """SVD-based low-rank factorization"""
    @staticmethod
    def decompose_linear_layer(linear_layer, rank_ratio=0.5):
        """Factor a linear layer into two low-rank layers"""
        weight = linear_layer.weight.data
        bias = linear_layer.bias.data if linear_layer.bias is not None else None
        in_features = linear_layer.in_features
        out_features = linear_layer.out_features
        rank = int(min(in_features, out_features) * rank_ratio)
        # SVD
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Truncate the singular values
        U_r = U[:, :rank]
        S_r = S[:rank]
        Vh_r = Vh[:rank, :]
        # Two low-rank linear layers
        layer1 = nn.Linear(in_features, rank, bias=False)
        layer2 = nn.Linear(rank, out_features, bias=bias is not None)
        # Split sqrt(S) between the two factors
        layer1.weight.data = torch.mm(torch.diag(torch.sqrt(S_r)), Vh_r)
        layer2.weight.data = torch.mm(U_r, torch.diag(torch.sqrt(S_r)))
        if bias is not None:
            layer2.bias.data = bias
        return nn.Sequential(layer1, layer2)
    @staticmethod
    def compress_model(model, rank_ratio=0.5):
        """Compress every linear layer in a model"""
        for name, module in model.named_children():
            if isinstance(module, nn.Linear):
                # Replace with the factored layers
                setattr(model, name,
                        SVDFactorization.decompose_linear_layer(module, rank_ratio))
            else:
                # Recurse into submodules
                SVDFactorization.compress_model(module, rank_ratio)
        return model
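The factorization can be sanity-checked in isolation. The standalone sketch below (variable names are illustrative) truncates a 32x64 weight to rank 16 and verifies that the two-factor product approximates the original layer while cutting the parameter count:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 32)
W = layer.weight.data
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
rank = 16
A = torch.diag(S[:rank].sqrt()) @ Vh[:rank]    # first factor: [rank, 64]
B = U[:, :rank] @ torch.diag(S[:rank].sqrt())  # second factor: [32, rank]

x = torch.randn(5, 64)
exact = x @ W.T + layer.bias
approx = x @ A.T @ B.T + layer.bias
print(f"relative error: {(exact - approx).norm() / exact.norm():.4f}")

# Parameter count drops from 64*32 to (64+32)*16
print(64 * 32, (64 + 32) * 16)  # 2048 1536
```

By the Eckart-Young theorem, truncated SVD gives the best rank-16 approximation in the Frobenius norm, so the error equals the norm of the discarded singular values.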
5.2 Tensor-Train Decomposition
The Tensor-Train (TT) format expresses a high-dimensional tensor as a chain of three-dimensional "core" tensors, which suits compressing convolution kernels.
class TensorTrain:
    """Tensor-Train decomposition (TT-SVD)"""
    @staticmethod
    def tt_decompose(tensor, ranks):
        """
        Decompose a tensor into Tensor-Train format.
        tensor: input tensor with d dimensions
        ranks: list of TT ranks [r0, r1, ..., rd], with r0 = rd = 1
        """
        dims = tensor.shape
        d = len(dims)
        cores = []
        residual = tensor
        r_prev = ranks[0]  # = 1
        for k in range(d - 1):
            # Unfold: rows combine the previous rank with the current mode
            residual = residual.reshape(r_prev * dims[k], -1)
            # SVD of the unfolding
            U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
            # Truncate to the requested rank
            rank = min(ranks[k + 1], len(S))
            U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
            # Save the k-th core [r_k, n_k, r_{k+1}]
            cores.append(U.reshape(r_prev, dims[k], rank))
            # Carry the remainder to the next step
            residual = torch.diag(S) @ Vh
            r_prev = rank
        # The last core absorbs the remainder [r_{d-1}, n_d, r_d]
        cores.append(residual.reshape(r_prev, dims[-1], ranks[-1]))
        return cores
    @staticmethod
    def tt_reconstruct(cores):
        """Reconstruct the tensor from its TT cores"""
        result = cores[0]
        for core in cores[1:]:
            # Contract the trailing rank dimension with the next core's
            # leading rank dimension
            result = torch.tensordot(result, core, dims=([-1], [0]))
        # Drop the boundary rank-1 dimensions
        return result.squeeze(0).squeeze(-1)
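For a 2-D tensor the TT format reduces to a two-core chain, and with full ranks the reconstruction is exact. The standalone sketch below checks the same contraction pattern used in `tt_reconstruct`:

```python
import torch

torch.manual_seed(0)
T = torch.randn(4, 5)
U, S, Vh = torch.linalg.svd(T, full_matrices=False)
r = len(S)  # full rank -> exact reconstruction
core1 = U.reshape(1, 4, r)                     # [r0=1, n1, r1]
core2 = (torch.diag(S) @ Vh).reshape(r, 5, 1)  # [r1, n2, r2=1]

# Contract the shared rank dimension, then drop the boundary dims
result = torch.tensordot(core1, core2, dims=([-1], [0]))
recon = result.squeeze(0).squeeze(-1)
print((recon - T).abs().max().item())  # ~0 (floating-point noise)
```

Truncating `r` below the full rank would trade reconstruction error for a smaller parameter count, exactly as in the SVD case.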
6. Neural Architecture Search for Compression
6.1 RL-Based NAS
import random

class CompressionNAS:
    """Neural architecture search for model compression.
    (This sketch samples configurations randomly; in an RL-based NAS a
    learned controller would replace sample_architecture. The helpers
    get_layer_names, build_model, measure_latency, and evaluate_accuracy
    are placeholders to be supplied by the user.)"""
    def __init__(self, base_model, target_latency, target_accuracy):
        self.base_model = base_model
        self.target_latency = target_latency
        self.target_accuracy = target_accuracy
        # Search space: per-layer compression options
        self.search_space = {
            'pruning_ratio': [0.0, 0.3, 0.5, 0.7],
            'quantization_bits': [32, 16, 8],
            'channel_width': [0.5, 0.75, 1.0]
        }
    def sample_architecture(self):
        """Randomly sample an architecture configuration"""
        config = {}
        for layer_name in self.get_layer_names():
            config[layer_name] = {
                'pruning_ratio': random.choice(self.search_space['pruning_ratio']),
                'quantization_bits': random.choice(self.search_space['quantization_bits']),
                'channel_width': random.choice(self.search_space['channel_width'])
            }
        return config
    def evaluate_architecture(self, config):
        """Evaluate a candidate architecture"""
        # Build the compressed model
        compressed_model = self.build_model(config)
        # Measure latency
        latency = self.measure_latency(compressed_model)
        # Evaluate accuracy
        accuracy = self.evaluate_accuracy(compressed_model)
        # Compute the reward
        reward = self.compute_reward(latency, accuracy)
        return reward, latency, accuracy
    def compute_reward(self, latency, accuracy):
        """Reward for a candidate architecture"""
        # Positive reward only when both constraints are met
        if latency <= self.target_latency and accuracy >= self.target_accuracy:
            return accuracy - 0.1 * (latency / self.target_latency)
        else:
            return -1.0
7. Recent Advances (2025): Quantizing Large Language Models
7.1 AWQ: Activation-Aware Weight Quantization
AWQ (Activation-aware Weight Quantization), presented by MIT at MLSys 2024, starts from the observation that not all weights contribute equally to the model's output: a small fraction of salient weights (roughly 0.1%-1%) is especially sensitive to quantization error.
Core ideas:
- Identify salient weight channels from the activation distribution
- Protect salient weights by scaling them, reducing quantization error
- Achieve W4A16 quantization (4-bit weights, 16-bit activations)
class AWQQuantizer:
    """AWQ: activation-aware weight quantization (simplified sketch)"""
    def __init__(self, model, num_bits=4, group_size=128):
        self.model = model
        self.num_bits = num_bits
        self.group_size = group_size
        self.scales = {}
    def compute_activation_scales(self, calib_data):
        """Compute activation scales to identify salient weight channels"""
        activation_stats = {}
        def hook_fn(name):
            def hook(module, input, output):
                if name not in activation_stats:
                    activation_stats[name] = []
                # Mean activation magnitude per input channel
                act = input[0].detach().abs().mean(dim=0).mean(dim=0)
                activation_stats[name].append(act)
            return hook
        # Register the hooks
        handles = []
        for name, module in self.model.named_modules():
            if isinstance(module, nn.Linear):
                handles.append(module.register_forward_hook(hook_fn(name)))
        # Forward passes to gather statistics
        self.model.eval()
        with torch.no_grad():
            for data in calib_data:
                self.model(data)
        for handle in handles:
            handle.remove()
        # Compute the scales
        for name, acts in activation_stats.items():
            avg_act = torch.stack(acts).mean(dim=0)
            # Salient weight channels correspond to large activations
            self.scales[name] = avg_act / avg_act.max()
    def apply_awq_quantization(self, layer_name, weight, scale):
        """Apply AWQ quantization to one weight matrix
        (assumes in_features is divisible by group_size)"""
        # Scale up salient weights to protect them
        scaled_weight = weight * (1 + scale.unsqueeze(0))
        # Group-wise quantization
        num_groups = weight.shape[1] // self.group_size
        quantized_weight = torch.zeros_like(scaled_weight)
        for g in range(num_groups):
            start = g * self.group_size
            end = start + self.group_size
            w_group = scaled_weight[:, start:end]
            # Per-group scale factor
            w_max = w_group.abs().max()
            qmax = 2 ** (self.num_bits - 1) - 1
            scale_factor = w_max / qmax
            # Quantize
            q_weight = torch.clamp(
                torch.round(w_group / scale_factor),
                -(2 ** (self.num_bits - 1)),
                2 ** (self.num_bits - 1) - 1
            )
            quantized_weight[:, start:end] = q_weight * scale_factor
        # Undo the protective scaling
        final_weight = quantized_weight / (1 + scale.unsqueeze(0))
        return final_weight
7.2 GPTQ: Hessian-Based Quantization
GPTQ is a layer-by-layer post-training quantization method that uses second-order (Hessian) information to minimize quantization error.
Core ideas:
- Builds on the OBS (Optimal Brain Surgeon) framework
- Quantizes layer by layer, compensating remaining weights for the error introduced by already-quantized ones
- Supports arbitrary bit widths (4-bit is most common)
class GPTQQuantizer:
    """GPTQ quantization (simplified sketch)"""
    def __init__(self, num_bits=4, group_size=128, actorder=True):
        self.num_bits = num_bits
        self.group_size = group_size
        self.actorder = actorder  # order columns by activation magnitude
    def quantize_layer(self, layer, inputs):
        """Quantize a single layer"""
        W = layer.weight.data.clone()
        # Inverse Hessian (Fisher-information approximation, with damping)
        H = self.compute_hessian(inputs)
        H_inv = torch.linalg.cholesky_inverse(
            torch.linalg.cholesky(H + 0.01 * torch.eye(H.shape[0]))
        )
        # Order columns by activation magnitude (reduces accumulated error)
        if self.actorder:
            perm = torch.argsort(torch.diag(H), descending=True)
            W = W[:, perm]
            H_inv = H_inv[perm, :][:, perm]
        # Quantize column by column
        Q = torch.zeros_like(W)
        for i in range(W.shape[1]):
            # Quantize the current column
            q_col = self.quantize_column(W[:, i], self.num_bits)
            Q[:, i] = q_col
            # Error compensation: update the remaining columns
            # (the OBS update normalizes by the diagonal of H_inv)
            err = ((W[:, i] - q_col) / H_inv[i, i]).unsqueeze(1)
            if i < W.shape[1] - 1:
                W[:, i+1:] -= err @ H_inv[i, i+1:].unsqueeze(0)
        # Restore the original column order
        if self.actorder:
            inv_perm = torch.argsort(perm)
            Q = Q[:, inv_perm]
        layer.weight.data = Q
        return layer
    def compute_hessian(self, inputs):
        """Approximate the Hessian as H = X^T X / n"""
        H = torch.zeros(inputs.shape[1], inputs.shape[1])
        for x in inputs:
            H += torch.outer(x, x)
        H /= len(inputs)
        return H
    def quantize_column(self, column, num_bits):
        """Quantize a single column"""
        qmax = 2 ** (num_bits - 1) - 1
        scale = column.abs().max() / qmax
        quantized = torch.clamp(
            torch.round(column / scale),
            -(2 ** (num_bits - 1)),
            2 ** (num_bits - 1) - 1
        )
        return quantized * scale
7.3 SmoothQuant: Smoothing Activations for Quantization
SmoothQuant, also from MIT, is a W8A8 quantization method that mathematically shifts quantization difficulty from activations to weights.
Core ideas:
- Observation: activations are harder to quantize than weights because of outliers
- Introduce a per-channel scale s; divide activations by s and multiply weights by s
- Achieves near-lossless W8A8 quantization
class SmoothQuant:
    """SmoothQuant (simplified sketch)"""
    def __init__(self, alpha=0.5):
        """
        alpha: migration strength; 0.5 balances activations and weights
        """
        self.alpha = alpha
        self.scales = {}
    def compute_smooth_scales(self, model, calib_data):
        """Compute the smoothing scales"""
        activation_stats = {}
        weight_stats = {}
        # Collect weight statistics
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                # Per-input-channel weight maxima
                weight_stats[name] = module.weight.data.abs().max(dim=0)[0]
        # Collect activation statistics via forward hooks
        def hook_fn(name):
            def hook(module, input, output):
                if name not in activation_stats:
                    activation_stats[name] = []
                act_max = input[0].abs().max(dim=0)[0]
                activation_stats[name].append(act_max)
            return hook
        handles = []
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                handles.append(module.register_forward_hook(hook_fn(name)))
        model.eval()
        with torch.no_grad():
            for data in calib_data:
                model(data)
        for handle in handles:
            handle.remove()
        # Compute the smoothing factors
        for name in weight_stats.keys():
            if name in activation_stats:
                act_max = torch.stack(activation_stats[name]).max(dim=0)[0]
                w_max = weight_stats[name]
                # SmoothQuant formula: s = act_max^alpha / w_max^(1-alpha)
                self.scales[name] = torch.pow(act_max, self.alpha) / torch.pow(w_max + 1e-8, 1 - self.alpha)
    def apply_smoothing(self, model):
        """Apply the smoothing transform"""
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear) and name in self.scales:
                scale = self.scales[name]
                # Multiply the weights by s
                module.weight.data = module.weight.data * scale.unsqueeze(0)
                # At inference time the activations must be divided by s
                # (usually fused into the preceding LayerNorm or linear
                # layer); only the weight transform is shown here
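In full precision the smoothing transform is a mathematical no-op, since (X / s) @ (W * s)^T == X @ W^T; what changes is that the activation distribution the quantizer sees becomes much flatter. A standalone numeric sketch (the injected outlier channel is illustrative):

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 16)
X[:, 3] *= 50.0  # inject an outlier activation channel
W = torch.randn(32, 16)

alpha = 0.5
act_max = X.abs().max(dim=0).values
w_max = W.abs().max(dim=0).values
s = act_max.pow(alpha) / (w_max + 1e-8).pow(1 - alpha)

out_ref = X @ W.T
out_smooth = (X / s) @ (W * s).T
# Equivalent up to floating-point rounding
print((out_ref - out_smooth).abs().max().item())

# The outlier channel's dynamic range shrinks dramatically
print(X.abs().max().item(), (X / s).abs().max().item())
```

With alpha = 0.5 each smoothed channel's magnitude becomes roughly sqrt(act_max * w_max), so the difficulty is split evenly between activations and weights.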
7.4 LLM.int8(): 8-Bit Matrix Multiplication
LLM.int8(), by Tim Dettmers et al., rests on the finding that large language models develop "emergent features": a small number of hidden dimensions with extreme magnitudes.
Core ideas:
- Decompose the matrix multiplication into a regular part and an outlier part
- Compute the columns containing outliers in FP16
- Compute everything else in INT8
class LLMInt8:
    """LLM.int8() (simplified sketch)"""
    def __init__(self, threshold=6.0):
        """
        threshold: outlier threshold (in standard deviations here;
        the original paper uses an absolute threshold on hidden states)
        """
        self.threshold = threshold
    def int8_matmul(self, A, B):
        """
        8-bit matrix multiplication with outlier handling
        A: activations [batch, in_features]
        B: weights [out_features, in_features]
        """
        # Per-row quantization scales
        A_scale = A.abs().max(dim=-1, keepdim=True)[0] / 127
        B_scale = B.abs().max(dim=-1, keepdim=True)[0] / 127
        # Quantize to INT8
        A_int8 = torch.round(A / A_scale).to(torch.int8)
        B_int8 = torch.round(B / B_scale).to(torch.int8)
        # Find columns containing outliers
        outlier_cols = self.find_outlier_columns(A)
        # Split the computation
        if outlier_cols.any():
            # Regular columns in INT8
            A_normal = A_int8[:, ~outlier_cols]
            B_normal = B_int8[:, ~outlier_cols]
            # Outlier columns in higher precision
            A_outlier = A[:, outlier_cols].float()
            B_outlier = B[:, outlier_cols].float()
            # Compute the two parts separately
            C_normal = torch.matmul(A_normal.float(), B_normal.float().T)
            C_outlier = torch.matmul(A_outlier, B_outlier.T)
            # Dequantize and merge
            C = C_normal * (A_scale * B_scale.T) + C_outlier
        else:
            # No outliers: everything in INT8
            C_int32 = torch.matmul(A_int8.float(), B_int8.float().T)
            C = C_int32 * (A_scale * B_scale.T)
        return C
    def find_outlier_columns(self, A):
        """Find columns that contain outliers"""
        # Detect outliers as deviations from the column mean
        mean = A.mean(dim=0)
        std = A.std(dim=0)
        # Mark any column with at least one outlier entry
        outliers = ((A - mean).abs() > self.threshold * std).any(dim=0)
        return outliers
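The outlier-detection step is easy to demonstrate in isolation. A standalone sketch with an absolute threshold, as in the paper (the injected dimension is illustrative; the class above uses a std-based variant instead):

```python
import torch

torch.manual_seed(0)
A = torch.randn(64, 16)
A[:, 5] += 40.0  # one "emergent feature" dimension with extreme values

threshold = 6.0
outlier_cols = (A.abs() > threshold).any(dim=0)
print(outlier_cols.nonzero().flatten().tolist())  # [5]
```

Only the flagged dimension is routed through FP16; typically fewer than 1% of dimensions are outliers, so the vast majority of the matmul stays in INT8.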
8. Mobile and Edge Deployment
8.1 TensorRT
TensorRT is NVIDIA's deep learning inference optimizer, supporting layer fusion, precision calibration, and automatic kernel tuning.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
class TensorRTExporter:
    """Build and run a TensorRT engine from an ONNX model"""
    def __init__(self, onnx_path, fp16_mode=True, max_batch_size=1):
        self.onnx_path = onnx_path
        self.fp16_mode = fp16_mode
        self.max_batch_size = max_batch_size
        self.engine = None
    def build_engine(self):
        """Build the TensorRT engine"""
        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, logger)
        # Parse the ONNX model
        with open(self.onnx_path, 'rb') as f:
            if not parser.parse(f.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        # Configure the builder: 1 GB workspace
        # (replaces the deprecated max_workspace_size attribute)
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
        if self.fp16_mode:
            config.set_flag(trt.BuilderFlag.FP16)
        # Build the engine (on TensorRT >= 10, use build_serialized_network
        # plus a Runtime instead of build_engine)
        engine = builder.build_engine(network, config)
        self.engine = engine
        return engine
    def infer(self, input_data):
        """Run inference"""
        if self.engine is None:
            self.build_engine()
        context = self.engine.create_execution_context()
        # Allocate device memory
        d_input = cuda.mem_alloc(input_data.nbytes)
        d_output = cuda.mem_alloc(input_data.nbytes)  # assumes output size equals input size
        # Copy input to the device
        cuda.memcpy_htod(d_input, input_data)
        # Run inference
        context.execute_v2([int(d_input), int(d_output)])
        # Copy the result back
        output = np.empty_like(input_data)
        cuda.memcpy_dtoh(output, d_output)
        return output
# Export a PyTorch model to ONNX
def export_to_onnx(model, dummy_input, onnx_path):
    """Export a PyTorch model to ONNX format"""
    model.eval()
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        export_params=True,
        opset_version=13,
        do_constant_folding=True,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }
    )
    print(f"Model exported to: {onnx_path}")
8.2 ONNX Runtime
ONNX Runtime is Microsoft's open-source cross-platform inference engine, with support for many hardware acceleration backends.
import onnxruntime as ort
class ONNXRuntimeInference:
    """ONNX Runtime inference"""
    def __init__(self, onnx_path, use_gpu=False):
        self.onnx_path = onnx_path
        # Session options
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        # Pick execution providers
        if use_gpu and 'CUDAExecutionProvider' in ort.get_available_providers():
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        else:
            providers = ['CPUExecutionProvider']
        # Create the session
        self.session = ort.InferenceSession(
            onnx_path,
            sess_options,
            providers=providers
        )
        # Input/output metadata
        self.input_name = self.session.get_inputs()[0].name
        self.output_name = self.session.get_outputs()[0].name
    def infer(self, input_data):
        """Run inference"""
        # Ensure the input is a numpy array
        if isinstance(input_data, torch.Tensor):
            input_data = input_data.cpu().numpy()
        # Run
        outputs = self.session.run(
            [self.output_name],
            {self.input_name: input_data}
        )
        return outputs[0]
    def quantize_static(self, calibration_data, quantized_path):
        """Static INT8 quantization"""
        from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType
        input_name = self.input_name  # captured so the reader can use it
        class DataReader(CalibrationDataReader):
            def __init__(self, data):
                self.data = data
                self.idx = 0
            def get_next(self):
                if self.idx < len(self.data):
                    data = self.data[self.idx]
                    self.idx += 1
                    return {input_name: data}
                return None
        dr = DataReader(calibration_data)
        quantize_static(
            self.onnx_path,
            quantized_path,
            dr,
            activation_type=QuantType.QInt8,
            weight_type=QuantType.QInt8
        )
        print(f"Quantized model saved to: {quantized_path}")
8.3 Core ML (Apple Devices)
Core ML is Apple's on-device machine learning framework for iOS, macOS, watchOS, and tvOS.
import coremltools as ct
def convert_to_coreml(torch_model, dummy_input, save_path):
    """Convert a PyTorch model to Core ML format"""
    # Trace the model
    traced_model = torch.jit.trace(torch_model, dummy_input)
    # Convert to Core ML (the neuralnetwork format is requested because
    # the weight-quantization utility below targets it)
    mlmodel = ct.convert(
        traced_model,
        inputs=[ct.ImageType(
            name="input",
            shape=dummy_input.shape,
            bias=[-1, -1, -1],  # normalization parameters
            scale=1/255.0
        )],
        classifier_config=None,
        convert_to="neuralnetwork",
        compute_units=ct.ComputeUnit.ALL  # CPU / GPU / Neural Engine
    )
    # Optional: quantize the weights to 8 bits
    mlmodel = ct.models.neural_network.quantization_utils.quantize_weights(
        mlmodel,
        nbits=8
    )
    # Save the model
    mlmodel.save(save_path)
    print(f"Core ML model saved to: {save_path}")
    return mlmodel
9. Model Compression in Practice
9.1 A Complete Compression Pipeline
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import copy
class ModelCompressionPipeline:
    """End-to-end model compression pipeline"""
    def __init__(self, model, device='cuda'):
        self.original_model = model.to(device)
        self.device = device
        self.compressed_model = None
    def step1_pruning(self, pruning_ratio=0.5, fine_tune_epochs=5, train_loader=None):
        """Step 1: structured pruning"""
        print("=" * 50)
        print("Step 1: structured pruning")
        print("=" * 50)
        model = copy.deepcopy(self.original_model)
        pruner = StructuredPruner(model)
        # Prune the conv layers
        for name, module in model.named_modules():
            if isinstance(module, nn.Conv2d):
                # Simplified here; in practice the pruning ratio should
                # depend on each layer's importance
                pass
        # Fine-tune
        if train_loader:
            self._fine_tune(model, train_loader, fine_tune_epochs)
        self.compressed_model = model
        return model
    def step2_quantization(self, calibration_loader=None, method='ptq'):
        """Step 2: quantization"""
        print("=" * 50)
        print("Step 2: quantization")
        print("=" * 50)
        model = copy.deepcopy(self.compressed_model or self.original_model)
        if method == 'ptq':
            quantizer = PostTrainingQuantizer(model, num_bits=8)
            if calibration_loader:
                quantizer.calibrate(calibration_loader)
            quantizer.quantize_model()
        elif method == 'qat':
            qat = QuantizationAwareTraining(model, num_bits=8)
            qat.enable_qat()
        self.compressed_model = model
        return model
    def step3_knowledge_distillation(self, train_loader, test_loader,
                                     student_model=None, epochs=50):
        """Step 3: knowledge distillation"""
        print("=" * 50)
        print("Step 3: knowledge distillation")
        print("=" * 50)
        teacher = self.compressed_model or self.original_model
        if student_model is None:
            # Default: a smaller model as the student
            student_model = self._create_student_model()
        student = train_with_distillation(
            teacher, student_model, train_loader, test_loader,
            epochs=epochs, device=self.device
        )
        self.compressed_model = student
        return student
    def step4_export(self, dummy_input, onnx_path, tensorrt_path=None):
        """Step 4: export for deployment"""
        print("=" * 50)
        print("Step 4: export for deployment")
        print("=" * 50)
        model = self.compressed_model or self.original_model
        model.eval()
        # Export to ONNX
        export_to_onnx(model, dummy_input, onnx_path)
        # Optional: build a TensorRT engine
        if tensorrt_path:
            trt_exporter = TensorRTExporter(onnx_path)
            trt_exporter.build_engine()
            # Save the engine...
        return onnx_path
    def _fine_tune(self, model, train_loader, epochs):
        """Fine-tune a model"""
        model.train()
        optimizer = optim.Adam(model.parameters(), lr=1e-4)
        criterion = nn.CrossEntropyLoss()
        for epoch in range(epochs):
            total_loss = 0
            for data, target in train_loader:
                data, target = data.to(self.device), target.to(self.device)
                optimizer.zero_grad()
                output = model(data)
                loss = criterion(output, target)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print(f"Fine-tune epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.4f}")
    def _create_student_model(self):
        """Create a (smaller) student model"""
        # A smaller ResNet as an example
        return models.resnet18(num_classes=10)
    def evaluate(self, test_loader):
        """Evaluate the current model"""
        model = self.compressed_model or self.original_model
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(self.device), target.to(self.device)
                output = model(data)
                _, predicted = torch.max(output.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()
        accuracy = 100 * correct / total
        print(f"Accuracy: {accuracy:.2f}%")
        return accuracy
# Full usage example
def main():
    """End-to-end model compression example"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # CIFAR-10 is RGB
    ])
    train_dataset = datasets.CIFAR10(root='./data', train=True,
                                     download=True, transform=transform)
    test_dataset = datasets.CIFAR10(root='./data', train=False,
                                    download=True, transform=transform)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
    # Load a pretrained model (the weights argument replaces the
    # deprecated pretrained=True)
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 10)  # adapt to CIFAR-10
    # Build the compression pipeline
    pipeline = ModelCompressionPipeline(model, device)
    # Evaluate the original model
    print("Original model:")
    original_acc = pipeline.evaluate(test_loader)
    # Run the compression steps
    # Step 1: pruning
    pipeline.step1_pruning(pruning_ratio=0.3, fine_tune_epochs=3,
                           train_loader=train_loader)
    # Step 2: quantization
    pipeline.step2_quantization(calibration_loader=train_loader, method='ptq')
    # Evaluate the compressed model
    print("\nCompressed model:")
    compressed_acc = pipeline.evaluate(test_loader)
    # Step 3: knowledge distillation (optional)
    # pipeline.step3_knowledge_distillation(train_loader, test_loader, epochs=30)
    # Step 4: export
    dummy_input = torch.randn(1, 3, 32, 32).to(device)
    pipeline.step4_export(dummy_input, 'compressed_model.onnx')
    print(f"\nCompression summary:")
    print(f"Original accuracy: {original_acc:.2f}%")
    print(f"Compressed accuracy: {compressed_acc:.2f}%")
    print(f"Accuracy drop: {original_acc - compressed_acc:.2f}%")
if __name__ == '__main__':
    main()
9.2 大模型量化实战
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
class LLMQuantizationPipeline:
"""大语言模型量化Pipeline"""
def __init__(self, model_name):
self.model_name = model_name
self.model = None
self.tokenizer = None
def load_model(self, load_in_8bit=False, load_in_4bit=False):
"""加载模型,可选择使用bitsandbytes进行量化加载"""
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
if load_in_8bit or load_in_4bit:
# 使用bitsandbytes进行量化加载
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4", # 4-bit Normal Float
bnb_4bit_use_double_quant=True # 嵌套量化
)
self.model = AutoModelForCausalLM.from_pretrained(
self.model_name,
quantization_config=quantization_config,
device_map="auto"
)
else:
self.model = AutoModelForCausalLM.from_pretrained(self.model_name, torch_dtype=torch.float16)  # 以FP16加载,与后文基准测试的标签一致
return self.model
def apply_gptq(self, dataset, bits=4, group_size=128):
"""应用GPTQ量化"""
try:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
desc_act=False # act-order:是否按激活大小降序处理权重列;False推理更快,True精度略好
)
# 加载并量化模型
model = AutoGPTQForCausalLM.from_pretrained(
self.model_name,
quantize_config
)
# 执行量化
model.quantize(dataset)
# 保存量化模型
model.save_quantized(f"{self.model_name}-gptq-{bits}bit")
return model
except ImportError:
print("请先安装auto-gptq: pip install auto-gptq")
return None
def apply_awq(self, bits=4, group_size=128):
"""应用AWQ量化"""
try:
from awq import AutoAWQForCausalLM
# 加载模型
model = AutoAWQForCausalLM.from_pretrained(self.model_name)
# 执行AWQ量化
model.quantize(
tokenizer=self.tokenizer,
quant_config={"zero_point": True, "q_group_size": group_size, "w_bit": bits}
)
# 保存
model.save_quantized(f"{self.model_name}-awq-{bits}bit")
return model
except ImportError:
print("请先安装AutoAWQ: pip install autoawq")
return None
def benchmark(self, prompt="Hello, how are you?", max_length=50):
"""基准测试"""
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
# 测量推理时间
import time
start = time.time()
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_length=max_length,
num_return_sequences=1
)
elapsed = time.time() - start
result = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
# 计算模型大小
param_size = sum(p.numel() * p.element_size() for p in self.model.parameters())
buffer_size = sum(b.numel() * b.element_size() for b in self.model.buffers())
model_size_mb = (param_size + buffer_size) / 1024**2
print(f"推理时间: {elapsed:.3f}s")
print(f"模型大小: {model_size_mb:.2f} MB")
print(f"生成结果: {result}")
return {
'time': elapsed,
'size_mb': model_size_mb,
'output': result
}
# 使用示例
def demo_llm_quantization():
"""大模型量化演示"""
# 使用较小的模型进行演示
model_name = "gpt2" # 或 "meta-llama/Llama-2-7b-hf" 等
pipeline = LLMQuantizationPipeline(model_name)
# 加载FP16模型
print("加载FP16模型...")
pipeline.load_model()
fp16_results = pipeline.benchmark()
# 加载INT8量化模型
print("\n加载INT8量化模型...")
pipeline.load_model(load_in_8bit=True)
int8_results = pipeline.benchmark()
# 加载INT4量化模型
print("\n加载INT4量化模型...")
pipeline.load_model(load_in_4bit=True)
int4_results = pipeline.benchmark()
# 对比结果
print("\n" + "=" * 50)
print("量化效果对比")
print("=" * 50)
print(f"FP16: 大小={fp16_results['size_mb']:.2f}MB, 时间={fp16_results['time']:.3f}s")
print(f"INT8: 大小={int8_results['size_mb']:.2f}MB, 时间={int8_results['time']:.3f}s")
print(f"INT4: 大小={int4_results['size_mb']:.2f}MB, 时间={int4_results['time']:.3f}s")
print(f"INT8压缩比: {fp16_results['size_mb']/int8_results['size_mb']:.2f}x")
print(f"INT4压缩比: {fp16_results['size_mb']/int4_results['size_mb']:.2f}x")
10. 避坑小贴士
10.1 剪枝相关
问题:一次性剪枝90%的权重导致模型完全失效。
解决方案:采用渐进式剪枝策略,每次剪枝不超过30%,配合微调恢复性能。可参考彩票假说(Lottery Ticket Hypothesis)指导剪枝:先训练大模型,再从中找出可独立训练到相近精度的"中奖"子网络。
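"每轮不超过30%"的渐进式策略效果可以用一个简单的稀疏度计算直观感受(纯演示代码,函数名为示意):

```python
def cumulative_sparsity(per_round: float = 0.3, rounds: int = 7) -> float:
    """每轮剪掉剩余权重的 per_round,计算 rounds 轮后的累计稀疏度"""
    kept = (1.0 - per_round) ** rounds  # 每轮保留 70% 的剩余权重
    return 1.0 - kept

# 每轮剪 30%,7 轮后累计稀疏度约 91.8%,远比一次性剪掉 90% 温和
print(f"{cumulative_sparsity():.3f}")  # 0.918
```

每轮之间插入微调,让留下的权重有机会补偿被剪掉的部分,这正是渐进式剪枝优于一次性剪枝的原因。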
问题:非结构化剪枝后模型速度没有提升。
解决方案:非结构化剪枝需要专用稀疏计算库(如cuSPARSE)才能加速。如需立即见效,使用结构化剪枝移除整个通道。
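若想在通用硬件上立刻得到可部署的规则稀疏,可以用PyTorch自带的torch.nn.utils.prune做结构化剪枝。下面是一个最小示意(层的形状与剪枝比例均为演示取值):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# 对一个玩具卷积层按输出通道(dim=0)做 L2 范数结构化剪枝
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)

# 被剪掉的 4 个输出通道权重整体置零,后续可物理移除这些通道真正缩小模型
zero_channels = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(zero_channels)  # 4
```

注意ln_structured只是把整通道权重置零并挂上掩码,要真正减少计算量还需重建一个通道数更少的层并拷贝剩余权重。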
10.2 量化相关
问题:PTQ后模型准确率大幅下降。
解决方案:
- 使用更多校准数据(建议至少1000个样本)
- 采用逐层量化而非全局量化
- 对敏感层(如第一层和最后一层)保持FP32精度
- 考虑使用QAT进行量化感知训练
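校准的本质是用少量数据估计缩放因子。下面用一个对称逐张量INT8量化的最小示意说明scale的来源与量化误差的上界(纯演示函数,非任何量化库的官方API):

```python
import torch

def quantize_int8_symmetric(x: torch.Tensor):
    """对称量化:零点为 0,scale 由张量的最大绝对值确定"""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(64, 64)
q, scale = quantize_int8_symmetric(w)
# 逐元素误差不超过 0.5 * scale(四舍五入误差)
max_err = (dequantize(q, scale) - w).abs().max().item()
```

由此可见scale完全由校准统计量(这里是最大绝对值)决定:校准数据越能代表真实分布,量化误差越可控;个别异常值会把scale撑大,使其余权重的有效精度下降。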
问题:INT4量化导致模型输出混乱。
解决方案:
- INT4量化对异常值极其敏感,使用AWQ或GPTQ保护显著权重
- 采用分组量化(group_size=128)而非逐张量量化
- 考虑使用NF4(4-bit Normal Float)数据类型
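分组量化的核心是让每组权重共享独立的scale,从而把异常值的影响限制在组内。一个最小示意如下(对称INT4,范围取[-7, 7],函数与取值均为演示用):

```python
import torch

def groupwise_int4_symmetric(w: torch.Tensor, group_size: int = 128):
    """逐组对称 INT4 量化:每 group_size 个权重共享一个 scale"""
    flat = w.flatten()
    assert flat.numel() % group_size == 0, "权重数需能被 group_size 整除"
    groups = flat.view(-1, group_size)
    scales = groups.abs().amax(dim=1, keepdim=True) / 7.0  # 每组独立的缩放因子
    q = torch.clamp(torch.round(groups / scales), -7, 7)
    return q, scales

w = torch.randn(4, 256)  # 1024 个权重 -> 8 组
q, scales = groupwise_int4_symmetric(w)
print(q.abs().max().item() <= 7, scales.shape)  # True torch.Size([8, 1])
```

group_size越小,scale对局部分布的刻画越精细、异常值隔离越好,但需要存储的scale也越多,128是常见的折中。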
10.3 知识蒸馏相关
问题:学生模型无法学习教师的知识。
解决方案:
- 调整温度参数(通常T=2-8效果较好)
- 平衡软标签损失和硬标签损失(alpha=0.5-0.9)
- 确保教师和学生架构兼容(如都是CNN或都是Transformer)
- 使用中间层特征蒸馏增强监督信号
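综合上述温度与权重设置,一个典型的蒸馏损失可以写成软标签KL散度与硬标签交叉熵的加权和(T、alpha的取值为示意):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.7):
    """alpha 加权软标签损失,(1 - alpha) 加权硬标签损失"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # 乘 T^2 抵消温度对梯度量级的缩放
    hard = F.cross_entropy(student_logits, target)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
```

注意KL项的学生侧传入log_softmax、教师侧传入softmax,这是F.kl_div的约定;乘T^2则保证调大温度时软标签项的梯度不会被系统性压小。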
问题:蒸馏训练收敛极慢。
解决方案:
- 使用预训练的学生模型作为起点
- 先单独训练学生模型,再引入蒸馏损失
- 逐步增加蒸馏损失的权重
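"逐步增加蒸馏损失的权重"可以用一个线性预热调度实现,示意如下(warmup_epochs等取值为假设):

```python
def distill_alpha(epoch: int, warmup_epochs: int = 10, max_alpha: float = 0.7) -> float:
    """前 warmup_epochs 个 epoch 将蒸馏损失权重从 0 线性升到 max_alpha"""
    return max_alpha * min(1.0, epoch / warmup_epochs)

# epoch 0 时只用硬标签,到第 10 个 epoch 蒸馏权重升至 0.7 并保持
print(distill_alpha(0), distill_alpha(5), distill_alpha(20))  # 0.0 0.35 0.7
```

训练循环里每个epoch把该权重代入前文的alpha即可:先让学生在硬标签上站稳,再逐步引入教师的软标签监督。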
10.4 部署相关
问题:TensorRT转换失败。
解决方案:
- 确保ONNX opset版本兼容(推荐使用13+)
- 检查动态轴设置是否正确
- 使用polygraphy工具调试转换问题
- 某些操作(如自定义算子)可能需要插件实现
问题:移动端推理速度不达预期。
解决方案:
- 使用Neural Engine(ANE)进行推理(Core ML)
- 确保模型输入尺寸固定,避免动态形状
- 考虑使用MobileNet等轻量级架构而非压缩大模型
11. 本章小结
本章系统介绍了深度学习模型压缩与加速的核心技术:
模型剪枝:通过移除不重要权重减小模型规模。结构化剪枝易于部署,非结构化剪枝压缩比更高。迭代剪枝配合微调是最佳实践。
模型量化:将FP32转换为INT8/INT4,降低存储和计算需求。PTQ快速但可能损失精度,QAT训练成本高但效果更好。
知识蒸馏:将大模型知识迁移到小模型。软标签传递类别相似性信息,特征蒸馏提供中间层监督。
低秩分解:SVD和Tensor-Train分解减少权重矩阵参数量,适合线性层压缩。
2025年最新技术:
- AWQ:保护显著权重,实现W4A16量化
- GPTQ:基于Hessian的逐层量化
- SmoothQuant:通过数学变换平衡激活和权重量化难度
- LLM.int8():分离异常值,实现高效8位推理
部署优化:TensorRT针对NVIDIA GPU优化,ONNX Runtime跨平台,Core ML针对Apple设备优化。
12. 知识点回顾
| 概念 | 关键要点 |
|---|---|
| 幅度剪枝 | 移除绝对值最小的权重 |
| 结构化剪枝 | 移除整个滤波器/通道,保持规则结构 |
| 对称量化 | 零点为0,使用单一缩放因子 |
| 非对称量化 | 使用缩放因子和零点两个参数 |
| PTQ | 训练后量化,速度快但可能损失精度 |
| QAT | 量化感知训练,精度高但训练成本高 |
| 软标签 | 教师模型的概率输出,包含类别相似性 |
| 温度参数 | 控制软标签平滑程度,T越大越平滑 |
| AWQ | 保护显著权重,实现4位量化 |
| GPTQ | 基于OBS框架的逐层量化 |
| SmoothQuant | 将量化难度从激活迁移到权重 |
| TensorRT | NVIDIA推理优化器,支持层融合 |
参考文献
- Han S, Mao H, Dally W. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.
- Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network. NIPS 2014 Workshop.
- Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024.
- Frantar E, Ashkboos S, Hoefler T, et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
- Xiao G, Lin J, Seznec M, et al. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023.
- Dettmers T, Lewis M, Belkada Y, et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022.
本文档由《深度学习精通》系列教程自动生成,转载请注明出处。