Section 41: Hugging Face Transformers from Beginner to Expert [Part 5: Inference Optimization and Deployment & Part 6: Multimodal and Advanced Applications]

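Table of Contents
[Hugging Face Transformers from Beginner to Expert] Full-Series Guide
Part 1: Getting Started
Part 2: Core APIs in Depth
Part 3: Data Processing and Training
Part 4: Parameter-Efficient Fine-Tuning (PEFT) and Advanced Training
Part 5: Inference Optimization and Deployment & Part 6: Multimodal and Advanced Applications
Part 7: Case Studies
Part 8: FAQ and Debugging Guide
Part 5: Inference Optimization and Deployment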
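15. Model Inference Optimization
15.1 Reduced-Precision Inference: torch.float16 / bfloat16
During inference, the forward pass tolerates lower numerical precision than training does. Casting the weights from the default float32 (32-bit floating point) to float16 (16-bit floating point) or bfloat16 (Brain Float 16) halves the memory taken by the weights and, on GPUs with native half-precision support, often speeds up inference by roughly 1.5-2x. The actual gain depends on the hardware, the model size, and the batch size.
The two 16-bit formats make different trade-offs. float16 gives up range: its largest finite value is 65504 and its smallest positive value is about 6e-8, so extreme values can overflow or underflow. bfloat16 keeps the same 8 exponent bits as float32, so its range is essentially the same, but it has fewer mantissa bits (7, versus 10 for float16) and therefore lower precision. That trade-off suits most deep learning workloads.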
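The range and precision figures above follow directly from the bit layouts. As a minimal sketch in plain Python (no PyTorch needed), the hypothetical helper `format_limits` below derives them from the number of exponent and mantissa bits, assuming the usual IEEE-754-style conventions (biased exponent, all-ones exponent reserved for inf/NaN):

```python
def format_limits(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive basic limits of an IEEE-754-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable one is 2^e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return {
        "largest": (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent,
        "smallest_subnormal": 2.0 ** (1 - bias - mantissa_bits),
        "epsilon": 2.0 ** -mantissa_bits,  # spacing between 1.0 and the next value
    }


FP16 = format_limits(exponent_bits=5, mantissa_bits=10)
BF16 = format_limits(exponent_bits=8, mantissa_bits=7)
FP32 = format_limits(exponent_bits=8, mantissa_bits=23)

for name, limits in [("float16", FP16), ("bfloat16", BF16), ("float32", FP32)]:
    print(f"{name:9s} max={limits['largest']:.4g}  "
          f"min_subnormal={limits['smallest_subnormal']:.3g}  eps={limits['epsilon']:.3g}")
```

The output makes the trade-off visible: bfloat16 reaches almost the same maximum as float32, while float16 has a finer step size (smaller epsilon) but tops out at 65504.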
# File: half_precision_inference.py
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"

# 1. FP32 inference (the default dtype)
print("=== FP32 inference ===")
model_fp32 = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. FP16 inference
print("\n=== FP16 inference ===")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # load the weights as FP16
    device_map="auto",
)

# 3. BF16 inference (native support needs an Ampere or newer GPU)
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    print("\n=== BF16 inference ===")
    model_bf16 = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

# Benchmark helper
def benchmark(model, tokenizer, prompt, device, num_runs=10):
    """Return the average latency in seconds of one generate() call."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    model.eval()
    # Warm-up
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10)
    # Timed runs
    start = time.time()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model.generate(**inputs, max_new_tokens=20)
    elapsed = time.time() - start
    return elapsed / num_runs

prompt = "The future of artificial intelligence is"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Note: an FP32 model can be large; this script only demonstrates the idea.
# In practice, benchmark the same model at each precision on your own hardware.
if device == "cuda":
    fp16_latency = benchmark(model_fp16, tokenizer, prompt, model_fp16.device)
    print(f"FP16 average latency: {fp16_latency:.3f} s")

print("\nChoosing a precision:")
print(" - float16: general-purpose choice, clear speed and memory gains")
print(" - bfloat16: needs an Ampere GPU (A100, RTX 3090/4090, etc.), better numerical stability")
print(" - float32: only for precision-sensitive scenarios (such as mathematical reasoning)")
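Notes:
- model.to(device) only moves the weights to the target device; it does not change their dtype. Choose the dtype with torch_dtype at load time (or convert afterwards, for example with model.half()), and keep the inputs on the same device as the model.
- In generation tasks, the logits are computed in the model's dtype.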
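15.2 Model Quantization: 8-bit / 4-bit with bitsandbytes
Quantization is a more aggressive way to reduce precision. Rather than simply casting to a narrower float type, it maps values to a low-bit representation and stores extra parameters, typically a scale and sometimes a zero point, so that more information survives at low bit widths.
The bitsandbytes library provides efficient 8-bit and 4-bit quantization and is particularly well suited to running large models on consumer GPUs.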
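To illustrate what "scale and zero point" mean, here is a minimal sketch of generic affine (asymmetric) quantization in plain Python. It is a textbook scheme shown for intuition only; it is not the algorithm bitsandbytes uses (LLM.int8 and NF4 work differently), and the function names are hypothetical.

```python
def quantize_affine(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2^num_bits - 1] with a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for constant input
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_affine(quantized, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max round-trip error:", max_error)
```

The round-trip error stays within about one quantization step (`scale`), which is why narrower value ranges, for example per-block scales, lose less information.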
# File: quantization_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "Qwen/Qwen2.5-7B"  # 7B model, about 14 GB in FP16

# 1. 8-bit quantization config
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,       # outlier threshold: feature dimensions above it stay in higher precision
    llm_int8_skip_modules=None,   # modules to leave unquantized
)

# 2. 4-bit quantization config (more aggressive, lower memory use)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

print("=== Loading the 8-bit quantized model ===")
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"8-bit model memory footprint: {model_8bit.get_memory_footprint() / 1e9:.2f} GB")

# Inference test
inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt").to(model_8bit.device)
with torch.no_grad():
    outputs = model_8bit.generate(**inputs, max_new_tokens=100, do_sample=True)
print("Generated text:", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Free GPU memory before loading the next model
del model_8bit
torch.cuda.empty_cache()

print("\n=== Loading the 4-bit quantized model ===")
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config_4bit,
    device_map="auto",
)
print(f"4-bit model memory footprint: {model_4bit.get_memory_footprint() / 1e9:.2f} GB")
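Quantization: precision versus performance (rough figures; actual results depend on the model, the GPU, and library versions, so measure on your own hardware)
| Precision | Weight memory (7B) | Inference speed | Quality loss |
|---|---|---|---|
| FP16 | ~14 GB | baseline | none |
| 8-bit | ~8 GB | ~90% | <1% |
| 4-bit | ~4 GB | ~85% | 1-3% |
For most applications, 4-bit quantization is an excellent choice: the weights of a 7B model shrink to about 4 GB, so the model can run on a single GPU with around 6 GB of memory, provided the context length and batch size stay small.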
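The memory column can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bits per parameter. The sketch below assumes a round 7e9 parameters and ignores quantization constants, activations, and the KV cache, so real footprints are somewhat higher. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: a round "7B" parameter count
for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:6s} ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The gap between these lower bounds (7 GB and 3.5 GB for 8-bit and 4-bit) and the table comes from overhead such as unquantized layers and per-block scaling constants.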
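15.3 Batched Inference and Dynamic Batching Strategies
Batching is one of the most effective ways to raise inference throughput. Merging several requests into one batch makes full use of the GPU's parallelism.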
# File: batch_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import torch

model_name = "Qwen/Qwen2.5-0.5B"
# Decoder-only models should be padded on the left for batched generation,
# so that every prompt ends right where generation starts.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Prepare several inputs
prompts = [
    "What is the capital of France?",
    "Explain the theory of relativity.",
    "Write a haiku about spring.",
    "What are the benefits of exercise?",
] * 10  # 40 requests in total

# 1. Serial inference
print("=== Serial inference ===")
start = time.time()
single_results = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=50)
    single_results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
serial_time = time.time() - start
print(f"Serial time: {serial_time:.2f} s")

# 2. Static batching (all sequences padded to the same length)
print("\n=== Static batching ===")
# Tokenize each prompt on its own first, just to report the longest length
encodings = [tokenizer(p, return_tensors="pt") for p in prompts]
max_len = max(enc["input_ids"].shape[1] for enc in encodings)
print(f"Longest sequence: {max_len} tokens")

# Encode the whole batch with padding and truncation
batch_inputs = tokenizer(
    prompts,
    padding=True,      # pad to the longest sequence in the batch
    truncation=True,
    max_length=512,
    return_tensors="pt",
).to(model.device)
start = time.time()
with torch.no_grad():
    batch_outputs = model.generate(
        **batch_inputs,
        max_new_tokens=50,
        pad_token_id=tokenizer.pad_token_id,
    )
batch_time = time.time() - start
print(f"Batched time: {batch_time:.2f} s")
print(f"Speedup: {serial_time / batch_time:.2f}x")

# 3. Dynamic batching (simulated with a simple queue)
class DynamicBatcher:
    """A simple dynamic batcher that accumulates requests until max_batch_size is reached.

    max_wait_time is stored but not enforced here; a real server would also
    flush a partial batch once the oldest request has waited that long.
    """

    def __init__(self, model, tokenizer, max_batch_size=8, max_wait_time=0.1):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.queue = []

    def add_request(self, prompt, callback):
        self.queue.append((prompt, callback))
        if len(self.queue) >= self.max_batch_size:
            self.process_batch()

    def process_batch(self):
        if not self.queue:
            return
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        prompts = [item[0] for item in batch]
        callbacks = [item[1] for item in batch]
        inputs = self.tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(**inputs, max_new_tokens=50)
        decoded = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        for cb, text in zip(callbacks, decoded):
            cb(text)

print("\nIn production, dynamic batching is usually combined with a message queue; this is only a sketch.")
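Batching best practices:
- For LLM generation, output lengths vary widely across inputs; set pad_token_id so padding is handled correctly, and pad decoder-only models on the left (padding_side="left").
- Batch size is limited by GPU memory; start at 4 and increase gradually.
- For variable-length sequences, prefer padding='longest' over padding to a fixed max_length to save computation.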
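The padding advice can be illustrated without a tokenizer. The following minimal sketch pads lists of token ids by hand; `pad_batch`, `padding_waste`, and their parameters are hypothetical names used only for this illustration, and the token ids are made up.

```python
def pad_batch(sequences, pad_id, side="left", target_len=None):
    """Pad token-id lists to a common length; return (input_ids, attention_mask)."""
    width = target_len or max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        if n_pad < 0:
            raise ValueError("sequence longer than target_len")
        pads, real = [pad_id] * n_pad, list(seq)
        if side == "left":
            input_ids.append(pads + real)
            attention_mask.append([0] * n_pad + [1] * len(real))
        else:
            input_ids.append(real + pads)
            attention_mask.append([1] * len(real) + [0] * n_pad)
    return input_ids, attention_mask


def padding_waste(attention_mask):
    """Fraction of positions in the batch that are padding."""
    total = sum(len(row) for row in attention_mask)
    return 1 - sum(map(sum, attention_mask)) / total


batch = [[5, 6, 7], [8]]
ids, mask = pad_batch(batch, pad_id=0)                    # pad to the longest sequence
_, fixed_mask = pad_batch(batch, pad_id=0, target_len=8)  # pad to a fixed length
print(ids, mask)
print(f"waste (longest): {padding_waste(mask):.2f}, waste (fixed 8): {padding_waste(fixed_mask):.2f}")
```

With left padding, the last position of every row is a real token, which is what a decoder-only model needs in order to continue each prompt. Padding to the longest sequence also wastes far fewer positions than padding to a fixed length.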
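15.4 Compilation: torch.compile with Transformers
PyTorch 2.0 introduced torch.compile, which captures the model's computation graph and compiles it into optimized kernels. This can noticeably speed up inference, and transformer-style models are a good fit.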
# File: torch_compile_demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

# 1. Uncompiled model
def inference_uncompiled(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# 2. Compile the model with torch.compile
# Note: compilation mainly optimizes the forward pass, so the gain for generation is limited.
# For pure forward inference (such as classification) the effect is more pronounced.
compiled_model = torch.compile(model, mode="reduce-overhead")  # modes: default, reduce-overhead, max-autotune

def inference_compiled(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = compiled_model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Warm-up (the first compiled calls trigger compilation)
print("Warming up...")
for _ in range(3):
    _ = inference_uncompiled("Hello")
    _ = inference_compiled("Hello")

# Benchmark
prompt = "Explain the importance of recycling."
num_runs = 10
start = time.time()
for _ in range(num_runs):
    _ = inference_uncompiled(prompt)
uncompiled_time = time.time() - start

start = time.time()
for _ in range(num_runs):
    _ = inference_compiled(prompt)
compiled_time = time.time() - start

print(f"\nUncompiled average latency: {uncompiled_time / num_runs * 1000:.2f} ms")
print(f"Compiled average latency: {compiled_time / num_runs * 1000:.2f} ms")
print(f"Speedup: {uncompiled_time / compiled_time:.2f}x")

# Caveats
print("\nCaveats:")
print("1. The first compilation takes time (tens of seconds), so it suits long-running deployments")
print("2. Compilation helps less with dynamic shapes (variable-length sequences)")
print("3. Some models or operations may be incompatible; test before relying on it")
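The three torch.compile modes:
- "default": suits most models, with moderate compile time.
- "reduce-overhead": reduces Python overhead; suits small models or code that is called very frequently.
- "max-autotune": tries many kernel optimizations; it has the longest compile time but the best runtime performance.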
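15.5 Comparing Efficient Inference Engines: vLLM, TGI, llama.cpp
For production-grade LLM serving, a dedicated inference engine usually outperforms native Transformers. The table below compares three mainstream engines; the throughput figures are rough, workload-dependent estimates relative to native Transformers.
| Feature | vLLM | TGI (Text Generation Inference) | llama.cpp |
|---|---|---|---|
| Core optimization | PagedAttention | Flash Attention + continuous batching | 4-bit quantization + CPU/GPU hybrid execution |
| Throughput gain | 10-20x | 5-10x | 2-5x |
| GPU support | Best | Best | Supported but limited |
| CPU inference | Not a primary target | Not a primary target | Best |
| Quantization support | GPTQ, AWQ | GPTQ, AWQ | GGUF |
| Deployment difficulty | Medium | Medium | Easy |
| API compatibility | OpenAI API | OpenAI API | Own API (its bundled server also offers OpenAI-compatible endpoints) |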
# File: inference_engines_comparison.py
# Shows how to use vLLM, TGI, and llama.cpp; each needs a separate installation.
# ============================================================
# vLLM example (install with: pip install vllm)
# ============================================================
"""
from vllm import LLM, SamplingParams

# Load the model
llm = LLM(model="Qwen/Qwen2.5-7B")
# Configure sampling
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
# Batched inference
prompts = ["Hello, how are you?", "What is the capital of France?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")
"""
# ============================================================
# TGI deployment (requires Docker)
# ============================================================
"""
# Start the TGI server (shell command)
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id Qwen/Qwen2.5-7B \
    --max-total-tokens 4096

# Call the API (Python)
import requests
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain quantum computing.",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])
"""
# ============================================================
# llama.cpp example (suited to CPU or hybrid inference)
# ============================================================
"""
# 1. Convert the model to GGUF with llama.cpp's conversion script
#    (convert_hf_to_gguf.py in recent versions; older versions shipped convert.py)
# 2. Run it through llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="./qwen-7b-q4_K_M.gguf", n_ctx=4096, n_gpu_layers=35)  # offload some layers to the GPU
output = llm("Q: What is AI? A:", max_tokens=200, stop=["Q:", "\n"], echo=False)
print(output["choices"][0]["text"])
"""
print("Recommendations:")
print(" - High-throughput GPU serving (many users): vLLM")
print(" - Official Hugging Face deployment: TGI")
print(" - Consumer GPUs or CPU-only inference: llama.cpp")
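16. Model Export and Serving
16.1 Exporting to ONNX and Inference with ONNX Runtime
ONNX (Open Neural Network Exchange) is an open model format that lets a model run across different frameworks and hardware. ONNX Runtime is a high-performance inference engine developed by Microsoft.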
# File: onnx_export_inference.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# ============================================================
# Method 1: export to ONNX with the Optimum library (recommended)
# ============================================================
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_name,
    export=True,                      # export to ONNX automatically
    provider="CPUExecutionProvider",  # inference backend
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Inference with ONNX Runtime
inputs = tokenizer("I love Hugging Face!", return_tensors="pt")
outputs = ort_model(**inputs)
print(f"ONNX Runtime output: {outputs.logits}")

# Save the ONNX model
ort_model.save_pretrained("./onnx_model")
tokenizer.save_pretrained("./onnx_model")

# ============================================================
# Method 2: export manually with torch.onnx
# ============================================================
# return_dict=False makes the model return plain tuples, which are easier to trace
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
model.eval()

# Build an example input
dummy_input = tokenizer("Example text", return_tensors="pt")
input_names = ["input_ids", "attention_mask"]
output_names = ["logits"]
torch.onnx.export(
    model,
    (dummy_input["input_ids"], dummy_input["attention_mask"]),
    "model.onnx",
    input_names=input_names,
    output_names=output_names,
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "logits": {0: "batch_size"},
    },
    opset_version=14,
)
print("Manual export finished: model.onnx")

# ============================================================
# Inference with ONNX Runtime directly (no Transformers model needed)
# ============================================================
import onnxruntime as ort

# Load the ONNX model
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Prepare the inputs as NumPy arrays
input_ids = dummy_input["input_ids"].numpy()
attention_mask = dummy_input["attention_mask"].numpy()

# Run inference
outputs = session.run(["logits"], {"input_ids": input_ids, "attention_mask": attention_mask})
print(f"Native ONNX Runtime inference: {outputs[0]}")
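Advantages of ONNX:
- Cross-framework deployment (both PyTorch and TensorFlow models can be converted to ONNX)
- Hardware acceleration support (CUDA, TensorRT, OpenVINO)
- Fewer inference dependencies (the full Transformers library is not needed at runtime)
16.2 Exporting to TensorRT and Inference with TensorRT-LLM
TensorRT is NVIDIA's high-performance deep learning inference SDK. It applies deep optimizations such as layer fusion and precision calibration. TensorRT-LLM is a version optimized specifically for LLMs.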
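# File: tensorrt_deployment.py
# Requires an NVIDIA GPU with TensorRT installed, plus: pip install "optimum[onnxruntime-gpu]"
# ============================================================
# Run an exported model on TensorRT via Optimum and ONNX Runtime (simplified workflow)
# ============================================================
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Export to ONNX and let ONNX Runtime's TensorRT execution provider build the engine
trt_model = ORTModelForSequenceClassification.from_pretrained(
    model_name,
    export=True,                                 # export automatically
    provider="TensorrtExecutionProvider",        # run through TensorRT
    provider_options={"trt_fp16_enable": True},  # allow FP16 kernels
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Inference (the inputs must be on the GPU)
inputs = tokenizer("TensorRT is fast!", return_tensors="pt").to("cuda")
outputs = trt_model(**inputs)
print(f"TensorRT output: {outputs.logits}")

# ============================================================
# TensorRT-LLM for large language models (conceptual example)
# ============================================================
"""
# TensorRT-LLM needs the model compiled into an engine first. The workflow is fairly
# involved and changes between releases, so follow the official documentation.
# Basic steps:
# 1. Install TensorRT-LLM (or clone its repository)
# 2. Convert the checkpoint and build an engine (the trtllm-build CLI in recent releases)
# 3. Load the engine for inference
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="./trt_engine")
# generate() works on token-id tensors, so tokenize the prompt first
# (llm_tokenizer: the tokenizer that belongs to the compiled LLM)
batch_input_ids = [llm_tokenizer("Hello world", return_tensors="pt").input_ids[0].int()]
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=100,
    end_id=llm_tokenizer.eos_token_id,
    pad_id=llm_tokenizer.pad_token_id,
)
"""
print("TensorRT suits scenarios with very strict latency requirements, such as real-time voice assistants.")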
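16.3 Wrapping Model Inference in a FastAPI Service
Exposing a model as a REST API is the standard route to production deployment. FastAPI is a high-performance Python web framework that fits model serving well.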
# File: fastapi_deployment.py
# Run with: uvicorn fastapi_deployment:app --reload --port 8000
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from typing import List, Optional
import time

# ============================================================
# Request and response schemas
# ============================================================
class SentimentRequest(BaseModel):
    text: str
    max_length: Optional[int] = 512

class SentimentResponse(BaseModel):
    label: str
    confidence: float
    processing_time_ms: float

# ============================================================
# Load the model (once, at startup)
# ============================================================
app = FastAPI(title="Sentiment Analysis API", description="Sentiment analysis service built on Hugging Face Transformers")
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading model {MODEL_NAME} onto {device}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)
model.eval()
print("Model loaded")

# ============================================================
# API endpoints
# ============================================================
@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.post("/predict", response_model=SentimentResponse)
async def predict(request: SentimentRequest):
    start_time = time.time()
    try:
        # Tokenize
        inputs = tokenizer(
            request.text,
            truncation=True,
            max_length=request.max_length,
            return_tensors="pt",
        ).to(device)
        # Inference
        with torch.no_grad():
            outputs = model(**inputs)
        # Post-process
        probs = torch.softmax(outputs.logits, dim=-1)
        confidence, pred_id = torch.max(probs, dim=-1)
        label = model.config.id2label[pred_id.item()]
        processing_time = (time.time() - start_time) * 1000
        return SentimentResponse(
            label=label,
            confidence=confidence.item(),
            processing_time_ms=round(processing_time, 2),
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# ============================================================
# Batch prediction endpoint
# ============================================================
class BatchSentimentRequest(BaseModel):
    texts: List[str]
    max_length: Optional[int] = 512

@app.post("/predict_batch")
async def predict_batch(request: BatchSentimentRequest):
    start_time = time.time()
    try:
        inputs = tokenizer(
            request.texts,
            truncation=True,
            padding=True,
            max_length=request.max_length,
            return_tensors="pt",
        ).to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        confidences, pred_ids = torch.max(probs, dim=-1)
        labels = [model.config.id2label[i.item()] for i in pred_ids]
        processing_time = (time.time() - start_time) * 1000
        return {
            "predictions": [
                {"label": lbl, "confidence": conf.item()}
                for lbl, conf in zip(labels, confidences)
            ],
            "total_processing_time_ms": round(processing_time, 2),
            "batch_size": len(request.texts),
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Launch command:
# uvicorn fastapi_deployment:app --host 0.0.0.0 --port 8000 --workers 1
# Note: for model serving, workers is usually set to 1 so that multiple processes do not each load the model
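Suggested improvements for production:
- Add request queuing and rate limiting (for example with slowapi or redis).
- Use asyncio.to_thread to run model inference in a thread pool so that it does not block the event loop.
- Multiple workers with gunicorn + uvicorn are possible, but every worker process loads its own copy of the model (the weights are not shared between processes), so memory use grows with the worker count.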
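The asyncio.to_thread suggestion can be sketched without FastAPI or a real model. In the minimal example below, `blocking_predict` is a hypothetical stand-in for a model forward pass (it just sleeps), and `predict_async` shows how an async endpoint could hand that work to a worker thread. It assumes Python 3.9 or newer, where asyncio.to_thread is available.

```python
import asyncio
import time


def blocking_predict(text: str) -> dict:
    """Stand-in for a blocking model call (hypothetical; a real one would run the model)."""
    time.sleep(0.05)  # simulate inference latency
    label = "POSITIVE" if "love" in text.lower() else "NEGATIVE"
    return {"label": label, "chars": len(text)}


async def predict_async(text: str) -> dict:
    """Run the blocking call in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(blocking_predict, text)


async def main() -> list:
    # Both requests are in flight at once; neither blocks the event loop.
    return await asyncio.gather(
        predict_async("I love this library"),
        predict_async("This is not great"),
    )


results = asyncio.run(main())
print(results)
```

Inside a FastAPI handler, the same pattern would be `result = await asyncio.to_thread(run_model, request.text)`, with `run_model` being whatever synchronous function wraps tokenization and the forward pass.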
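16.4 Deploying LLMs with Text Generation Inference (TGI)
TGI is Hugging Face's official LLM serving framework. It supports optimizations such as continuous batching, Flash Attention, and Paged Attention. The simplest way to deploy it is with Docker.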
# File: tgi_deployment_guide.py
"""
TGI deployment steps (conceptual):
1. Pull the TGI image
   docker pull ghcr.io/huggingface/text-generation-inference:latest
2. Start the server (single GPU)
   docker run --gpus all -p 8080:80 \
       -v /path/to/models:/data \
       ghcr.io/huggingface/text-generation-inference:latest \
       --model-id /data/Qwen/Qwen2.5-7B \
       --max-total-tokens 4096 \
       --max-batch-total-tokens 8192 \
       --max-concurrent-requests 128
3. Call it from a Python client
"""
import requests
import json

# TGI API endpoints
TGI_URL = "http://localhost:8080/generate"
TGI_STREAM_URL = "http://localhost:8080/generate_stream"

def tgi_generate(prompt, max_new_tokens=200, temperature=0.7, top_p=0.9):
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "do_sample": True,
            "return_full_text": False,
        },
    }
    response = requests.post(TGI_URL, json=payload)
    if response.status_code == 200:
        return response.json()["generated_text"]
    else:
        raise Exception(f"TGI error: {response.text}")

# Streaming responses (server-sent events) come from the /generate_stream endpoint
def tgi_generate_stream(prompt, max_new_tokens=200):
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    response = requests.post(TGI_STREAM_URL, json=payload, stream=True)
    for line in response.iter_lines():
        # Each event line looks like: data:{"token": {"text": ...}, ...}
        if line and line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            yield event["token"]["text"]

# Example
if __name__ == "__main__":
    prompt = "Write a short story about a robot learning to paint."
    result = tgi_generate(prompt, max_new_tokens=300)
    print(result)
TGI also exposes an OpenAI-compatible API endpoint, so existing OpenAI API clients can be pointed at it directly.
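As a minimal sketch of what such a call could look like, the hypothetical helper `build_chat_request` below assembles an OpenAI-style chat completion request with the standard library only. It assumes the TGI server from the Docker command above is listening on localhost:8080, that the chat route is /v1/chat/completions, and that the served model has a chat template; the "model" field value "tgi" is a placeholder name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, user_message: str, max_tokens: int = 200):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "tgi",  # placeholder name; the server answers with whatever model it loaded
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8080", "What is deep learning?")
print(request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Because the request shape matches the OpenAI chat format, an OpenAI client library configured with this base URL should work the same way.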
16.5 Edge Deployment: Transformers.js and In-Browser Inference
Transformers.js is a JavaScript library from Hugging Face that runs pretrained models directly in the browser, with no backend server.
<!-- File: transformersjs_demo.html -->
<!-- Runs a Transformers model in the browser -->
<!DOCTYPE html>
<html>
<head>
  <title>Transformers.js Sentiment Analysis</title>
  <script type="importmap">
  {
    "imports": {
      "@huggingface/transformers": "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0"
    }
  }
  </script>
</head>
<body>
  <h1>Transformers.js Sentiment Analysis</h1>
  <textarea id="input" rows="4" cols="50">I love Hugging Face Transformers.js!</textarea>
  <button id="analyze">Analyze sentiment</button>
  <div id="result"></div>
  <script type="module">
    import { pipeline } from '@huggingface/transformers';
    // Load the model (downloaded on first run, served from cache afterwards)
    const classifier = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
    document.getElementById('analyze').onclick = async () => {
      const text = document.getElementById('input').value;
      const result = await classifier(text);
      document.getElementById('result').innerHTML =
        `<p>Sentiment: ${result[0].label}</p>
         <p>Confidence: ${(result[0].score * 100).toFixed(2)}%</p>`;
    };
  </script>
</body>
</html>
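Strengths and limits of edge deployment:
- Strengths: no server cost, low latency, and user data stays on the device
- Limits: model size is constrained (typically < 500 MB), and inference speed depends on the client device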
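Part 6: Multimodal and Advanced Applications
17. Multimodal Models
17.1 Using and Fine-Tuning Vision Models (ViT, BEiT)
Hugging Face Transformers supports computer vision models as well as NLP. The Vision Transformer (ViT) splits an image into patches and then processes those patches the way a text model processes tokens.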
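To make the "patches as tokens" idea concrete, the short sketch below computes how many tokens a ViT sees for a given image and patch size. The values match the google/vit-base-patch16-224 naming (224-pixel images, 16-pixel patches); the extra classification token and the helper name `vit_sequence_length` are assumptions of this illustration.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16,
                        add_cls_token: bool = True) -> int:
    """Number of tokens a ViT processes for a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if add_cls_token else 0)


num_tokens = vit_sequence_length(224, 16)
values_per_patch = 16 * 16 * 3  # pixels per patch times RGB channels
print(f"tokens per image: {num_tokens}")
print(f"raw values per patch before projection: {values_per_patch}")
```

Halving the patch size quadruples the number of patches, which is why smaller patches cost noticeably more compute in the attention layers.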
# File: vit_image_classification.py
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
import torch

# 1. Load a pretrained ViT (pretrained on ImageNet-21k, fine-tuned on ImageNet-1k)
model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)

# 2. Load an image (here from a URL)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
image = Image.open(requests.get(url, stream=True).raw)

# 3. Preprocess
inputs = processor(images=image, return_tensors="pt")

# 4. Inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# 5. Read off the predicted class
predicted_class_idx = logits.argmax(-1).item()
print(f"Predicted class index: {predicted_class_idx}")
print(f"Predicted label: {model.config.id2label[predicted_class_idx]}")

# ============================================================
# Fine-tune ViT for a custom classification task
# ============================================================
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load an image dataset (cifar10 as a stand-in; replace it with your own data).
# Its "img" column is already decoded to PIL images, so no cast is needed.
dataset = load_dataset("cifar10", split="train[:100]")

# Preprocessing function
def preprocess(example):
    # Resize, normalize, and convert to model inputs
    inputs = processor(images=example["img"], return_tensors="pt")
    return {"pixel_values": inputs["pixel_values"][0], "labels": example["label"]}

processed_dataset = dataset.map(preprocess, remove_columns=["img", "label"])

# Replace the classification head (CIFAR-10 has 10 classes)
model = ViTForImageClassification.from_pretrained(model_name, num_labels=10, ignore_mismatched_sizes=True)
training_args = TrainingArguments(
    output_dir="./vit-cifar10",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_strategy="no",
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,
)
# trainer.train()
print("Fine-tuning is set up; uncomment trainer.train() to start training.")
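17.2 Loading and Running Vision-Language Models (CLIP, BLIP, LLaVA)
Vision-language models understand images and text together. They support tasks such as image-text retrieval, image captioning, and visual question answering.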
# File: vlm_demo.py
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
import torch

# ============================================================
# CLIP: image-text matching
# ============================================================
print("=== CLIP image-text similarity ===")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog", "a photo of two cats"]
inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = clip_model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=-1)
for text, prob in zip(texts, probs[0]):
    print(f"'{text}': {prob.item():.4f}")

# ============================================================
# BLIP: image captioning
# ============================================================
print("\n=== BLIP image captioning ===")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = blip_processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = blip_model.generate(**inputs, max_length=50)
caption = blip_processor.decode(out[0], skip_special_tokens=True)
print(f"Caption: {caption}")

# ============================================================
# LLaVA: visual question answering (requires transformers and accelerate)
# ============================================================
print("\n=== LLaVA visual question answering ===")
from transformers import LlavaProcessor, LlavaForConditionalGeneration

# Note: LLaVA is a large model and needs plenty of GPU memory
llava_model_id = "llava-hf/llava-1.5-7b-hf"
processor = LlavaProcessor.from_pretrained(llava_model_id)
llava_model = LlavaForConditionalGeneration.from_pretrained(llava_model_id, torch_dtype=torch.float16, device_map="auto")

# Conversation format
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What are these animals doing?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Move the inputs to the model's device and cast the pixel values to FP16 to match the weights
inputs = processor(images=image, text=prompt, return_tensors="pt").to(llava_model.device, torch.float16)
with torch.no_grad():
    output = llava_model.generate(**inputs, max_new_tokens=100)
# The decoded text contains the prompt followed by the model's answer
response = processor.decode(output[0], skip_special_tokens=True)
print(f"Answer: {response}")
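The CLIP scores used above are, in essence, scaled cosine similarities between an image embedding and each text embedding, turned into probabilities with a softmax. The minimal sketch below reproduces that computation on made-up 3-dimensional vectors in plain Python; the embeddings and the `logit_scale` value of 100.0 are assumptions for illustration, not values read from the real model.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clip_style_probs(image_emb, text_embs, logit_scale=100.0):
    """Cosine similarity between one image and several texts, scaled and softmaxed."""
    image = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(image, normalize(text)))
        for text in text_embs
    ]
    return softmax(logits)


# Toy embeddings (made up): the third text points in almost the same direction as the image.
image_embedding = [1.0, 0.2, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.3, 0.0]]
probs = clip_style_probs(image_embedding, text_embeddings)
best = probs.index(max(probs))
print([round(p, 4) for p in probs], "best match:", best)
```

The large scale factor sharpens the softmax, so even small differences in cosine similarity lead to a clear winner.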
17.3 Using Audio Models (Whisper, Wav2Vec2)
Transformers also supports audio models such as Whisper (speech recognition) and Wav2Vec2 (speech representation learning).
# File: audio_models_demo.py
from transformers import WhisperProcessor, WhisperForConditionalGeneration, Wav2Vec2Processor, Wav2Vec2ForCTC
import torchaudio
import requests
import torch
import os

# ============================================================
# Whisper: automatic speech recognition (ASR)
# ============================================================
print("=== Whisper speech recognition ===")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Download a sample recording (LibriSpeech sample).
# If this URL does not resolve, substitute any short speech recording you have access to.
url = "https://huggingface.co/datasets/librispeech_asr/resolve/main/clean/1/1/1/1-1-1.flac"
audio_path = "sample.flac"
response = requests.get(url)
response.raise_for_status()
with open(audio_path, "wb") as f:
    f.write(response.content)

# Load the audio, downmix to mono, and resample to the 16 kHz both models expect
waveform, sampling_rate = torchaudio.load(audio_path)
waveform = waveform.mean(dim=0)
if sampling_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sampling_rate, 16000)
    sampling_rate = 16000
speech_array = waveform.numpy()

# Preprocess
inputs = whisper_processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt")

# Inference
with torch.no_grad():
    predicted_ids = whisper_model.generate(**inputs)
transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# ============================================================
# Wav2Vec2: speech recognition (CTC)
# ============================================================
print("\n=== Wav2Vec2 speech recognition ===")
wav2vec_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
inputs = wav2vec_processor(speech_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = wav2vec_model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = wav2vec_processor.decode(predicted_ids[0])
print(f"Wav2Vec2 transcription: {transcription}")

# Clean up
os.remove(audio_path)
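The Wav2Vec2 example takes the argmax over the logits frame by frame and lets the processor decode the result. Behind that decode step is greedy CTC decoding: collapse consecutive repeats, then drop the blank token. The sketch below shows the rule in plain Python; the tiny vocabulary and the choice of id 0 as the blank are made up for this illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical): id 0 is the CTC blank.
vocab = {0: "<blank>", 1: "C", 2: "A", 3: "T"}

# One predicted id per audio frame, as argmax over the logits would give.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
text = ctc_greedy_decode(frames, vocab)
print(text)

# A blank between two identical ids keeps both characters.
doubled = ctc_greedy_decode([1, 0, 1], vocab)
print(doubled)
```

The blank token is what lets CTC emit genuinely doubled letters: without a blank in between, repeated ids are merged into one character.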
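18. Custom Models and Extensions
18.1 Understanding the Model Registration Mechanism in Transformers
Transformers uses an automatic registration mechanism so that any supported model can be loaded from a string name. The core pieces are the Auto classes (such as AutoConfig and AutoModel) and the mappings behind them, such as CONFIG_MAPPING and MODEL_MAPPING.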
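The idea behind such a registry can be sketched in a few lines of plain Python. This is a toy illustration of the pattern (a string maps to a config class, and a config class maps to a model class); it is not how Transformers implements it internally, and every name in it is hypothetical.

```python
class Registry:
    """A tiny key -> object registry (illustrative only)."""

    def __init__(self):
        self._items = {}

    def register(self, key, value, exist_ok=False):
        if key in self._items and not exist_ok:
            raise ValueError(f"{key!r} is already registered")
        self._items[key] = value

    def __getitem__(self, key):
        try:
            return self._items[key]
        except KeyError:
            raise KeyError(f"unknown key {key!r}") from None


CONFIGS = Registry()  # model_type string -> config class
MODELS = Registry()   # config class -> model class


class ToyConfig:
    model_type = "toy"

    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size


class ToyModel:
    def __init__(self, config):
        self.config = config


CONFIGS.register(ToyConfig.model_type, ToyConfig)
MODELS.register(ToyConfig, ToyModel)


def auto_config(model_type, **kwargs):
    """Look up the config class by its string name and instantiate it."""
    return CONFIGS[model_type](**kwargs)


def auto_model_from_config(config):
    """Pick the model class from the type of the config object."""
    return MODELS[type(config)](config)


model = auto_model_from_config(auto_config("toy", hidden_size=16))
print(type(model).__name__, model.config.hidden_size)
```

The two-step lookup is the point: the string selects a config class, and the config class selects the model, so adding a new architecture only needs two register calls.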
# File: model_registry_demo.py
from transformers import AutoConfig, AutoModel, CONFIG_MAPPING, MODEL_MAPPING

# 1. Inspect the registered configs and models
print("Registered config types:", len(list(CONFIG_MAPPING.keys())))
print("Registered model types:", len(MODEL_MAPPING))

# 2. Build a model from a config object
config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_config(config)  # randomly initialized, not the pretrained weights

# 3. Check which architecture was chosen
print(f"Model class for BERT: {type(model)}")

# 4. Registering a custom config class
from transformers import PretrainedConfig

class MyCustomConfig(PretrainedConfig):
    model_type = "my_custom_model"

    def __init__(self, hidden_size=768, num_layers=12, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_layers = num_layers

# Register the config so that AutoConfig can resolve "my_custom_model"
AutoConfig.register("my_custom_model", MyCustomConfig)
print("Custom config registered")
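18.2 Adding a Custom Model Architecture to Transformers
Adding a custom model takes three steps: define the config, define the model, and register both with the Auto classes.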
# File: add_custom_model.py
from transformers import PreTrainedModel, PretrainedConfig
import torch.nn as nn
import torch

# 1. Define the config
class CustomConfig(PretrainedConfig):
    model_type = "custom_tiny_model"

    def __init__(self, vocab_size=30522, hidden_size=128, num_hidden_layers=2,
                 num_attention_heads=2, initializer_range=0.02, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.initializer_range = initializer_range  # read by _init_weights below

# 2. Define the model (a simplified Transformer)
class CustomModel(PreTrainedModel):
    config_class = CustomConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        # Simplification: a single self-attention layer
        self.attention = nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.post_init()  # runs weight initialization

    def forward(self, input_ids, attention_mask=None):
        x = self.embedding(input_ids)
        # Hugging Face masks use 1 for real tokens; key_padding_mask expects True at padded positions
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        attn_output, _ = self.attention(x, x, x, key_padding_mask=key_padding_mask)
        logits = self.lm_head(attn_output)
        return logits

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)

# 3. Register with the Auto classes
from transformers import AutoConfig, AutoModel

AutoConfig.register("custom_tiny_model", CustomConfig)
AutoModel.register(CustomConfig, CustomModel)

# 4. Use the custom model
config = CustomConfig(vocab_size=1000, hidden_size=64)
model = AutoModel.from_config(config)
print("Custom model registered and instantiated")
print(f"Model structure: {model}")
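18.3 Developing New Models on the PreTrainedModel Base Class
PreTrainedModel provides infrastructure such as saving and loading weights and device mapping, so new models should inherit from it.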
# File: pretrained_model_inheritance.py
from transformers import PreTrainedModel, PretrainedConfig
import torch.nn as nn
import torch

class MyTransformerConfig(PretrainedConfig):
    model_type = "my_transformer"

    def __init__(self, d_model=256, nhead=8, num_layers=6, vocab_size=30000,
                 initializer_range=0.02, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.nhead = nhead
        self.num_layers = num_layers
        self.vocab_size = vocab_size
        self.initializer_range = initializer_range  # read by _init_weights below

class MyTransformerModel(PreTrainedModel):
    config_class = MyTransformerConfig
    base_model_prefix = "transformer"

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=config.d_model, nhead=config.nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=config.num_layers)
        self.lm_head = nn.Linear(config.d_model, config.vocab_size)
        # Initialize the weights
        self.post_init()

    def forward(self, input_ids, attention_mask=None):
        x = self.embedding(input_ids)
        # src_key_padding_mask has shape (batch_size, seq_len), with True marking positions to ignore
        if attention_mask is not None:
            # Hugging Face masks use 1 for real tokens, so invert the convention
            attention_mask = attention_mask == 0
        x = self.encoder(x, src_key_padding_mask=attention_mask)
        logits = self.lm_head(x)
        return logits

    def _init_weights(self, module):
        """Initialize weights (PreTrainedModel calls this for each submodule)."""
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)

# Use the custom model
config = MyTransformerConfig(d_model=128, nhead=4, num_layers=2, vocab_size=5000)
model = MyTransformerModel(config)

# Save and reload
model.save_pretrained("./my_transformer")
loaded_model = MyTransformerModel.from_pretrained("./my_transformer")
print("Custom PreTrainedModel saved and reloaded")
18.4 Integration with PyTorch Lightning, Keras, and Other Frameworks
Transformers models integrate smoothly with other training frameworks.
# File: integration_with_lightning.py
import pytorch_lightning as pl
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader, Dataset
import torch

# ============================================================
# Integration with PyTorch Lightning
# ============================================================
class LightningTransformer(pl.LightningModule):
    def __init__(self, model_name, num_labels=2):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, labels=None):
        # Passing labels lets the model compute the loss itself
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs.loss
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs.loss
        self.log('val_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)

# Prepare the data
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx]),
        }

# Sample data
texts = ["Great movie!", "Terrible film.", "Awesome performance", "Waste of time"]
labels = [1, 0, 1, 0]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = TextDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=2)

# Create the Lightning module
model = LightningTransformer("bert-base-uncased")

# Train (with the Lightning Trainer)
trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices=1, fast_dev_run=True)
# trainer.fit(model, dataloader)  # uncomment to run
print("PyTorch Lightning integration example finished")

# ============================================================
# Integration with TensorFlow/Keras
# (requires a Transformers release that includes TensorFlow support)
# ============================================================
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# TensorFlow version of the model
tf_model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Compile
tf_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

# Prepare a TF dataset (example)
# def tokenize_for_tf(texts, tokenizer):
#     return tokenizer(texts, padding=True, truncation=True, return_tensors='tf')
# tf_dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).map(...)
# tf_model.fit(tf_dataset, epochs=3)
print("TensorFlow/Keras integration example finished")