
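Hugging Face Transformers: From Beginner to Expert (series guide)

Part 1: Getting Started

Part 2: Core APIs in Depth

Part 3: Data Processing and Training

Part 4: Parameter-Efficient Fine-Tuning (PEFT) and Advanced Training

Part 5: Inference Optimization and Deployment & Part 6: Multimodal and Advanced Applications

Part 7: Hands-On Case Studies

Part 8: FAQ and Debugging Guide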


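Part 5: Inference Optimization and Deployment

15. Model Inference Optimization

15.1 Reduced-Precision Inference: torch.float16 / bfloat16

During inference, the forward pass tolerates lower numerical precision than training does. Loading the weights as float16 (16-bit floating point) or bfloat16 (Brain Float 16) instead of the default float32 (32-bit floating point) halves the memory taken by the weights. On GPUs with native half-precision support it usually also speeds up inference; a 1.5-2x gain is common, but the exact figure depends on the hardware and the model.

float16 and bfloat16 trade off differently. float16 gives up range: its largest finite value is 65504 and its smallest positive value is about 6e-8, so extreme values can overflow or underflow. bfloat16 keeps the same 8 exponent bits as float32 and therefore the same range, but it has fewer mantissa bits and so less precision. That trade suits most deep learning workloads.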
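The range-versus-precision trade can be checked without a GPU. The sketch below is an illustration, not part of the original article: it rounds Python floats to float16 with the standard library's `struct` module and emulates bfloat16 by keeping the top 16 bits of the float32 encoding. The helper names `to_float16` and `to_bfloat16` are made up for this example.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE 754 half precision; values beyond the float16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 for finite inputs: keep the top 16 bits of the float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the 16 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (0.1, 70000.0):
    print(value, "-> float16:", to_float16(value), "| bfloat16:", to_bfloat16(value))
```

For 70000.0 the float16 result overflows to inf while bfloat16 stays finite but coarse; for 0.1 float16 lands closer to the true value than bfloat16 does.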

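# File: half_precision_inference.py
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "The future of artificial intelligence is"


def benchmark(model, tokenizer, prompt, num_runs=10):
    """Return the average generate() latency in seconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.eval()

    # Warm-up, so one-time setup cost is not measured
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10)

    # CUDA kernels run asynchronously: synchronize around the timed region
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model.generate(**inputs, max_new_tokens=20)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / num_runs


# 1. FP32 inference (the default dtype)
print("=== FP32 inference ===")
model_fp32 = AutoModelForCausalLM.from_pretrained(model_name).to(device)
print(f"FP32: {benchmark(model_fp32, tokenizer, prompt) * 1000:.1f} ms per call")
del model_fp32  # free memory before loading the next variant

# 2. FP16 inference
print("\n=== FP16 inference ===")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # load the weights as FP16
    device_map="auto"
)
print(f"FP16: {benchmark(model_fp16, tokenizer, prompt) * 1000:.1f} ms per call")
del model_fp16

# 3. BF16 inference (needs an Ampere-or-newer GPU, compute capability >= 8.0)
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    print("\n=== BF16 inference ===")
    model_bf16 = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    print(f"BF16: {benchmark(model_bf16, tokenizer, prompt) * 1000:.1f} ms per call")

print("\nChoosing a precision:")
print("  - float16: general-purpose choice with clear speed and memory gains")
print("  - bfloat16: needs Ampere or newer (A100, RTX 3090/4090, etc.); numerically more stable")
print("  - float32: only for precision-sensitive cases (e.g. mathematical reasoning)")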

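Notes

  • model.to(device) moves the weights to the device but does not change their dtype. Set the precision with torch_dtype at load time; the integer input_ids only need to be on the same device as the model.
  • In generation tasks, the logits are computed in the model's dtype.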
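15.2 Model Quantization: bitsandbytes 8-bit / 4-bit

Quantization is a more aggressive way to reduce precision. Unlike a plain dtype change, it usually stores extra scale and zero-point values so that more information survives at low bit widths.

The bitsandbytes library provides efficient 8-bit and 4-bit quantization and is especially useful for running large models on consumer GPUs.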
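To make the scale and zero-point idea concrete, here is a minimal sketch of generic affine (asymmetric) 8-bit quantization in plain Python. It is an illustration only: bitsandbytes uses different schemes internally (for example NF4 for 4-bit), and the function names `quantize_affine` and `dequantize_affine` are made up for this example.

```python
def quantize_affine(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] using a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for a constant input
    zero_point = round(qmin - lo / scale)     # integer code that represents 0.0
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_affine(codes, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 2.0]
codes, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(codes, scale, zero_point)
print("codes:", codes)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The round-trip error is bounded by half the scale, which is why a narrower value range (a smaller scale) loses less information.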

# File: quantization_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "Qwen/Qwen2.5-7B"  # 7B model, about 14 GB in FP16

# 1. 8-bit quantization config
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,      # outlier threshold: dimensions above it are not quantized
    llm_int8_skip_modules=None,   # modules to leave unquantized
)

# 2. 4-bit quantization config (more aggressive, lower memory use)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print("=== Loading the 8-bit model ===")
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"8-bit model memory footprint: {model_8bit.get_memory_footprint() / 1e9:.2f} GB")

# Inference test (send the inputs to whichever device the model was placed on)
inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt").to(model_8bit.device)
with torch.no_grad():
    outputs = model_8bit.generate(**inputs, max_new_tokens=100, do_sample=True)
print("Generated text:", tokenizer.decode(outputs[0], skip_special_tokens=True))

# Free GPU memory
del model_8bit
torch.cuda.empty_cache()

print("\n=== Loading the 4-bit model ===")
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config_4bit,
    device_map="auto",
)
print(f"4-bit model memory footprint: {model_4bit.get_memory_footprint() / 1e9:.2f} GB")

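Quantization: precision versus performance (approximate figures for a 7B model)

Precision | Memory (7B) | Inference speed | Quality loss
FP16      | ~14 GB      | baseline        | none (reference)
8-bit     | ~8 GB       | ~90%            | <1%
4-bit     | ~4 GB       | ~85%            | 1-3%

For many applications 4-bit quantization is an excellent choice: with about 4 GB of weights, a 7B model can run on a single GPU with 6 GB of memory, provided the context length and batch size stay small enough to leave room for activations and the KV cache.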
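The memory column follows from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below reproduces that estimate. It counts weights only, so the measured footprint will be somewhat higher once quantization constants, activations, and the KV cache are included; the function name is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Estimate weight storage in GB (10^9 bytes), ignoring all runtime overhead."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(7e9, bits):.1f} GB")
```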

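15.3 Batched Inference and Dynamic Batching

Batching is one of the most effective ways to raise inference throughput. Merging several requests into one batch lets the GPU's parallel compute do more work per forward pass.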

# File: batch_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import torch

model_name = "Qwen/Qwen2.5-0.5B"
# Decoder-only models need left padding for batched generation,
# so that every prompt ends exactly where generation starts
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Prepare several inputs
prompts = [
    "What is the capital of France?",
    "Explain the theory of relativity.",
    "Write a haiku about spring.",
    "What are the benefits of exercise?",
] * 10  # 40 requests in total

# 1. Sequential inference
print("=== Sequential inference ===")
start = time.time()
single_results = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=50)
    single_results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
serial_time = time.time() - start
print(f"Sequential time: {serial_time:.2f} s")

# 2. Static batching (all sequences padded to the same length)
print("\n=== Static batching ===")
batch_inputs = tokenizer(
    prompts,
    padding=True,           # pad to the longest sequence in the batch
    truncation=True,
    max_length=512,
    return_tensors="pt"
).to(model.device)
print(f"Padded sequence length: {batch_inputs['input_ids'].shape[1]}")

start = time.time()
with torch.no_grad():
    batch_outputs = model.generate(
        **batch_inputs,
        max_new_tokens=50,
        pad_token_id=tokenizer.pad_token_id
    )
batch_time = time.time() - start
print(f"Batched time: {batch_time:.2f} s")
print(f"Speedup: {serial_time / batch_time:.2f}x")

# 3. Dynamic batching (simulated with a simple in-memory queue)
class DynamicBatcher:
    """Minimal dynamic batcher: collects requests until max_batch_size is reached."""
    def __init__(self, model, tokenizer, max_batch_size=8, max_wait_time=0.1):
        self.model = model
        self.tokenizer = tokenizer  # should use left padding for decoder-only models
        self.max_batch_size = max_batch_size
        # Stored but not enforced here; a real server would also flush a partial
        # batch once the oldest request has waited this long
        self.max_wait_time = max_wait_time
        self.queue = []
    
    def add_request(self, prompt, callback):
        self.queue.append((prompt, callback))
        if len(self.queue) >= self.max_batch_size:
            self.process_batch()
    
    def process_batch(self):
        if not self.queue:
            return
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        
        prompts = [item[0] for item in batch]
        callbacks = [item[1] for item in batch]
        
        inputs = self.tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_new_tokens=50, pad_token_id=self.tokenizer.pad_token_id
            )
        
        # Hand each decoded result back to the caller that submitted it
        decoded = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        for cb, text in zip(callbacks, decoded):
            cb(text)

print("\nIn production, dynamic batching is usually combined with a message queue; this is only a sketch.")
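The batcher above only flushes when the queue is full, so a lone request could wait forever. The following sketch shows one way to also honor `max_wait_time`. It is an assumption-laden illustration, not the article's implementation: `TimedBatcher`, its `poll` method, and the injected `run_batch` and `clock` callables are hypothetical, and the model call is replaced by a plain function so the logic can run anywhere.

```python
import time


class TimedBatcher:
    """Flush when the batch is full OR the oldest request has waited max_wait_time."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_time=0.1, clock=time.monotonic):
        self.run_batch = run_batch        # callable: list of prompts -> list of results
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.clock = clock                # injectable so the timing logic can be tested
        self.queue = []
        self.oldest = None                # arrival time of the oldest queued request

    def add_request(self, prompt, callback):
        if not self.queue:
            self.oldest = self.clock()
        self.queue.append((prompt, callback))
        if len(self.queue) >= self.max_batch_size:
            self.flush()

    def poll(self):
        """Call periodically (e.g. from a timer) to flush a partial batch that waited too long."""
        if self.queue and self.clock() - self.oldest >= self.max_wait_time:
            self.flush()

    def flush(self):
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        self.oldest = self.clock() if self.queue else None
        if not batch:
            return
        results = self.run_batch([prompt for prompt, _ in batch])
        for (_, callback), result in zip(batch, results):
            callback(result)


# Demo with a fake clock and an upper-casing stand-in for the model
now = [0.0]
results = []
batcher = TimedBatcher(lambda prompts: [p.upper() for p in prompts],
                       max_batch_size=3, max_wait_time=0.1, clock=lambda: now[0])
batcher.add_request("a", results.append)
batcher.add_request("b", results.append)
batcher.poll()      # too early: nothing is flushed yet
now[0] = 0.2
batcher.poll()      # the oldest request has waited long enough: partial batch runs
print(results)
```

In a real service `poll` would be driven by a background task or timer, and `run_batch` would wrap tokenization and `model.generate`.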

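Batching best practices

  • For LLM generation, output lengths differ widely between inputs. Set pad_token_id correctly and, for decoder-only models, pad on the left (padding_side="left").
  • Batch size is limited by GPU memory; start at 4 and increase gradually.
  • For variable-length sequences, prefer padding='longest' over padding to a fixed max_length to avoid wasted computation.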
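What left padding to the longest sequence produces can be shown on plain token-id lists. This is a toy sketch of the idea rather than the tokenizer's own code; `left_pad` is a made-up helper and the ids are arbitrary.

```python
def left_pad(batch, pad_id):
    """Pad every sequence on the left to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append([pad_id] * n_pad + list(ids))
        attention_mask.append([0] * n_pad + [1] * len(ids))  # 0 marks padding
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With the real tokens aligned on the right, each sequence's last position is its final prompt token, which is where a decoder-only model continues generating.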
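15.4 Compilation: torch.compile with Transformers

PyTorch 2.0 introduced torch.compile, which compiles the model's computation graph into optimized kernels. It can speed up inference noticeably, especially for transformer-style models.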

# File: torch_compile_demo.py
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
model.eval()


def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def average_latency(prompt, num_runs=10):
    """Average seconds per generate() call, with CUDA synchronized around the timing."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_runs):
        _ = generate(prompt)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / num_runs


prompt = "Explain the importance of recycling."

# 1. Baseline: the uncompiled (eager) model
for _ in range(3):
    _ = generate(prompt)  # warm-up
uncompiled_time = average_latency(prompt)

# 2. Compile the forward pass.
# Note: torch.compile(model) returns a wrapper whose generate() still calls the
# original eager forward, so compile model.forward in place instead.
# A static KV cache keeps tensor shapes fixed across decoding steps.
# Compilation helps pure forward inference (e.g. classification) the most.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead")  # modes: default, reduce-overhead, max-autotune

print("Warming up (the first calls trigger compilation)...")
for _ in range(3):
    _ = generate(prompt)
compiled_time = average_latency(prompt)

print(f"\nUncompiled average latency: {uncompiled_time * 1000:.2f} ms")
print(f"Compiled average latency: {compiled_time * 1000:.2f} ms")
print(f"Speedup: {uncompiled_time / compiled_time:.2f}x")

# Caveats
print("\nCaveats:")
print("1. The first compilation takes time (tens of seconds); it suits long-running deployments")
print("2. Gains are limited with dynamic shapes (variable-length sequences)")
print("3. Some models or operations may be incompatible; test before relying on it")

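The three torch.compile modes:

  • "default": suits most models, with moderate compile time.
  • "reduce-overhead": cuts per-call Python and kernel-launch overhead (it uses CUDA graphs), which helps small models and frequently repeated calls.
  • "max-autotune": searches over more kernel configurations; compile time is the longest, and runtime performance is usually the best.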
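15.5 Inference Engines Compared: vLLM, TGI, llama.cpp

For production-grade LLM serving, a dedicated inference engine usually outperforms plain Transformers. The table below compares three widely used engines. The throughput figures and ratings are rough guidance and vary with version, hardware, and workload; for example, llama.cpp's bundled server also exposes OpenAI-compatible endpoints.

Feature | vLLM | TGI (Text Generation Inference) | llama.cpp
Core optimization | PagedAttention | Flash Attention + continuous batching | 4-bit quantization + CPU/GPU hybrid execution
Throughput gain (indicative) | 10-20x | 5-10x | 2-5x
GPU support | best | best | supported, but limited
CPU inference | not supported | not supported | best
Quantization support | GPTQ, AWQ | GPTQ, AWQ | GGUF
Deployment difficulty | moderate | moderate | easy
API compatibility | OpenAI API | OpenAI API | custom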
# 文件名:inference_engines_comparison.py
# 本文件展示如何使用 vLLM 和 TGI,需要额外安装

# ============================================================
# vLLM 示例(需要安装:pip install vllm)
# ============================================================
"""
from vllm import LLM, SamplingParams

# 加载模型
llm = LLM(model="Qwen/Qwen2.5-7B")

# 配置采样参数
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# 批量推理
prompts = ["Hello, how are you?", "What is the capital of France?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}\n")
"""

# ============================================================
# TGI 部署(需要 Docker)
# ============================================================
"""
# 启动 TGI 服务
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-7B \
  --max-total-tokens 4096

# 调用 API
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain quantum computing.",
        "parameters": {"max_new_tokens": 200, "temperature": 0.7}
    }
)
print(response.json()["generated_text"])
"""

# ============================================================
# llama.cpp 示例(适合 CPU 或混合推理)
# ============================================================
"""
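# 1. Convert the model to GGUF with llama.cpp's conversion script
#    (convert_hf_to_gguf.py in recent versions; older releases shipped convert.py)
# 2. Load it with llama-cpp-python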
from llama_cpp import Llama

llm = Llama(model_path="./qwen-7b-q4_K_M.gguf", n_ctx=4096, n_gpu_layers=35)  # 将部分层放到 GPU
output = llm("Q: What is AI? A:", max_tokens=200, stop=["Q:", "\n"], echo=False)
print(output["choices"][0]["text"])
"""

print("推荐选择:")
print("  - 高吞吐 GPU 服务(多用户): vLLM")
print("  - Hugging Face 官方部署: TGI")
print("  - 消费级 GPU 或纯 CPU 推理: llama.cpp")

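16. Model Export and Serving

16.1 Exporting to ONNX and Running with ONNX Runtime

ONNX (Open Neural Network Exchange) is an open model format that lets a model run across frameworks and hardware. ONNX Runtime is a high-performance inference engine developed by Microsoft.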

# File: onnx_export_inference.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# ============================================================
# Method 1: export to ONNX with the Optimum library (recommended)
# ============================================================

# 导出模型
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_name,
    export=True,                           # 自动导出为 ONNX
    provider="CPUExecutionProvider",       # 推理后端
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# 使用 ONNX Runtime 推理
inputs = tokenizer("I love Hugging Face!", return_tensors="pt")
outputs = ort_model(**inputs)
print(f"ONNX Runtime 输出: {outputs.logits}")

# 保存 ONNX 模型
ort_model.save_pretrained("./onnx_model")
tokenizer.save_pretrained("./onnx_model")

# ============================================================
# 方法2:使用 torch.onnx 手动导出
# ============================================================
import torch.onnx

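model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.config.return_dict = False  # return a plain tuple, which the ONNX tracer handles more reliably
model.eval()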

# 创建示例输入
dummy_input = tokenizer("Example text", return_tensors="pt")
input_names = ["input_ids", "attention_mask"]
output_names = ["logits"]

torch.onnx.export(
    model,
    (dummy_input["input_ids"], dummy_input["attention_mask"]),
    "model.onnx",
    input_names=input_names,
    output_names=output_names,
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "logits": {0: "batch_size"}
    },
    opset_version=14,
)
print("手动导出完成: model.onnx")

# ============================================================
# 使用 ONNX Runtime 进行推理(不依赖 Transformers)
# ============================================================
import onnxruntime as ort

# 加载 ONNX 模型
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# 准备输入
input_ids = dummy_input["input_ids"].numpy()
attention_mask = dummy_input["attention_mask"].numpy()

# 推理
outputs = session.run(["logits"], {"input_ids": input_ids, "attention_mask": attention_mask})
print(f"ONNX Runtime 原生推理: {outputs[0]}")

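Advantages of ONNX

  • Cross-framework deployment (both PyTorch and TensorFlow models can be converted to ONNX)
  • Hardware acceleration support (CUDA, TensorRT, OpenVINO)
  • Fewer inference dependencies (the full Transformers library is not needed)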
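16.2 Exporting to TensorRT and Running with TensorRT-LLM

TensorRT is NVIDIA's high-performance deep learning inference SDK. It applies deep optimizations such as layer fusion and precision calibration. TensorRT-LLM is the variant built specifically for LLMs.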

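# File: tensorrt_deployment.py
# Requires: pip install tensorrt tensorrt-llm (NVIDIA GPU needed)

# ============================================================
# Export to TensorRT through Optimum (simplified flow)
# ============================================================
# Caution: check the TensorRTModel class and the export/optimize/fp16 arguments
# against the optimum-nvidia version you install before relying on this snippet;
# the package's documented entry points focus on causal language models.
from optimum.nvidia import TensorRTModel
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Export and optimize into a TensorRT engine
trt_model = TensorRTModel.from_pretrained(
    model_name,
    export=True,                    # export automatically
    optimize=True,                  # apply optimizations
    fp16=True,                      # use FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)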

# 推理
inputs = tokenizer("TensorRT is fast!", return_tensors="pt")
outputs = trt_model(**inputs)
print(f"TensorRT 输出: {outputs.logits}")

# ============================================================
# TensorRT-LLM 用于大语言模型(概念示例)
# ============================================================
"""
# TensorRT-LLM 需要编译模型,流程较复杂,参考官方文档
# 基本步骤:
# 1. 克隆 TensorRT-LLM 仓库
# 2. 使用 build.py 脚本编译模型
# 3. 加载引擎进行推理

from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./trt_engine")
outputs = runner.generate(["Hello world"], max_new_tokens=100)
"""

print("TensorRT 适用于对延迟要求极高的场景,例如实时语音助手。")
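16.3 Wrapping Model Inference in a FastAPI Service

Exposing a model as a REST API is the standard route to production deployment. FastAPI is a high-performance Python web framework that suits model serving well.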

# 文件名:fastapi_deployment.py
# 运行方式: uvicorn fastapi_deployment:app --reload --port 8000

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from typing import List, Optional
import time

# ============================================================
# 定义请求和响应模型
# ============================================================
class SentimentRequest(BaseModel):
    text: str
    max_length: Optional[int] = 512

class SentimentResponse(BaseModel):
    label: str
    confidence: float
    processing_time_ms: float

# ============================================================
# 加载模型(启动时加载一次)
# ============================================================
app = FastAPI(title="Sentiment Analysis API", description="Hugging Face Transformers 情感分析服务")

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading model {MODEL_NAME} onto {device}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)
model.eval()
print("模型加载完成")

# ============================================================
# API 端点
# ============================================================
@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.post("/predict", response_model=SentimentResponse)
def predict(request: SentimentRequest):  # plain def: FastAPI runs it in a thread pool, so inference does not block the event loop
    start_time = time.time()
    
    try:
        # 分词
        inputs = tokenizer(
            request.text,
            truncation=True,
            max_length=request.max_length,
            return_tensors="pt"
        ).to(device)
        
        # 推理
        with torch.no_grad():
            outputs = model(**inputs)
        
        # 后处理
        probs = torch.softmax(outputs.logits, dim=-1)
        confidence, pred_id = torch.max(probs, dim=-1)
        label = model.config.id2label[pred_id.item()]
        
        processing_time = (time.time() - start_time) * 1000
        
        return SentimentResponse(
            label=label,
            confidence=confidence.item(),
            processing_time_ms=round(processing_time, 2)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# ============================================================
# 批量预测端点
# ============================================================
class BatchSentimentRequest(BaseModel):
    texts: List[str]
    max_length: Optional[int] = 512

@app.post("/predict_batch")
def predict_batch(request: BatchSentimentRequest):  # plain def: runs in FastAPI's thread pool
    start_time = time.time()
    
    try:
        inputs = tokenizer(
            request.texts,
            truncation=True,
            padding=True,
            max_length=request.max_length,
            return_tensors="pt"
        ).to(device)
        
        with torch.no_grad():
            outputs = model(**inputs)
        
        probs = torch.softmax(outputs.logits, dim=-1)
        confidences, pred_ids = torch.max(probs, dim=-1)
        labels = [model.config.id2label[i.item()] for i in pred_ids]
        
        processing_time = (time.time() - start_time) * 1000
        
        return {
            "predictions": [
                {"label": lbl, "confidence": conf.item()}
                for lbl, conf in zip(labels, confidences)
            ],
            "total_processing_time_ms": round(processing_time, 2),
            "batch_size": len(request.texts)
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# 启动命令:
# uvicorn fastapi_deployment:app --host 0.0.0.0 --port 8000 --workers 1
# 注意: 对于模型服务,workers 通常设为 1 以避免多进程重复加载模型

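Suggested improvements for production

  • Add a request queue and rate limiting (for example with slowapi or redis).
  • For async endpoints, move model inference to a worker thread with asyncio.to_thread (or declare the endpoint with plain def) so that it does not block the event loop.
  • Multiple workers with gunicorn + uvicorn are possible, but each worker loads its own copy of the model (the weights are not shared between processes), so memory use grows with the worker count.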
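The asyncio.to_thread suggestion can be tried without a web framework. In this minimal sketch, `blocking_inference` is a hypothetical stand-in for the real `model(**inputs)` call; it only sleeps and upper-cases its input. It requires Python 3.9 or newer.

```python
import asyncio
import time


def blocking_inference(text: str) -> str:
    """Stand-in for a blocking model call."""
    time.sleep(0.05)
    return text.upper()


async def predict(text: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(blocking_inference, text)


async def main():
    # Both requests are in flight at once; gather returns results in call order
    return await asyncio.gather(predict("good"), predict("bad"))


results = asyncio.run(main())
print(results)  # ['GOOD', 'BAD']
```

Inside a FastAPI app the same `await asyncio.to_thread(...)` line would sit in the async endpoint body.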
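16.4 Deploying LLMs with Text Generation Inference (TGI)

TGI is Hugging Face's official LLM serving framework. It supports continuous batching, Flash Attention, Paged Attention, and other optimizations. The simplest way to deploy it is with Docker.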

# 文件名:tgi_deployment_guide.py
"""
TGI 部署步骤(概念说明):

1. 拉取 TGI 镜像
   docker pull ghcr.io/huggingface/text-generation-inference:latest

2. 启动服务(单 GPU)
   docker run --gpus all -p 8080:80 \
     -v /path/to/models:/data \
     ghcr.io/huggingface/text-generation-inference:latest \
     --model-id /data/Qwen/Qwen2.5-7B \
     --max-total-tokens 4096 \
     --max-batch-total-tokens 8192 \
     --max-concurrent-requests 128

3. 使用 Python 客户端调用
"""
import requests
import json

# TGI API 端点
TGI_URL = "http://localhost:8080/generate"

def tgi_generate(prompt, max_new_tokens=200, temperature=0.7, top_p=0.9):
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "do_sample": True,
            "return_full_text": False,
        }
    }
    response = requests.post(TGI_URL, json=payload)
    if response.status_code == 200:
        return response.json()["generated_text"]
    else:
        raise Exception(f"TGI error: {response.text}")

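# Streaming responses (server-sent events) come from the /generate_stream endpoint
TGI_STREAM_URL = "http://localhost:8080/generate_stream"

def tgi_generate_stream(prompt, max_new_tokens=200):
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens}
    }
    response = requests.post(TGI_STREAM_URL, json=payload, stream=True)
    for line in response.iter_lines():
        # Each event line has the form: data:{"token": {"text": ...}, ...}
        if line.startswith(b"data:"):
            yield json.loads(line[len(b"data:"):])["token"]["text"]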

# 示例
if __name__ == "__main__":
    prompt = "Write a short story about a robot learning to paint."
    result = tgi_generate(prompt, max_new_tokens=300)
    print(result)

TGI also provides an OpenAI-compatible API endpoint, so an existing OpenAI API client can be pointed at it directly.
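As a rough sketch of what a call to the OpenAI-compatible route looks like, the code below builds a chat-completions request with the standard library only. The base URL matches the Docker example above; the `"model": "tgi"` value and the helper names are assumptions for illustration, and `chat` needs a running TGI server, so only the request builder is exercised here.

```python
import json
import urllib.request


def build_chat_request(base_url, user_message, max_tokens=200):
    """Build a POST request for the OpenAI-style /v1/chat/completions route."""
    payload = {
        "model": "tgi",  # assumed placeholder; the server serves the model it was started with
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, user_message):
    """Send the request and return the assistant's reply (requires a live server)."""
    with urllib.request.urlopen(build_chat_request(base_url, user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("http://localhost:8080", "Write a haiku about GPUs.")
print(request.full_url)
```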

16.5 Edge Deployment: Transformers.js and In-Browser Inference

Transformers.js is Hugging Face's JavaScript library for running pretrained models directly in the browser, with no backend server.

<!-- 文件名:transformersjs_demo.html -->
<!-- 在浏览器中运行 Transformers 模型 -->

<!DOCTYPE html>
<html>
<head>
    <title>Transformers.js 情感分析</title>
    <script type="importmap">
        {
            "imports": {
                "@huggingface/transformers": "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.8.0"
            }
        }
    </script>
</head>
<body>
    <h1>Transformers.js 情感分析</h1>
    <textarea id="input" rows="4" cols="50">I love Hugging Face Transformers.js!</textarea>
    <button id="analyze">分析情感</button>
    <div id="result"></div>

    <script type="module">
        import { pipeline } from '@huggingface/transformers';

        // 加载模型(首次运行时会下载,后续使用缓存)
        const classifier = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
        
        document.getElementById('analyze').onclick = async () => {
            const text = document.getElementById('input').value;
            const result = await classifier(text);
            document.getElementById('result').innerHTML = 
                `<p>情感: ${result[0].label}</p>
                 <p>置信度: ${(result[0].score * 100).toFixed(2)}%</p>`;
        };
    </script>
</body>
</html>

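Strengths and limits of edge deployment

  • Strengths: no server cost, low latency, and user data stays on the device
  • Limits: model size is constrained (typically < 500MB), and inference speed depends on the client device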

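Part 6: Multimodal and Advanced Applications

17. Multimodal Models

17.1 Using and Fine-Tuning Vision Models (ViT, BEiT)

Hugging Face Transformers supports computer vision models as well as NLP. A Vision Transformer (ViT) splits an image into patches and then processes those patches the same way a text model processes tokens.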
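The patch-to-token analogy can be made concrete with a little arithmetic. For a 224x224 input and 16x16 patches (the sizes in the `google/vit-base-patch16-224` name), the sketch below counts the tokens the encoder sees; the function name is made up for this example.

```python
def vit_sequence_length(image_size=224, patch_size=16, add_cls_token=True):
    """Number of tokens a ViT encoder processes for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if add_cls_token else 0)


print(vit_sequence_length())  # 14 * 14 patches + 1 [CLS] token = 197
```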

# 文件名:vit_image_classification.py
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
import torch

# 1. 加载预训练的 ViT 模型(在 ImageNet-21k 上预训练,ImageNet-1k 上微调)
model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)

# 2. 加载图像(示例:从 URL)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # 两只猫
image = Image.open(requests.get(url, stream=True).raw)

# 3. 预处理
inputs = processor(images=image, return_tensors="pt")

# 4. 推理
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# 5. 获取预测类别
predicted_class_idx = logits.argmax(-1).item()
print(f"预测类别索引: {predicted_class_idx}")
print(f"预测标签: {model.config.id2label[predicted_class_idx]}")

# ============================================================
# 微调 ViT 用于自定义分类任务
# ============================================================
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load an image dataset (CIFAR-10 as a stand-in; replace it with your own data)
dataset = load_dataset("cifar10", split="train[:100]")
# The "img" column already uses the datasets Image feature, so it decodes to PIL images

# Preprocessing function
def preprocess(example):
    # Resize, normalize, and convert the image into the model's input tensor
    inputs = processor(images=example["img"], return_tensors="pt")
    return {"pixel_values": inputs["pixel_values"].squeeze(0), "labels": example["label"]}

processed_dataset = dataset.map(preprocess, remove_columns=["img", "label"])

# Replace the classification head (CIFAR-10 has 10 classes)
model = ViTForImageClassification.from_pretrained(model_name, num_labels=10, ignore_mismatched_sizes=True)

# Evaluation is off by default, so no evaluation strategy argument is needed
training_args = TrainingArguments(
    output_dir="./vit-cifar10",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_strategy="no",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset,
)

# trainer.train()
print("微调准备就绪,取消注释以开始训练。")
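17.2 Loading and Running Vision-Language Models (CLIP, BLIP, LLaVA)

Vision-language models understand images and text together. They support tasks such as image-text retrieval, image captioning, and visual question answering.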
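Image-text matching in a CLIP-style model comes down to comparing embedding vectors. The toy sketch below illustrates the scoring step only: it takes made-up 2-D embeddings, scales their cosine similarities, and applies a softmax. The `logit_scale=100.0` default and all vectors are assumptions for illustration; a real model produces the embeddings and learns its own scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(image_emb, text_embs, logit_scale=100.0):
    """Score one image embedding against several text embeddings."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    return softmax(logits)


image_emb = [1.0, 0.0]                      # toy image embedding
text_embs = [[0.9, 0.1], [0.0, 1.0]]        # toy text embeddings
probs = match_probabilities(image_emb, text_embs)
print(probs)
```

The text whose embedding points in nearly the same direction as the image embedding receives almost all of the probability mass.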

# 文件名:vlm_demo.py
from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
import torch

# ============================================================
# CLIP: 图像-文本匹配
# ============================================================
print("=== CLIP 图像-文本相似度 ===")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

texts = ["a photo of a cat", "a photo of a dog", "a photo of two cats"]
inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image  # 图像-文本相似度分数
    probs = logits_per_image.softmax(dim=-1)

for text, prob in zip(texts, probs[0]):
    print(f"'{text}': {prob.item():.4f}")

# ============================================================
# BLIP: 图像描述生成
# ============================================================
print("\n=== BLIP 图像描述 ===")
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

inputs = blip_processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = blip_model.generate(**inputs, max_length=50)
caption = blip_processor.decode(out[0], skip_special_tokens=True)
print(f"描述: {caption}")

# ============================================================
# LLaVA: 视觉问答(需要安装 transformers 和 accelerate)
# ============================================================
print("\n=== LLaVA 视觉问答 ===")
from transformers import LlavaProcessor, LlavaForConditionalGeneration

# 注意:LLaVA 模型较大,需要足够显存
llava_model_id = "llava-hf/llava-1.5-7b-hf"
processor = LlavaProcessor.from_pretrained(llava_model_id)
llava_model = LlavaForConditionalGeneration.from_pretrained(llava_model_id, torch_dtype=torch.float16, device_map="auto")

# 对话格式
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What are these animals doing?"}
    ]}
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(llava_model.device, torch.float16)  # cast pixel values to match the FP16 weights

with torch.no_grad():
    output = llava_model.generate(**inputs, max_new_tokens=100)
response = processor.decode(output[0], skip_special_tokens=True)
print(f"回答: {response}")
17.3 Using Audio Models (Whisper, Wav2Vec2)

Transformers also supports audio models such as Whisper (speech recognition) and Wav2Vec2 (speech representation learning).

# File: audio_models_demo.py
from transformers import WhisperProcessor, WhisperForConditionalGeneration, Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# Load a short LibriSpeech sample from a small public test dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
speech_array = sample["array"]
sampling_rate = sample["sampling_rate"]  # both models below expect 16000 Hz

# ============================================================
# Whisper: automatic speech recognition (ASR)
# ============================================================
print("=== Whisper speech recognition ===")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Preprocess: convert the waveform into log-mel input features
inputs = whisper_processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt")

# Inference
with torch.no_grad():
    predicted_ids = whisper_model.generate(**inputs)
transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# ============================================================
# Wav2Vec2: speech recognition with CTC
# ============================================================
print("\n=== Wav2Vec2 speech recognition ===")
wav2vec_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

inputs = wav2vec_processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = wav2vec_model(**inputs).logits
# Greedy CTC decoding: take the most likely token per frame, then let the
# processor collapse repeats and drop blank tokens
predicted_ids = torch.argmax(logits, dim=-1)
transcription = wav2vec_processor.decode(predicted_ids[0])
print(f"Wav2Vec2 transcription: {transcription}")
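The Wav2Vec2 path takes an argmax per audio frame and then decodes with CTC rules: merge consecutive repeats, then drop the blank token. The sketch below shows that greedy collapse on a toy vocabulary; the ids, the vocabulary, and the function name are made up for illustration and are not the model's real token table.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then remove blanks (greedy CTC decoding)."""
    chars = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]   # per-frame argmax ids
print(ctc_greedy_decode(frames, vocab))       # hello
```

The blank between the two runs of `3` is what lets the decoder emit a double "l" instead of merging them into one.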

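18. Custom Models and Extensions

18.1 How Model Registration Works in Transformers

Transformers uses a registration mechanism so that any supported model can be loaded from a string name. The core pieces are AutoModel and CONFIG_MAPPING.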
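Stripped of library details, a registry of this kind is a dictionary from a type name to a class, plus a factory that looks the class up. The toy sketch below shows that pattern in plain Python. `ToyAutoModel` and `ToyBert` are hypothetical names, and the real library is more involved (it maps config classes to model classes and loads them lazily).

```python
class ToyAutoModel:
    """Toy registry: maps a model_type string to the class that implements it."""
    _registry = {}

    @classmethod
    def register(cls, model_type, model_class):
        if model_type in cls._registry:
            raise ValueError(f"{model_type!r} is already registered")
        cls._registry[model_type] = model_class

    @classmethod
    def from_config(cls, config):
        try:
            model_class = cls._registry[config["model_type"]]
        except KeyError:
            raise ValueError(f"Unknown model type: {config.get('model_type')!r}") from None
        return model_class(config)


class ToyBert:
    def __init__(self, config):
        self.config = config


ToyAutoModel.register("toy_bert", ToyBert)
model = ToyAutoModel.from_config({"model_type": "toy_bert", "hidden_size": 64})
print(type(model).__name__)  # ToyBert
```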

# 文件名:model_registry_demo.py
from transformers import AutoConfig, AutoModel, CONFIG_MAPPING, MODEL_MAPPING

# 1. 查看注册的配置和模型
print("已注册的配置类型数量:", len(CONFIG_MAPPING))
print("已注册的模型类型数量:", len(MODEL_MAPPING))

# 2. 手动通过配置类加载模型
config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_config(config)  # 随机初始化,不是预训练权重

# 3. 查看模型的默认架构
print(f"BERT 的模型类: {type(model)}")

# 4. 自定义配置类的注册机制
from transformers import PretrainedConfig

class MyCustomConfig(PretrainedConfig):
    model_type = "my_custom_model"
    
    def __init__(self, hidden_size=768, num_layers=12, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_layers = num_layers

# Register the config through the public API so AutoConfig can resolve "my_custom_model"
AutoConfig.register("my_custom_model", MyCustomConfig)

print("Custom config registered")
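18.2 Adding a Custom Model Architecture to Transformers

Adding a custom model takes three steps: define the config, define the model, and register both with the Auto classes.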

# 文件名:add_custom_model.py
from transformers import PreTrainedModel, PretrainedConfig
import torch.nn as nn
import torch

# 1. Define the config
class CustomConfig(PretrainedConfig):
    model_type = "custom_tiny_model"
    
    def __init__(self, vocab_size=30522, hidden_size=128, num_hidden_layers=2, num_attention_heads=2,
                 initializer_range=0.02, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.initializer_range = initializer_range  # read by _init_weights in the model

# 2. 定义模型(简化的 Transformer)
class CustomModel(PreTrainedModel):
    config_class = CustomConfig
    
    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        # 简化:使用单层自注意力
        self.attention = nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
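        self.post_init()  # runs weight initialization through _init_weights
    
    def forward(self, input_ids, attention_mask=None):
        x = self.embedding(input_ids)
        # attention_mask uses 1 for real tokens, while key_padding_mask expects
        # True at the padded positions, so invert it
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        attn_output, _ = self.attention(x, x, x, key_padding_mask=key_padding_mask)
        logits = self.lm_head(attn_output)
        return logits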
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)

# 3. Register with the Auto classes
from transformers import AutoConfig, AutoModel

AutoConfig.register("custom_tiny_model", CustomConfig)
AutoModel.register(CustomConfig, CustomModel)

# 4. Use the custom model
config = CustomConfig(vocab_size=1000, hidden_size=64)
model = AutoModel.from_config(config)

print("Custom model registered and instantiated")
print(f"Model structure: {model}")
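
To make the registration step less opaque, here is a minimal pure-Python sketch of the lookup idea behind the Auto classes: one table keyed by the model_type string and another keyed by the config class. ToyRegistry, ToyConfig, and ToyModel are hypothetical names used only for illustration; this is a simplified stand-in under those assumptions, not the Transformers implementation.

```python
# Simplified stand-in for the Auto-class lookup; every name here is hypothetical.
class ToyConfig:
    model_type = "toy_model"

    def __init__(self, hidden_size=64):
        self.hidden_size = hidden_size


class ToyModel:
    def __init__(self, config):
        self.config = config


class ToyRegistry:
    def __init__(self):
        self._configs = {}  # model_type string -> config class
        self._models = {}   # config class -> model class

    def register_config(self, model_type, config_cls):
        # Reject a mismatch between the registry key and the class attribute
        if config_cls.model_type != model_type:
            raise ValueError(f"{config_cls.__name__}.model_type != {model_type!r}")
        self._configs[model_type] = config_cls

    def register_model(self, config_cls, model_cls):
        self._models[config_cls] = model_cls

    def config_for(self, model_type, **kwargs):
        # Resolve the string key to a config class, then instantiate it
        return self._configs[model_type](**kwargs)

    def model_from_config(self, config):
        # Dispatch on the concrete config type
        return self._models[type(config)](config)


registry = ToyRegistry()
registry.register_config("toy_model", ToyConfig)
registry.register_model(ToyConfig, ToyModel)

toy_config = registry.config_for("toy_model", hidden_size=32)
toy_model = registry.model_from_config(toy_config)
print(type(toy_model).__name__, toy_model.config.hidden_size)
```

The two-table design is why both registration calls are needed: the string key lets a saved config be resolved to a class, and the config class in turn selects the model class.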
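18.3 Developing a New Model on the PreTrainedModel Base Class

PreTrainedModel supplies the shared infrastructure (saving and loading weights through save_pretrained / from_pretrained, device mapping, and so on), so a new model should inherit from it.

# File: pretrained_model_inheritance.py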
from transformers import PreTrainedModel, PretrainedConfig
import torch.nn as nn
import torch

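class MyTransformerConfig(PretrainedConfig):
    model_type = "my_transformer"
    
    def __init__(self, d_model=256, nhead=8, num_layers=6, vocab_size=30000,
                 initializer_range=0.02, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.nhead = nhead
        self.num_layers = num_layers
        self.vocab_size = vocab_size
        # Standard deviation read by MyTransformerModel._init_weights; it must be set explicitly here
        self.initializer_range = initializer_range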

class MyTransformerModel(PreTrainedModel):
    config_class = MyTransformerConfig
    base_model_prefix = "transformer"
    
    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=config.d_model, nhead=config.nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=config.num_layers)
        self.lm_head = nn.Linear(config.d_model, config.vocab_size)
        
        # Run weight initialization (post_init() applies _init_weights to the submodules)
        self.post_init()
    
    def forward(self, input_ids, attention_mask=None):
        x = self.embedding(input_ids)
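        # nn.TransformerEncoder expects src_key_padding_mask with shape (batch_size, seq_len),
        # where True marks a position to ignore
        if attention_mask is not None:
            # Hugging Face masks use 1 for real tokens, so invert the mask
            attention_mask = attention_mask == 0  # True = padding position to mask out
        x = self.encoder(x, src_key_padding_mask=attention_mask)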
        logits = self.lm_head(x)
        return logits
    
    def _init_weights(self, module):
        """Initialize weights (PreTrainedModel calls this method during post_init())."""
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)

# Use the custom model
config = MyTransformerConfig(d_model=128, nhead=4, num_layers=2, vocab_size=5000)
model = MyTransformerModel(config)

# Save and reload
model.save_pretrained("./my_transformer")
loaded_model = MyTransformerModel.from_pretrained("./my_transformer")
print("Custom PreTrainedModel saved and reloaded")
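
The mask inversion in forward is easy to get backwards, so here is a small pure-Python sketch of the convention with no PyTorch dependency. It assumes the usual tokenizer output (1 for real tokens, 0 for padding); to_key_padding_mask is a hypothetical helper written for this illustration, not a library function.

```python
def to_key_padding_mask(attention_mask):
    """Convert a 1/0 attention mask (1 = real token) into a boolean
    padding mask (True = position to ignore), row by row."""
    return [[value == 0 for value in row] for row in attention_mask]


# Two sequences padded to length 5; the first has two padding positions
attention_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
padding_mask = to_key_padding_mask(attention_mask)
for row in padding_mask:
    print(row)
```

With tensors, the same conversion is the single expression attention_mask == 0, which is what forward applies before calling the encoder.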
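18.4 Integrating with PyTorch Lightning, Keras, and Other Frameworks

Transformers models plug into other training frameworks with little glue code: the PyTorch classes are ordinary torch.nn.Module subclasses, and the TF* classes are Keras models.

# File: integration_with_lightning.py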
import pytorch_lightning as pl
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader, Dataset
import torch

# ============================================================
# Integration with PyTorch Lightning
# ============================================================
class LightningTransformer(pl.LightningModule):
    def __init__(self, model_name, num_labels=2):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
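    def forward(self, input_ids, attention_mask, labels=None):
        # Pass labels through so the model computes outputs.loss for the steps below
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)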
    
    def training_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs.loss
        self.log('train_loss', loss)
        return loss
    
    def validation_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs.loss
        self.log('val_loss', loss)
        return loss
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)

# Prepare the data
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx])
        }

# Sample data
texts = ["Great movie!", "Terrible film.", "Awesome performance", "Waste of time"]
labels = [1, 0, 1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = TextDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=2)

# Create the Lightning model
model = LightningTransformer("bert-base-uncased")

# Train with the Lightning Trainer (fast_dev_run=True runs a single batch as a smoke test)
trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices=1, fast_dev_run=True)
# trainer.fit(model, dataloader)  # uncomment to run

print("PyTorch Lightning integration example complete")

# ============================================================
# Integration with TensorFlow/Keras
# ============================================================
# Note: the TF* classes need a TensorFlow install, and recent Transformers releases
# deprecate TensorFlow support, so check the version you have installed.
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# TensorFlow version of the model
tf_model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Compile
tf_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

# Prepare a TF dataset (example): tokenize the whole batch, then pair encodings with labels
# encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')
# tf_dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(2)
# tf_model.fit(tf_dataset, epochs=3)

print("TensorFlow/Keras integration example complete")
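
TextDataset relies on truncation=True together with padding='max_length' so that every item has the same length and the DataLoader can stack items into a batch. The pure-Python sketch below shows the idea under simplifying assumptions: pad_or_truncate is a hypothetical helper, the token ids are made up, pad_id=0 is an assumed padding id, and special-token handling during truncation is ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # number of padding positions still needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence gets padded; a long one gets truncated (illustrative ids only)
short_ids, short_mask = pad_or_truncate([11, 22, 33, 44], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every item to a fixed max_length is simple but wastes compute on short texts; padding per batch in a collate function is a common alternative.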
