(18) GPU Testing in 32 Days, from Beginner to Expert: TensorRT-LLM Deployment and Optimization (Day 16)

d1z888 · published 2026-04-11 16:59:33

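Introduction

TensorRT-LLM is NVIDIA's official library for optimizing LLM inference. As part of the NVIDIA ecosystem it integrates closely with the optimization features of NVIDIA GPUs, from kernel-level fusion to multi-GPU communication and quantization, and offers a complete stack for high-performance inference.

TensorRT-LLM is worth learning if you deploy LLMs for performance. On NVIDIA GPUs, especially data-center parts such as the A100 or H100, it often delivers the best performance available. The learning curve is steeper than vLLM's, but the performance gain usually justifies it.

How much faster is TensorRT-LLM than vLLM? Typically 20-50%, depending on model and workload.
How is a model compiled and optimized? A walkthrough of the build flow.
How does it scale across GPUs? Tensor parallelism and pipeline parallelism.
How well does INT8/FP8 quantization work? Up to about 4x the FP32 throughput, and about 2x the FP16 throughput, in this article's reference figures.
What matters in production deployment? A summary of best practices.

All of these questions point to one core topic: deploying and optimizing TensorRT-LLM.

Key Strengths of TensorRT-LLM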

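┌────────────────────────────────────────────────────────┐
│               TensorRT-LLM Key Strengths               │
├────────────────────────────────────────────────────────┤
│                                                        │
│  Leading performance:                                  │
│  ├── Latency: among the lowest (30-50 ms, batch=1)     │
│  ├── Throughput: among the highest (150-180 tokens/s)  │
│  ├── GPU memory efficiency: 85-95% utilization         │
│  └── Multi-GPU scaling: 90%+ of linear                 │
│                                                        │
│  Broad optimization coverage:                          │
│  ├── Kernel fusion: fewer memory accesses              │
│  ├── Quantization: INT8/FP8/FP16                       │
│  ├── Multi-GPU: tensor/pipeline parallelism            │
│  └── Communication: tuned for NVLink/InfiniBand        │
│                                                        │
│  Production ready:                                     │
│  ├── Triton integration: full serving framework        │
│  ├── Monitoring metrics: Prometheus support            │
│  ├── Dynamic batching: in-flight batching              │
│  └── Enterprise support: official NVIDIA support       │
│                                                        │
└────────────────────────────────────────────────────────┘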

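Setting Up the TensorRT-LLM Environment

Installing TensorRT-LLM is more involved than installing vLLM because it depends on several NVIDIA components (CUDA and TensorRT) as well as PyTorch. A correctly configured environment is a prerequisite for a successful deployment.

Before installing, make sure the system meets the requirements: Linux, CUDA 12.x, a sufficiently recent NVIDIA driver, and enough GPU memory. A 7B model needs at least 16 GB of GPU memory in FP16; a 70B model needs a multi-GPU deployment.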
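The memory figures above follow from simple arithmetic on the weights. The sketch below is a rough, weight-only estimate; the function name and the convention 1 GB = 1e9 bytes are choices made for this illustration, and a real deployment also needs room for the KV cache and activations.

```python
# Rough weight-only memory estimate; KV cache and activations are not included.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "fp8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("70B", 70e9)]:
        sizes = {dtype: weight_memory_gb(params, dtype) for dtype in BYTES_PER_PARAM}
        print(name, sizes)
```

A 7B model comes to about 14 GB in FP16, which is why 16 GB is the practical minimum, while a 70B model needs about 140 GB for its FP16 weights and therefore has to be spread over several GPUs.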

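Basic Installation

The installation script below walks through the full process. It first checks the system requirements (CUDA version, GPU model, driver version), then creates a Python virtual environment, installs PyTorch, TensorRT, and TensorRT-LLM, and finally verifies the installation.

Two things to note. The TensorRT-LLM wheel is installed from NVIDIA's package index (pypi.nvidia.com), so pip needs an extra index URL. If the wheel install fails, building from source is the fallback. After installation, verify it by importing the tensorrt_llm module and checking the GPU status.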

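#!/bin/bash
# Install TensorRT-LLM into a Python virtual environment

echo "=========================================="
echo "  TensorRT-LLM installation"
echo "=========================================="

# System requirements check
echo ""
echo "[0/6] Checking system requirements..."
# nvcc prints "Cuda compilation tools, release 12.x, V12.x.y"; field 2 is the release
echo "  CUDA: $(nvcc --version | grep release | cut -d',' -f2 | xargs)"
echo "  GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader)"
echo "  Driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader)"

# 1. Create the virtual environment
echo ""
echo "[1/6] Creating Python virtual environment..."
python3 -m venv /opt/tensorrt-llm-env
source /opt/tensorrt-llm-env/bin/activate

# 2. Install PyTorch
echo ""
echo "[2/6] Installing PyTorch..."
# Pick the wheel index that matches your CUDA 12.x setup (cu121 shown here)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install TensorRT
echo ""
echo "[3/6] Installing TensorRT..."
pip install tensorrt-cu12==10.0.1

# 4. Install TensorRT-LLM
echo ""
echo "[4/6] Installing TensorRT-LLM..."
# Install from NVIDIA's package index (recommended).
# The tensorrt_llm wheel pins its own torch/tensorrt versions, so pip may
# replace the versions installed in steps 2 and 3.
pip install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

# Or build from source (uses the repository's build script; see the official build docs)
# git clone https://github.com/NVIDIA/TensorRT-LLM.git
# cd TensorRT-LLM
# python3 ./scripts/build_wheel.py
# pip install ./build/tensorrt_llm*.whl

# 5. Install dependencies
echo ""
echo "[5/6] Installing dependencies..."
# Note: the PyPI package "triton" is the GPU kernel compiler, not Triton Inference Server
pip install transformers accelerate triton

# 6. Verify the installation
echo ""
echo "[6/6] Verifying installation..."
python3 -c "
import tensorrt_llm
import torch
print(f'TensorRT-LLM version: {tensorrt_llm.__version__}')
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA version: {torch.version.cuda}')
print(f'GPU available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU model: {torch.cuda.get_device_name(0)}')
"

echo ""
echo "=========================================="
echo "  TensorRT-LLM installation complete"
echo "=========================================="
echo ""
echo "Activate the environment: source /opt/tensorrt-llm-env/bin/activate"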
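The CUDA check in the script can also be done programmatically, which is convenient in a pre-flight test. The sketch below parses the text that `nvcc --version` prints; the function names and the sample line are illustrative, and the minimum major version of 12 comes from the requirements stated earlier.

```python
import re
from typing import Optional, Tuple


def parse_cuda_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_cuda_requirement(nvcc_output: str, required_major: int = 12) -> bool:
    """True when the reported CUDA major version is at least required_major."""
    version = parse_cuda_release(nvcc_output)
    return version is not None and version[0] >= required_major


# Example of the relevant line in nvcc's output (illustrative values)
SAMPLE = "Cuda compilation tools, release 12.3, V12.3.107"

if __name__ == "__main__":
    print(parse_cuda_release(SAMPLE), meets_cuda_requirement(SAMPLE))
```

In practice the input would come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.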

Docker Deployment

#!/bin/bash
# run_tensorrt_llm_docker.sh - deploy TensorRT-LLM with Docker

echo "=========================================="
echo "  TensorRT-LLM Docker deployment"
echo "=========================================="

# Configuration
MODEL_NAME=${MODEL_NAME:-"llama2-7b"}
GPU_COUNT=${GPU_COUNT:-1}
PORT=${PORT:-8000}

echo ""
echo "Deployment configuration:"
echo "  Model: $MODEL_NAME"
echo "  GPUs: $GPU_COUNT"
echo "  Port: $PORT"
echo ""

# Pull the image (image path as given in the original post; confirm it in the NGC catalog)
docker pull nvcr.io/nvidia/tensorrt_llm:latest

# Run the container in the background (-d) so the script can continue.
# NOTE: the server module below is taken from the original post; check the entry
# point shipped in your image (recent releases provide the trtllm-serve command).
docker run -d --runtime nvidia --gpus $GPU_COUNT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v "$(pwd)/models":/models \
    -p $PORT:8000 \
    --name tensorrt-llm-server \
    --shm-size 16G \
    --ipc=host \
    nvcr.io/nvidia/tensorrt_llm:latest \
    python3 -m tensorrt_llm_llama.api_server \
    --model_dir /models/$MODEL_NAME \
    --max_batch_size 16 \
    --max_input_len 1024 \
    --max_output_len 256

echo ""
echo "=========================================="
echo "  TensorRT-LLM service started"
echo "=========================================="
echo ""
echo "API endpoint: http://localhost:$PORT"
echo "Stop the service: docker stop tensorrt-llm-server"

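Triton Inference Server Integration

# config.pbtxt - Triton server configuration
# Parameter keys follow the original post; the tensorrtllm_backend templates use
# keys such as gpt_model_path and batch_scheduler_policy, so check your backend version.

name: "tensorrt_llm"
backend: "tensorrtllm"
# dynamic_batching only applies when max_batch_size > 0
max_batch_size: 16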

dynamic_batching {
  preferred_batch_size: [1, 2, 4, 8, 16]
  max_queue_delay_microseconds: 100000
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [-1]
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [1]
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [-1]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [1]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

parameters [
  {
    key: "max_beam_width"
    value: { string_value: "1" }
  },
  {
    key: "scheduler_policy"
    value: { string_value: "MAX_BATCH_SIZE" }
  }
]
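To show how the tensors declared in this configuration are used, the sketch below builds a request body in the KServe v2 JSON format that Triton's HTTP endpoint accepts (POST to /v2/models/tensorrt_llm/infer). It only constructs the payload and does not contact a server. The token IDs are made-up values, and the shapes assume batching is enabled so that every tensor carries a leading batch dimension.

```python
import json
from typing import List


def build_infer_request(token_ids: List[int], max_new_tokens: int) -> dict:
    """Build a v2 inference request body for one sequence (batch size 1)."""
    return {
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, len(token_ids)],
                "datatype": "INT32",
                "data": token_ids,
            },
            {
                "name": "input_lengths",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [len(token_ids)],
            },
            {
                "name": "request_output_len",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [max_new_tokens],
            },
        ]
    }


if __name__ == "__main__":
    # Hypothetical token IDs; a real client would get them from the tokenizer
    body = build_infer_request([1, 15043, 3186], max_new_tokens=64)
    print(json.dumps(body, indent=2))
```

The response carries output_ids and sequence_length, which the client decodes back to text with the same tokenizer.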

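Model Optimization and Compilation

One important difference from vLLM is that TensorRT-LLM compiles the model ahead of time into a TensorRT engine. Compilation converts the model into a format optimized for a specific GPU, which is where the performance comes from.

The build flow is: load the HuggingFace model, convert it to the TensorRT-LLM format, build the engine, and save the optimized model. Compilation can take a while (minutes to tens of minutes), but it only has to run once for a given model, GPU type, and build configuration.

Model Build Flow

The build script below shows the complete flow. It loads and saves the tokenizer, loads the PyTorch model, converts it to a TensorRT-LLM model, then builds and saves the engine.

The key build parameters are dtype (compute precision such as float16; INT8 and FP8 are selected through a quantization configuration rather than as a plain dtype), tensor_parallel_size (number of tensor-parallel GPUs), pipeline_parallel_size (number of pipeline-parallel GPUs), max_batch_size (maximum batch size), max_input_len (maximum input length), and max_output_len (maximum output length). Choose them according to your hardware and workload.

When the build finishes it produces the engine files and a configuration file, which the inference service loads later.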

#!/usr/bin/env python3
# tensorrt_llm_build.py - build TensorRT-LLM engines from a HuggingFace model
# (file name assumed). API names follow the Python build workflow
# (from_hugging_face -> build -> Engine.save) of TensorRT-LLM 0.1x releases
# and may differ in your installed version.

import argparse
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import tensorrt_llm
from tensorrt_llm import BuildConfig, Mapping
from tensorrt_llm.models import LLaMAForCausalLM

TORCH_DTYPES = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
}


def build_engine(
    model_name: str = "meta-llama/Llama-2-7b-chat-hf",
    output_dir: str = "/models/llama2-7b-trt",
    dtype: str = "float16",
    tp_size: int = 1,
    pp_size: int = 1,
    max_batch_size: int = 16,
    max_input_len: int = 1024,
    max_output_len: int = 256
):
    """Build TensorRT-LLM engines, one per rank."""

    world_size = tp_size * pp_size

    print("="*60)
    print("TensorRT-LLM model build")
    print("="*60)
    print(f"Model: {model_name}")
    print(f"Output directory: {output_dir}")
    print(f"Precision: {dtype}")
    print(f"Tensor parallel: {tp_size}")
    print(f"Pipeline parallel: {pp_size}")
    print()

    os.makedirs(output_dir, exist_ok=True)

    # 1. Load the tokenizer and store it next to the engine
    print("[1/5] Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.save_pretrained(output_dir)

    # 2. Load the PyTorch model once (on CPU) and reuse it for every rank
    print("[2/5] Loading PyTorch model...")
    hf_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=TORCH_DTYPES[dtype]
    )

    # Build limits are shared by all ranks. Newer releases express the output
    # budget as max_seq_len = input + output; older ones used max_output_len.
    build_config = BuildConfig(
        max_batch_size=max_batch_size,
        max_input_len=max_input_len,
        max_seq_len=max_input_len + max_output_len,
        max_beam_width=1,
        max_num_tokens=max_batch_size * (max_input_len + max_output_len)
    )

    for rank in range(world_size):
        # 3. Convert to a TensorRT-LLM model holding this rank's weight shard
        print(f"[3/5] Converting to TensorRT-LLM model (rank {rank})...")
        trt_llm_model = LLaMAForCausalLM.from_hugging_face(
            hf_model,
            dtype=dtype,
            mapping=Mapping(
                world_size=world_size,
                rank=rank,
                tp_size=tp_size,
                pp_size=pp_size
            )
        )

        # 4. Build the engine
        print(f"[4/5] Building TensorRT engine (rank {rank})...")
        engine = tensorrt_llm.build(trt_llm_model, build_config)

        # 5. Save rank<N>.engine; config.json is written alongside it
        print(f"[5/5] Saving engine (rank {rank})...")
        engine.save(output_dir)

    print()
    print("="*60)
    print("Model build complete")
    print("="*60)
    print(f"Engine path: {output_dir}/rank0.engine")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, default='meta-llama/Llama-2-7b-chat-hf')
    parser.add_argument('--output', type=str, default='/models/llama2-7b-trt')
    # INT8/FP8 go through a quantization config, see the quantization scripts
    parser.add_argument('--dtype', type=str, default='float16', choices=list(TORCH_DTYPES))
    parser.add_argument('--tp-size', type=int, default=1)
    parser.add_argument('--pp-size', type=int, default=1)
    args = parser.parse_args()

    build_engine(
        model_name=args.model,
        output_dir=args.output,
        dtype=args.dtype,
        tp_size=args.tp_size,
        pp_size=args.pp_size
    )

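Build Optimization Options

TensorRT-LLM exposes many build options, and understanding them is essential for getting the best performance.

Precision is the most consequential choice. float16 is the default and balances speed and accuracy. float32 gives the highest precision but is slower and is generally not recommended for inference. int8 gives roughly 2x the FP16 throughput in this article's reference figures (close to 4x over FP32) at a small accuracy cost (about 1%). fp8 needs FP8-capable hardware (Hopper, and also Ada Lovelace), gives about 2x over FP16, and loses less accuracy (about 0.5%).

Parallelism settings apply to multi-GPU deployments. tensor_parallel_size is the number of tensor-parallel GPUs and suits mid-size models (7B-70B). pipeline_parallel_size is the number of pipeline-parallel GPUs and suits very large models (>70B). world_size is the total number of GPUs and equals tensor_parallel_size times pipeline_parallel_size.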
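The relation world_size = tensor_parallel_size x pipeline_parallel_size also determines which role each process plays. The sketch below maps a global rank to its position in the two parallel groups. It assumes a layout in which tensor-parallel ranks are adjacent (rank = pp_rank * tp_size + tp_rank); that layout and the names used are assumptions of this illustration, not something the article specifies.

```python
from typing import NamedTuple


class RankCoords(NamedTuple):
    tp_rank: int  # position inside the tensor-parallel group
    pp_rank: int  # pipeline stage index


def rank_to_coords(rank: int, tp_size: int, pp_size: int) -> RankCoords:
    """Map a global rank to (tp_rank, pp_rank), assuming TP ranks are adjacent."""
    world_size = tp_size * pp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world_size {world_size}")
    return RankCoords(tp_rank=rank % tp_size, pp_rank=rank // tp_size)


if __name__ == "__main__":
    # 4 GPUs arranged as TP=2, PP=2
    for rank in range(4):
        print(rank, rank_to_coords(rank, tp_size=2, pp_size=2))
```

Keeping tensor-parallel ranks adjacent places the GPUs that communicate most often next to each other, which usually means on the same NVLink-connected node.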

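Batching settings trade throughput against latency. max_batch_size is the maximum batch size; raising it increases throughput but also GPU memory use, mainly through the KV cache. max_input_len and max_output_len should follow the application, and long-text workloads need larger values.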
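To make the memory cost of these limits concrete, the sketch below estimates the worst-case KV cache size for a given build configuration. It assumes an FP16 cache and LLaMA-2-7B's published shape (32 layers, 32 KV heads, head dimension 128); the function names are illustrative, and a paged KV cache allocates blocks on demand, so actual usage is normally lower than this upper bound.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes stored per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(max_batch_size: int, max_input_len: int,
                 max_output_len: int, **model_shape: int) -> float:
    """Worst-case KV cache size in GiB when every sequence reaches full length."""
    tokens = max_batch_size * (max_input_len + max_output_len)
    return tokens * kv_cache_bytes_per_token(**model_shape) / 2**30


# LLaMA-2-7B shape (multi-head attention, so KV heads = attention heads)
LLAMA2_7B = dict(num_layers=32, num_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    for batch in (8, 16, 32):
        print(batch, kv_cache_gib(batch, 1024, 256, **LLAMA2_7B), "GiB")
```

With the build defaults used in this article (batch 16, 1024 input and 256 output tokens) the bound is 10 GiB, and it doubles each time max_batch_size doubles.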

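Optimization options include enable_fp8 (FP8 compute on FP8-capable GPUs), enable_xqa (an optimized attention implementation), reduce_fusion (operator fusion), and use_paged_attention (paged attention, recommended). Exact option names vary between TensorRT-LLM releases, so check them against the build tool you have installed.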

┌────────────────────────────────────────────────────────┐
│               TensorRT-LLM Build Options               │
├────────────────────────────────────────────────────────┤
│                                                        │
│  Precision:                                            │
│  ├── dtype: float16 / float32 / int8 / fp8             │
│  │   └── float16: default, balances speed/accuracy     │
│  │   └── float32: highest precision, slower            │
│  │   └── int8: ~2x vs FP16, slight accuracy loss       │
│  │   └── fp8: Ada/Hopper GPUs, ~2x vs FP16             │
│  │                                                     │
│  ├── enable_fp8: true/false                            │
│  │   └── enable FP8 compute (Ada/Hopper GPUs)          │
│  │                                                     │
│  └── use_fp8_dequant: true/false                       │
│      └── FP8 dequantization optimization               │
│                                                        │
│  Parallelism:                                          │
│  ├── tensor_parallel_size: 1-8                         │
│  │   └── number of tensor-parallel GPUs                │
│  │   └── suits mid-size models (7B-70B)                │
│  │                                                     │
│  ├── pipeline_parallel_size: 1-4                       │
│  │   └── number of pipeline-parallel GPUs              │
│  │   └── suits very large models (>70B)                │
│  │                                                     │
│  └── world_size: tensor_parallel * pipeline            │
│      └── total number of GPUs                          │
│                                                        │
│  Batching:                                             │
│  ├── max_batch_size: 1-256                             │
│  │   └── maximum batch size                            │
│  │   └── higher: more throughput, more memory          │
│  │                                                     │
│  ├── max_input_len: 512-32768                          │
│  │   └── maximum input length                          │
│  │   └── set per application                           │
│  │                                                     │
│  ├── max_output_len: 128-4096                          │
│  │   └── maximum output length                         │
│  │   └── set per generation needs                      │
│  │                                                     │
│  └── max_num_tokens: batch * (input + output)          │
│      └── cap on total tokens                           │
│                                                        │
│  Optimization options:                                 │
│  ├── enable_xqa: true/false                            │
│  │   └── enable optimized attention kernels            │
│  │                                                     │
│  ├── reduce_fusion: true/false                         │
│  │   └── enable operator fusion                        │
│  │                                                     │
│  └── use_paged_attention: true/false                   │
│      └── enable paged attention (recommended)          │
│                                                        │
└────────────────────────────────────────────────────────┘

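Multi-GPU Inference

Large models (70B and up) often do not fit in the memory of a single GPU and need a multi-GPU deployment. TensorRT-LLM supports two forms of parallelism: tensor parallelism and pipeline parallelism.

Tensor parallelism splits every layer across several GPUs, and each GPU computes part of the layer. Because the partial results must be combined inside every layer, it communicates often and works best over a fast interconnect such as NVLink within one node; in exchange it lowers per-request latency and suits mid-size models (7B-70B). Pipeline parallelism assigns different layers to different GPUs so that they form a pipeline. It only passes activations between adjacent stages, so it moves less data, but stages can sit idle while waiting for work (pipeline bubbles); it suits very large models, especially when they span several nodes.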
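The difference between the two schemes is easier to see on a toy example. The sketch below simulates both with plain Python lists standing in for GPUs: tensor parallelism splits one weight matrix by columns and concatenates the partial outputs, while pipeline parallelism runs consecutive groups of layers one stage after another. It is a conceptual illustration only and assumes the column count and layer count divide evenly.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal column blocks (one per simulated GPU)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Each simulated GPU multiplies by its column shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp_size)]
    # The concatenation stands in for the all-gather done over NVLink
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x: Matrix, layers: List[Matrix], pp_size: int) -> Matrix:
    """Each simulated GPU owns a contiguous block of layers."""
    per_stage = len(layers) // pp_size
    for stage in range(pp_size):
        for w in layers[stage * per_stage:(stage + 1) * per_stage]:
            x = matmul(x, w)
        # Here the activations x would be sent to the next stage's GPU
    return x


if __name__ == "__main__":
    x = [[1.0, 2.0]]
    w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    print(matmul(x, w), tensor_parallel_matmul(x, w, tp_size=2))
```

Both functions return the same result as the single-device computation; what changes is where the work runs and how much data crosses between devices.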

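Tensor Parallelism Configuration

The multi-GPU script below shows how to set up and run tensor parallelism. It checks that enough GPUs are available for tp_size, sets up the GPU devices and the multi-process environment, then loads the model and runs inference.

As for scaling, throughput should grow close to linearly with the number of GPUs. In the figures below, 2-GPU TP reaches 93% scaling efficiency, 4-GPU TP 90%, and 8-GPU TP 88%, so eight GPUs deliver about 7x the single-GPU throughput.

NVLink matters a great deal for multi-GPU performance. It provides a high-speed GPU-to-GPU interconnect that is much faster than PCIe. nvidia-smi nvlink reports NVLink status and nvidia-smi topo -m shows the GPU topology. NCCL settings can improve communication efficiency further.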

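#!/usr/bin/env python3
# tensorrt_llm_multi_gpu.py - tensor-parallel inference (file name assumed)
# Launch one process per GPU, for example:
#   mpirun -n 2 python3 tensorrt_llm_multi_gpu.py
# API names follow the ModelRunner examples shipped with TensorRT-LLM and
# may differ between releases.

import torch
import tensorrt_llm
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner


def setup_multi_gpu(tp_size=2):
    """Check the multi-GPU environment and return this process's MPI rank."""

    rank = tensorrt_llm.mpi_rank()
    num_gpus = torch.cuda.device_count()

    if rank == 0:
        print("="*60)
        print("TensorRT-LLM multi-GPU setup")
        print("="*60)
        print(f"Tensor parallel size: {tp_size}")
        print(f"Available GPUs: {num_gpus}")
        for i in range(min(tp_size, num_gpus)):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
        print("="*60)

    if num_gpus < tp_size:
        raise ValueError(f"At least {tp_size} GPUs are required")

    # Each rank drives exactly one GPU
    torch.cuda.set_device(rank % num_gpus)
    return rank


def run_inference(
    engine_dir: str,
    prompts: list,
    tp_size: int = 2,
    max_new_tokens: int = 256
):
    """Multi-GPU inference"""

    # Set up the multi-GPU environment
    rank = setup_multi_gpu(tp_size)

    # Load the tokenizer and this rank's engine
    tokenizer = AutoTokenizer.from_pretrained(engine_dir)
    runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=rank)

    # generate() takes token IDs, not raw strings
    batch_input_ids = [
        torch.tensor(tokenizer.encode(p), dtype=torch.int32) for p in prompts
    ]
    end_id = tokenizer.eos_token_id

    # Inference
    with torch.no_grad():
        output_ids = runner.generate(
            batch_input_ids,
            max_new_tokens=max_new_tokens,
            end_id=end_id,
            pad_id=end_id
        )
    torch.cuda.synchronize()

    # output_ids has shape [batch, beams, seq_len] and includes the prompt tokens
    outputs = []
    for i, input_ids in enumerate(batch_input_ids):
        generated = output_ids[i][0][len(input_ids):]
        outputs.append(tokenizer.decode(generated, skip_special_tokens=True))

    # Print results once
    if rank == 0:
        for i, output in enumerate(outputs):
            print(f"\nPrompt {i+1}: {prompts[i]}")
            print(f"Output: {output}")

    return outputs


if __name__ == "__main__":
    # Example usage
    prompts = [
        "Please give an introduction to artificial intelligence.",
        "What is machine learning?",
    ]

    # 2-GPU tensor parallelism; the engine must have been built with tp_size=2
    outputs = run_inference(
        engine_dir="/models/llama2-7b-trt",
        prompts=prompts,
        tp_size=2
    )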

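Performance Scaling

┌───────────────────────────────────────────────────────────────────────┐
│                    TensorRT-LLM Multi-GPU Scaling                     │
├───────────────┬──────────────┬────────────┬──────────────┬────────────┤
│ GPU config    │ Latency (ms) │ Throughput │ Scaling eff. │ Notes      │
│               │ (batch=1)    │ (tok/s)    │ (%)          │            │
├───────────────┼──────────────┼────────────┼──────────────┼────────────┤
│ 1x A100       │ 35           │ 150        │ 100          │ baseline   │
│ 2x A100 TP    │ 25           │ 280        │ 93           │ TP=2       │
│ 4x A100 TP    │ 18           │ 540        │ 90           │ TP=4       │
│ 8x A100 TP    │ 14           │ 1050       │ 88           │ TP=8       │
│ 4x A100 TP+PP │ 20           │ 520        │ 87           │ TP=2, PP=2 │
└───────────────┴──────────────┴────────────┴──────────────┴────────────┘

Note: the stated test condition is LLaMA-2-70B in FP16, but a 70B FP16 model (about 140 GB of weights) does not fit on a single 80 GB A100, and the 1x A100 baseline matches the LLaMA-2-7B figures elsewhere in this article, so treat the table as illustrative. Actual performance depends on the model and configuration.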
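The scaling-efficiency column is the measured throughput divided by the ideal linear throughput (number of GPUs times the single-GPU baseline). The sketch below recomputes it from the table's throughput values; the function names are illustrative.

```python
def scaling_efficiency(throughput: float, num_gpus: int, baseline: float) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfectly linear)."""
    return throughput / (num_gpus * baseline)


def speedup(throughput: float, baseline: float) -> float:
    """How many times faster than the single-GPU baseline."""
    return throughput / baseline


# Throughput (tok/s) per GPU count, taken from the table above
BASELINE = 150
TP_THROUGHPUT = {2: 280, 4: 540, 8: 1050}

if __name__ == "__main__":
    for gpus, tput in TP_THROUGHPUT.items():
        eff = scaling_efficiency(tput, gpus, BASELINE)
        print(f"{gpus} GPUs: {speedup(tput, BASELINE):.1f}x speedup, {eff:.1%} efficiency")
```

For eight GPUs this gives a 7x speedup at 87.5% efficiency, which is why the gain is about 7x rather than the full 8x.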

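NVLink Optimization

#!/bin/bash
# nvlink_optimization.sh - NVLink optimization settings
# Run with `source nvlink_optimization.sh` so the exports persist in your shell

echo "=========================================="
echo "  NVLink optimization settings"
echo "=========================================="

# Check NVLink status
echo ""
echo "[1/3] NVLink status..."
nvidia-smi nvlink -s

# Check GPU topology
echo ""
echo "[2/3] GPU topology..."
nvidia-smi topo -m

# NCCL tuning
echo ""
echo "[3/3] NCCL tuning..."
export NCCL_P2P_DISABLE=0      # keep GPU peer-to-peer transfers enabled
export NCCL_P2P_LEVEL=NVL      # use P2P when GPUs are connected through NVLink
export NCCL_NET_GDR_LEVEL=3    # GPUDirect RDMA distance threshold
export NCCL_BUFFSIZE=8388608   # 8 MiB communication buffer

echo ""
echo "=========================================="
echo "  NVLink optimization complete"
echo "=========================================="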

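Quantization

Quantization is a key technique for raising inference performance and cutting GPU memory use. TensorRT-LLM supports both INT8 and FP8 quantization, which can give a significant speedup with very little accuracy loss.

INT8 quantization compresses weights from 16-bit floating point to 8-bit integers, which halves weight memory and can speed up computation by 2-4x depending on the baseline (about 2x over FP16 in this article's figures). FP8 is an 8-bit floating-point format introduced with NVIDIA's Hopper generation (Ada Lovelace GPUs support it as well). It uses the same number of bits as INT8 but keeps a floating-point exponent, which tends to preserve accuracy better.

INT8 Quantization

The INT8 build script below shows the full quantization flow: load the model, configure quantization, run calibration (using a calibration dataset to determine scale factors), and build the INT8 engine.

Calibration is the key step in INT8 quantization. Running a representative dataset (such as cnn_dailymail) through the model determines the scale factors for each layer so that the quantized model stays as close as possible to the original accuracy. 512-1024 calibration samples are usually enough.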
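To show what a scale factor does, the sketch below applies symmetric per-tensor INT8 quantization to a handful of values: the scale maps the largest magnitude to 127, values are rounded to integers, and dequantization multiplies back. Max-abs scaling is a deliberately simple method chosen for illustration; it is not necessarily what TensorRT-LLM's calibration computes, and the sample weights are made up.

```python
from typing import List


def compute_scale(values: List[float]) -> float:
    """Symmetric per-tensor scale: the largest magnitude maps to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map INT8 values back to floating point."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.003, 1.0]  # made-up sample values
    scale = compute_scale(weights)
    q = quantize(weights, scale)
    restored = dequantize(q, scale)
    print("scale:", scale)
    print("int8:", q)
    print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The rounding error of any value is at most half a quantization step, so a single outlier that inflates the scale costs precision for every other value. Calibration on representative data exists to choose scales that avoid this.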

The FP8 flow is similar to INT8 but needs FP8-capable hardware such as Hopper GPUs (H100/H200). FP8 can give about 2x the FP16 performance with an accuracy loss of only about 0.5%.

#!/usr/bin/env python3
# tensorrt_llm_int8.py - build an INT8 (SmoothQuant) engine (file name assumed)
# API names follow the quantize -> from_checkpoint -> build workflow of
# TensorRT-LLM 0.1x releases; check them against your installed version.

import os

from transformers import AutoTokenizer
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo


def build_int8_engine(
    model_name: str = "meta-llama/Llama-2-7b-chat-hf",
    output_dir: str = "/models/llama2-7b-int8",
    calib_dataset: str = "cnn_dailymail",
    num_calib_samples: int = 512
):
    """Build an INT8-quantized engine"""

    print("="*60)
    print("TensorRT-LLM INT8 quantized build")
    print("="*60)

    os.makedirs(output_dir, exist_ok=True)
    checkpoint_dir = f"{output_dir}/checkpoint"

    # 1. Load the tokenizer (the model weights are loaded during calibration)
    print("[1/5] Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # 2. Configure quantization
    print("[2/5] Configuring quantization...")
    quant_config = QuantConfig(
        quant_algo=QuantAlgo.W8A8_SQ_PER_CHANNEL,  # INT8 weights and activations
        kv_cache_quant_algo=None,  # do not quantize the KV cache
        smoothquant_val=0.5
    )

    # 3. Calibration
    print("[3/5] Running calibration...")
    # Runs the calibration dataset through the model to determine the scale
    # factors and writes a quantized checkpoint
    LLaMAForCausalLM.quantize(
        model_name,
        checkpoint_dir,
        dtype="float16",
        quant_config=quant_config,
        calib_dataset=calib_dataset,
        calib_batches=num_calib_samples
    )

    # 4. Build the engine from the quantized checkpoint
    print("[4/5] Building INT8 engine...")
    trt_llm_model = LLaMAForCausalLM.from_checkpoint(checkpoint_dir)

    build_config = BuildConfig(
        max_batch_size=16,
        max_input_len=1024,
        max_seq_len=1024 + 256  # input + output budget
    )

    engine = tensorrt_llm.build(trt_llm_model, build_config)

    # 5. Save rank0.engine and config.json, plus the tokenizer
    print("[5/5] Saving engine...")
    engine.save(output_dir)
    tokenizer.save_pretrained(output_dir)

    print()
    print("="*60)
    print("INT8 quantization complete")
    print("="*60)


if __name__ == "__main__":
    build_int8_engine()

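FP8 Quantization (Hopper)

#!/usr/bin/env python3
# tensorrt_llm_fp8.py - TensorRT-LLM FP8 quantization (Hopper GPUs)

import torch
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# build_engine is the FP16 build function from the build script shown earlier;
# the module name below is an assumption, adjust it to your file name.
from tensorrt_llm_build import build_engine


def build_fp8_engine(
    model_name: str,
    output_dir: str
):
    """Build an FP8-quantized engine (H100/H200)"""

    print("="*60)
    print("TensorRT-LLM FP8 quantized build (Hopper)")
    print("="*60)

    # Check the GPU: FP8 needs compute capability 8.9 (Ada) or 9.0+ (Hopper)
    gpu_name = torch.cuda.get_device_name(0)
    if torch.cuda.get_device_capability(0) < (8, 9):
        print(f"Warning: FP8 needs an FP8-capable GPU such as H100/H200, found {gpu_name}")
        print("Falling back to FP16")
        return build_engine(model_name, output_dir, dtype="float16")

    # FP8 configuration: FP8 (E4M3) for both weights and activations
    quant_config = QuantConfig(
        quant_algo=QuantAlgo.FP8
    )

    # The build flow is similar to INT8
    # ...

    print("FP8 quantization complete")
    print("Expected speedup: 2x (relative to FP16)")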
Quantization Performance Comparison

┌─────────────────────────────────────────────────────────────────────────┐
│            TensorRT-LLM Quantization Comparison (LLaMA-2-7B)            │
├───────────────┬──────────────┬────────────┬─────────────┬───────────────┤
│ Precision     │ Latency (ms) │ Throughput │ Memory (GB) │ Accuracy loss │
│               │ (batch=1)    │ (tok/s)    │             │               │
├───────────────┼──────────────┼────────────┼─────────────┼───────────────┤
│ FP32          │ 50           │ 80         │ 28          │ 0%            │
│ FP16          │ 35           │ 150        │ 14          │ 0%            │
│ INT8          │ 20           │ 280        │ 8           │ ~1%           │
│ FP8           │ 18           │ 320        │ 8           │ ~0.5%         │
└───────────────┴──────────────┴────────────┴─────────────┴───────────────┘

Note: measured on A100 80GB (FP8 on H100); actual performance depends on the hardware.

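Performance Measurement

Once TensorRT-LLM is deployed, benchmark it to confirm that the configuration is correct and that performance meets expectations. The tests cover latency, throughput, and concurrency.

The latency test measures the response time of a single request, which matters most for real-time applications. The throughput test measures tokens processed per unit time, which matters for high-concurrency workloads. The concurrency test evaluates how the service behaves under many simultaneous requests.

Full Benchmark

The benchmark script below provides a test harness. The latency test runs the same request several times and reports mean and P95 latency. The throughput test varies the batch size and observes how throughput changes. The concurrency test is meant to simulate a realistic multi-user load and is left as a stub in the script.

The reference values list the performance to expect from different configurations. In FP16 at batch=1, throughput is about 140-160 tokens/s with 30-40 ms latency. INT8 quantization raises throughput to 260-300 tokens/s and lowers latency to 18-25 ms. FP8 (Hopper GPUs) raises throughput further to 300-340 tokens/s and lowers latency to 15-20 ms.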
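The P95 figure depends on how the percentile is defined, which matters when there are only a few samples. The sketch below uses the nearest-rank definition (the smallest sample such that at least 95% of samples are less than or equal to it); the function name is illustrative.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the value at 1-based rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    latencies_ms = [31.0, 33.5, 30.2, 35.1, 32.8, 34.0, 30.9, 36.7, 33.1, 48.3]
    print("P50:", percentile_nearest_rank(latencies_ms, 50))
    print("P95:", percentile_nearest_rank(latencies_ms, 95))
```

With only 10 runs the P95 is simply the slowest run, so a single outlier determines it. Collecting 100 or more samples gives a more meaningful tail figure.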

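#!/usr/bin/env python3
# tensorrt_llm_benchmark.py - TensorRT-LLM benchmark
# API names follow the ModelRunner examples shipped with TensorRT-LLM and
# may differ between releases.

import math
import statistics
import time

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner


class TensorRTLLMBenchmark:
    """TensorRT-LLM benchmark"""

    def __init__(self, engine_dir: str, max_new_tokens: int = 256):
        # The build scripts save the tokenizer next to the engine
        self.tokenizer = AutoTokenizer.from_pretrained(engine_dir)
        self.runner = ModelRunner.from_dir(engine_dir)
        self.max_new_tokens = max_new_tokens

    def _generate(self, prompts: list):
        """Run one batch and return (elapsed seconds, generated token count)."""
        # generate() takes token IDs, not raw strings
        batch_input_ids = [
            torch.tensor(self.tokenizer.encode(p), dtype=torch.int32)
            for p in prompts
        ]
        end_id = self.tokenizer.eos_token_id

        start = time.perf_counter()
        outputs = self.runner.generate(
            batch_input_ids,
            max_new_tokens=self.max_new_tokens,
            end_id=end_id,
            pad_id=end_id,
            return_dict=True,
            output_sequence_lengths=True
        )
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        # sequence_lengths counts prompt + generated tokens per sequence
        seq_lens = outputs['sequence_lengths']
        generated = sum(
            int(seq_lens[i][0]) - len(ids)
            for i, ids in enumerate(batch_input_ids)
        )
        return elapsed, generated

    def test_latency(self, prompts: list, num_runs: int = 10):
        """Latency test"""

        print("="*70)
        print("Latency test")
        print("="*70)

        results = []

        for i, prompt in enumerate(prompts):
            latencies = []

            for _ in range(num_runs):
                elapsed, _tokens = self._generate([prompt])
                latencies.append(elapsed * 1000)

            avg_latency = statistics.mean(latencies)
            # Nearest-rank P95; with 10 runs this is the slowest run
            p95_index = math.ceil(0.95 * len(latencies)) - 1
            p95_latency = sorted(latencies)[p95_index]

            results.append({
                'prompt_id': i,
                'avg_latency_ms': avg_latency,
                'p95_latency_ms': p95_latency
            })

            print(f"Prompt {i+1}: mean {avg_latency:.1f}ms, P95 {p95_latency:.1f}ms")

        return results

    def test_throughput(self, prompt: str, batch_sizes: list):
        """Throughput test"""

        print("\n" + "="*70)
        print("Throughput test")
        print("="*70)

        print(f"\n{'Batch Size':<12} {'Tokens/s':<15} {'Latency (ms)':<15} {'GPU Mem (GB)':<15}")
        print("-"*70)

        results = []

        for batch_size in batch_sizes:
            prompts = [prompt] * batch_size

            elapsed, total_tokens = self._generate(prompts)

            # Compute metrics from real token counts, not whitespace-split words
            tokens_per_sec = total_tokens / elapsed
            avg_latency = elapsed * 1000 / batch_size  # amortized time per request
            # torch.cuda.memory_allocated() misses TensorRT's own allocations,
            # so report device-level usage instead
            free_bytes, total_bytes = torch.cuda.mem_get_info()
            gpu_mem = (total_bytes - free_bytes) / 1e9

            results.append({
                'batch_size': batch_size,
                'tokens_per_sec': tokens_per_sec,
                'avg_latency_ms': avg_latency,
                'gpu_mem_gb': gpu_mem
            })

            print(f"{batch_size:<12} {tokens_per_sec:<15.1f} "
                  f"{avg_latency:<15.1f} {gpu_mem:<15.2f}")

        return results

    def test_concurrent(self, num_requests: int = 100, concurrency: int = 32):
        """Concurrency test"""

        print("\n" + "="*70)
        print("Concurrency test")
        print("="*70)

        # Concurrent request test to be implemented
        # using asyncio and multithreading
        pass


def main():
    benchmark = TensorRTLLMBenchmark("/models/llama2-7b-trt")

    # Test data
    prompts = [
        "Please introduce the basic concepts of artificial intelligence.",
        "What are the main types of machine learning?",
        "How does deep learning differ from traditional machine learning?",
    ]

    # Latency test
    benchmark.test_latency(prompts)

    # Throughput test
    benchmark.test_throughput(prompts[0], [1, 2, 4, 8, 16])

    # Concurrency test
    benchmark.test_concurrent()


if __name__ == "__main__":
    main()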
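The concurrency test is left unimplemented in the script. One possible shape for it is sketched below with asyncio: a semaphore caps the number of requests in flight while all requests are scheduled at once. The `fake_request` coroutine is a stand-in that only sleeps; in a real test it would be replaced by an HTTP or gRPC call to the inference server. All names here are hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def fake_request(i: int) -> float:
    """Stand-in for one inference call; returns its latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # replace with a real client call
    return time.perf_counter() - start


async def run_load(
    num_requests: int,
    concurrency: int,
    send: Callable[[int], Awaitable[float]] = fake_request,
) -> List[float]:
    """Send num_requests requests with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(i: int) -> float:
        async with semaphore:
            return await send(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_requests)))


if __name__ == "__main__":
    started = time.perf_counter()
    latencies = asyncio.run(run_load(num_requests=100, concurrency=32))
    wall = time.perf_counter() - started
    print(f"{len(latencies)} requests in {wall:.2f}s, "
          f"mean latency {sum(latencies) / len(latencies) * 1000:.1f} ms")
```

Dividing the number of completed requests by the wall-clock time gives requests per second at that concurrency level, and repeating the run with different concurrency values shows where the service saturates.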

Performance Reference Values

┌────────────────────────────────────────────────────────────────────┐
│       TensorRT-LLM Performance Reference (LLaMA-2-7B, A100)        │
├──────────────────┬─────────────┬─────────────┬─────────────────────┤
│ Configuration    │ Throughput  │ Latency     │ GPU memory          │
│                  │ (tok/s)     │ (ms)        │ (GB)                │
├──────────────────┼─────────────┼─────────────┼─────────────────────┤
│ FP16, batch=1    │ 140-160     │ 30-40       │ 14-16               │
│ FP16, batch=8    │ 600-700     │ 50-70       │ 16-18               │
│ FP16, batch=16   │ 900-1000    │ 80-100      │ 18-22               │
│ INT8, batch=1    │ 260-300     │ 18-25       │ 8-10                │
│ INT8, batch=16   │ 1200-1400   │ 50-70       │ 10-14               │
│ FP8, batch=1     │ 300-340     │ 15-20       │ 8-10                │
│ FP8, batch=16    │ 1400-1600   │ 40-60       │ 10-14               │
└──────────────────┴─────────────┴─────────────┴─────────────────────┘

Note: measured on A100 80GB (FP8 rows on H100 80GB); actual performance depends on the configuration.

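Production Deployment

TensorRT-LLM is usually paired with NVIDIA Triton Inference Server for a complete production deployment. Triton provides model management, dynamic batching, concurrent serving of multiple models, and a full set of monitoring metrics.

Deploying the Triton Service

The production setup consists of the Triton server, Prometheus for monitoring, and Grafana for visualization. The Triton server loads the TensorRT-LLM engine and exposes HTTP and gRPC endpoints. Prometheus collects the performance metrics and Grafana provides the dashboards.

The metrics to watch are request latency, throughput, GPU utilization, GPU memory usage, and error rate. They let you spot performance problems early, plan capacity, and tune the configuration.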

version: '3.8'

services:
  triton-server:
    # the -trtllm-python-py3 image variants bundle the TensorRT-LLM backend
    image: nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # gRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./models:/models
    command: >
      tritonserver
      --model-repository=/models
      --strict-model-config=false
      --log-verbose=1
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v2/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

volumes:
  grafana-data:

Monitoring Metrics

# prometheus.yml - Triton monitoring configuration

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'triton'
    static_configs:
      - targets: ['triton-server:8002']
    metrics_path: '/metrics'
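Prometheus scrapes Triton's metrics endpoint (port 8002, path /metrics), which returns plain text in the Prometheus exposition format. The sketch below parses such text and derives an error rate from Triton's request success and failure counters. The sample text is made up for illustration and the helper names are hypothetical; in production the same calculation would normally be a PromQL query in Grafana.

```python
import re

# name{labels} value  (labels are optional)
METRIC_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def sum_metric(text: str, name: str) -> float:
    """Sum a metric over all label sets (for example, over all models)."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if match and match.group("name") == name:
            total += float(match.group("value"))
    return total


def error_rate(text: str) -> float:
    """Failed requests divided by all requests; 0.0 when there is no traffic."""
    ok = sum_metric(text, "nv_inference_request_success")
    failed = sum_metric(text, "nv_inference_request_failure")
    total = ok + failed
    return failed / total if total else 0.0


# Illustrative sample, not real measurements
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="tensorrt_llm",version="1"} 990
nv_inference_request_failure{model="tensorrt_llm",version="1"} 10
"""

if __name__ == "__main__":
    print(f"error rate: {error_rate(SAMPLE):.2%}")
```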

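Troubleshooting

You may run into build failures, inference errors, or unexpected performance when using TensorRT-LLM. The guide below helps you locate and fix the problem quickly.

Build Failures

Build failures are most often caused by insufficient GPU memory. Things to try: check GPU memory use and make sure enough is free; reduce max_batch_size and max_input_len; check that the CUDA version is compatible with your TensorRT-LLM release; and if the problem persists, reinstall TensorRT-LLM.

Inference Errors

Inference errors can come from a corrupted engine file, a wrong configuration file, or a tokenizer problem. Steps: check that the engine file exists and is complete; verify the configuration file (config.json); test that the tokenizer loads; and read the Triton server log for the specific error.

Performance Anomalies

If performance is below expectations there may be a configuration problem or a hardware bottleneck. Steps: check GPU utilization to confirm the GPU is kept busy; check NVLink status to confirm multi-GPU communication is healthy; verify the NCCL configuration; confirm that quantization actually took effect; and compare against the benchmark results to find where the gap comes from.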

# Problem: model build fails

# 1. Check GPU memory
nvidia-smi

# 2. Reduce max_batch_size (build option)
#    --max_batch_size 8

# 3. Reduce max_input_len (build option)
#    --max_input_len 512

# 4. Check the CUDA version
nvcc --version

# 5. Reinstall TensorRT-LLM
pip uninstall -y tensorrt_llm
pip install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

Inference Errors

# Problem: errors during inference

# 1. Check the engine files
ls -la /models/llama2-7b-trt/

# 2. Verify the configuration file
cat /models/llama2-7b-trt/config.json

# 3. Check the tokenizer
python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('/models/llama2-7b-trt')"

# 4. Follow the server log
tail -f triton_server.log

# 5. Restart the service
docker restart triton-server
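The first two checks can be automated before a server is started. The sketch below verifies that an engine directory contains config.json and one non-empty rank<N>.engine file per GPU, which is the layout used by the build scripts in this article; the function name is illustrative, and the check does not validate the engine contents.

```python
from pathlib import Path
from typing import List


def check_engine_dir(engine_dir: str, world_size: int = 1) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(engine_dir)
    problems = []

    if not (root / "config.json").is_file():
        problems.append("missing config.json")

    for rank in range(world_size):
        engine = root / f"rank{rank}.engine"
        if not engine.is_file():
            problems.append(f"missing {engine.name}")
        elif engine.stat().st_size == 0:
            problems.append(f"{engine.name} is empty")

    return problems


if __name__ == "__main__":
    issues = check_engine_dir("/models/llama2-7b-trt", world_size=1)
    print("OK" if not issues else "\n".join(issues))
```

For a tensor-parallel engine, pass the total number of ranks as world_size so that a missing shard is reported before any process tries to load it.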

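Performance Anomalies

# Problem: performance below expectations

# 1. Check GPU utilization
nvidia-smi dmon

# 2. Check NVLink
nvidia-smi nvlink -s

# 3. Check the NCCL configuration (NCCL logs its setup on the next run)
export NCCL_DEBUG=INFO

# 4. Verify quantization: confirm the installed version, then look at the
#    quantization settings recorded in the engine's config.json
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
grep -i -A 4 '"quantization"' /models/llama2-7b-int8/config.json

# 5. Compare against the benchmark
python3 tensorrt_llm_benchmark.py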