Java 程序员第 30 阶段：Docker容器化部署大模型：标准化环境一键上线

布吉岛的石头

580人浏览 · 2026-05-25 20:32:04

布吉岛的石头 · 2026-05-25 20:32:04 发布

作者：洛水石 | 适用：Java大模型面试

在人工智能飞速发展的今天，大语言模型（Large Language Model，LLM）已经成为企业智能化转型的核心基础设施。然而，大模型的部署却面临诸多挑战：复杂的依赖环境、CUDA版本兼容性、GPU资源调度、多模型协作等，每一个环节都可能成为生产落地的拦路虎。Docker容器化技术的出现，为这些问题提供了优雅的解决方案。

本文将深入探讨如何使用Docker容器化技术实现大模型的标准化部署，从环境配置、镜像制作、服务编排到生产级优化，形成一套完整的大模型容器化部署实践指南。

Docker容器化大模型的优势

环境标准化与一致性

传统的机器学习部署往往面临"在我机器上能跑"（It Works on My Machine）的困境。不同服务器的操作系统版本、CUDA驱动、Python库版本等细微差异，都可能导致模型推理行为不一致。Docker容器通过将应用及其所有依赖打包成镜像，确保了从开发到生产的完整环境一致性。

当企业需要部署多个大模型服务时，容器化优势更加明显。每个模型服务可以运行在独立的容器中，拥有自己的CUDA版本和依赖环境，彻底消除模型之间的依赖冲突。例如，可以使用一个容器运行基于CUDA 11.8的LLaMA-2-13B，另一个容器运行基于CUDA 12.1的Qwen-7B，它们之间完全隔离，互不影响。

资源隔离与高效利用

大模型推理是资源密集型任务，对GPU显存、内存、CPU都有较高要求。Docker容器提供了精细的资源隔离机制，可以为每个容器设置GPU数量、内存上限、CPU配额等参数。这种隔离不仅保证了服务质量，还提高了集群资源的整体利用率。

在共享GPU集群场景下，容器化允许将一张GPU卡分配给多个容器使用，每个容器只占用其中一部分显存。通过`nvidia-container-toolkit`提供的GPU虚拟化能力，可以实现GPU时间的时分复用，让小模型推理任务也能高效利用昂贵的GPU资源。

快速部署与弹性伸缩

Docker镜像是自包含的部署单元，只需一次构建，即可在任何支持Docker的环境中运行。这大大简化了大模型的分发和部署流程。对于需要频繁更新模型版本的场景，容器化可以实现蓝绿部署或金丝雀发布，将版本切换的停机时间压缩到秒级。

结合Kubernetes等容器编排平台，大模型服务可以根据负载情况自动伸缩。当请求量增加时，自动启动更多推理实例；当负载降低时，回收多余实例。这种弹性伸缩能力对于处理突发流量尤为重要，比如电商大促或热点事件引发的推理请求峰值。

简化运维与故障恢复

容器化架构使得大模型服务的运维工作变得简单可控。每个容器的文件系统、进程空间、网络栈都是独立的，容器内的故障不会影响宿主机和其他容器。当服务出现问题时，可以快速定位到具体容器，并利用容器快照进行故障恢复或问题复现。

Docker的日志驱动机制将容器日志统一收集到标准输出，配合ELK（Elasticsearch、Logstash、Kibana）或Loki等日志系统，可以实现全量日志的聚合查询和实时分析。这种集中化的日志管理对于排查生产环境问题至关重要。

▲ 图1

NVIDIA Docker配置与GPU支持

nvidia-container-toolkit工作原理

在Docker容器中使用GPU，需要依赖nvidia-container-toolkit（原nvidia-docker2）提供的容器运行时支持。其核心原理是在容器启动时，将宿主机的NVIDIA驱动和GPU设备映射到容器内部，使容器内的CUDA程序能够直接访问物理GPU。

nvidia-container-toolkit通过Linux的设备映射（device mapper）和挂载命名空间（mount namespace）机制实现GPU穿透。当容器启动时，nvidia-container-runtime会拦截容器创建请求，将`/dev/nvidia*`设备节点和NVIDIA驱动库挂载到容器内部。这个过程对用户透明，开发者只需在`docker run`命令中指定`--gpus`参数即可。

技术实现上，nvidia-container-toolkit包含三个核心组件：`nvidia-container-runtime`负责与Docker通信，执行GPU挂载操作；`libnvidia-container`提供了GPU挂载的具体实现逻辑；`nvidia-container-toolkit`是命令行工具，用于配置容器运行时行为。这三者的协作使得GPU容器化成为可能。

CUDA版本兼容性矩阵

选择正确的CUDA版本是构建大模型Docker镜像的关键决策。CUDA版本的选择需要考虑多个因素：GPU驱动版本、深度学习框架兼容性、模型运行时要求等。以下是选择CUDA版本时需要遵循的核心原则：

**第一，容器内CUDA版本必须小于或等于主机驱动支持的最高CUDA版本。** 可以通过`nvidia-smi`命令查看主机驱动的CUDA支持版本。如果主机驱动为NVIDIA 535版本，那么容器内最高只能使用CUDA 12.1（12.x系列的最新稳定版）。强行使用更高版本的CUDA将导致运行时错误。

**第二，深度学习框架对CUDA版本有明确要求。** 以PyTorch为例，PyTorch 2.0+推荐使用CUDA 11.8或CUDA 12.1；TensorFlow 2.13+则推荐CUDA 12.1。使用不兼容的版本组合可能导致模型加载失败或推理结果异常。

**第三，cuDNN版本需要与CUDA版本匹配。** cuDNN（CUDA Deep Neural Network Library）是CUDA的深度学习加速库，其版本号遵循`major.minor.patch`格式。安装cuDNN时需要确认其与CUDA版本的兼容性，可以通过NVIDIA官方文档查询版本对应关系。

在实际项目中，推荐使用NVIDIA官方提供的CUDA基础镜像，如`nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04`，这类镜像已经过严格测试，包含了经过验证的CUDA和cuDNN版本组合。

GPU资源配额配置

在生产环境中，通常需要在单个GPU上运行多个容器，或者限制单个容器可以使用的GPU资源。Docker提供了多种GPU资源配额配置方式。

**GPU数量限制**通过`--gpus`参数实现，可以指定容器可以使用的GPU数量。例如，`--gpus '"device=0"'`限制容器只能使用GPU 0，`--gpus '"device=0,1"'`则允许容器使用GPU 0和1。对于需要GPU数量感知的模型推理，可以使用`--gpus all`授予容器使用所有可用GPU。

**显存限制**需要通过环境变量配置，不同的推理引擎有不同的显存配置方式。vLLM使用`NVIDIA_VISIBLE_DEVICES`环境变量配合`gpu_memory_utilization`参数；TensorRT则使用`TF_FORCE_GPU_ALLOW_GROWTH=true`允许显存动态增长。显尔的显存预留可以通过`NVIDIA_DRIVER_CAPABILITIES`设置容器的GPU功能集。

**GPU算力限制**在某些场景下可能需要限制容器使用GPU的算力比例。这可以通过NVIDIA的GPU分区工具（mig-manager）实现，将GPU划分为多个GPU实例（GPI），每个实例拥有独立的算力和显存资源。不过，GPU分区会引入额外的开销，对于延迟敏感的推理任务通常不推荐使用。

验证GPU容器配置

配置完成后，需要验证GPU是否可以正常访问。可以通过以下命令检查容器内的GPU环境：

## 大模型Docker镜像制作

### Base Image选择策略

选择合适的基础镜像是构建大模型Docker镜像的第一步。NVIDIA官方维护的CUDA基础镜像是最佳选择，它们经过严格测试，包含了经过验证的驱动和库组合。

**Ubuntu基础镜像 vs CentOS基础镜像：** Ubuntu是目前主流的Docker镜像操作系统，拥有更广泛的软件包支持和更快的安全更新。对于需要频繁更新安全补丁的生产环境，Ubuntu是更好的选择。CentOS则以其稳定性著称，适合对系统一致性要求极高的企业级场景。

**-devel镜像 vs runtime镜像：** NVIDIA CUDA镜像分为`devel`和`runtime`两个版本。`devel`镜像包含完整的编译工具链（gcc、nvcc等），体积较大（约十几GB），适合需要编译CUDA程序或构建依赖CUDA的程序场景。`runtime`镜像只包含运行时库，体积较小（约5GB），适合直接运行已编译好的CUDA程序。对于大模型推理服务，推荐使用`runtime`镜像以减小镜像体积和攻击面。

** alpine基础镜像的局限性：** 部分大模型官方镜像使用alpine作为基础系统以减小体积。但alpine的libc为musl而非glibc，可能导致某些Python扩展（如numpy、scipy的底层C库）编译或运行问题。如果遇到奇怪的动态链接错误，可能需要切换到Ubuntu或CentOS基础镜像。

### CUDA版本与依赖环境配置

构建大模型Docker镜像时，需要在Dockerfile中显式指定CUDA版本和必要的依赖环境。以下是一个典型的Dockerfile结构：

```dockerfile
# 阶段一：构建阶段
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 AS builder

# 设置环境变量
ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_VERSION=12.1.1
ENV PYTHON_VERSION=3.10

# 安装构建工具和Python
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*

# 设置Python符号链接
RUN ln -sf /usr/bin/python3.10 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip

# 安装PyTorch（支持CUDA 12.1）
RUN pip install --no-cache-dir \
    torch==2.1.0 \
    torchvision==0.16.0 \
    torchaudio==2.1.0 \
    --index-url https://download.pytorch.org/whl/cu121

# 安装推理框架
RUN pip install --no-cache-dir \
    vllm==0.2.0 \
    transformers==4.35.0 \
    accelerate==0.25.0

# 阶段二：运行时阶段
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# 复制Python环境
COPY --from=builder /usr/bin/python /usr/bin/python
COPY --from=builder /usr/bin/pip /usr/bin/pip
COPY --from=builder /usr/lib/python3.10 /usr/lib/python3.10

# 复制已安装的Python包
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages

# 安装运行时依赖
RUN apt-get update && apt-get install -y \
    libgomp1 \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# 设置工作目录
WORKDIR /app

# 复制应用代码
COPY app.py /app/
COPY requirements.txt /app/

# 暴露端口
EXPOSE 8000

# 启动命令
CMD ["python", "app.py"]
```

### 镜像构建优化技巧

大模型Docker镜像通常体积较大（动辄十几GB），优化构建过程可以显著缩短构建时间和减少镜像体积。

**多阶段构建**是减小镜像体积的有效方法。如上例所示，第一阶段使用`devel`镜像安装编译工具和依赖，生成最终的可执行文件；第二阶段使用`runtime`镜像，只复制运行所需的文件。Python的`pip wheel`安装方式可以将已编译的包缓存起来，避免在运行时重新编译。

**层缓存优化**利用Docker的层缓存机制加速构建。将不经常变化的指令（如系统包安装、Python环境配置）放在Dockerfile前面，将频繁变化的指令（如应用代码复制）放在后面。这样，当应用代码变化时，可以复用前面不变的层，只重新构建后面的层。

**合并RUN指令**可以减少Docker镜像的层数。将多个`RUN`指令合并为一个，可以减小镜像体积并加快容器启动速度。但需要注意，合并后的RUN指令不宜过长，否则会降低Dockerfile的可读性和可维护性。

**使用.dockerignore文件**排除不需要的文件进入构建上下文，避免将大文件（如训练数据、缓存的模型权重）复制到镜像中，导致镜像体积膨胀。

### 构建参数与环境变量

为了提高Dockerfile的灵活性，可以使用`ARG`和`ENV`指令定义构建参数和环境变量：

```dockerfile
# 构建参数（可以在docker build时传入）
ARG CUDA_VERSION=12.1.1
ARG PYTHON_VERSION=3.10
ARG MODEL_NAME=Qwen-7B-Chat

# 环境变量（容器运行时可见）
ENV MODEL_NAME=${MODEL_NAME}
ENV CUDA_VISIBLE_DEVICES=0
ENV TRANSFORMERS_CACHE=/models/.cache
ENV HF_HOME=/models/.cache

# 使用构建参数动态选择基础镜像
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-runtime-ubuntu22.04
```

构建命令示例：

```bash
docker build \
    --build-arg CUDA_VERSION=12.1.1 \
    --build-arg MODEL_NAME=Qwen-7B-Chat \
    -t my-llm-service:latest \
    -f Dockerfile .
```

## 主流大模型Docker镜像

### Ollama官方镜像

Ollama是近年来备受关注的大模型推理框架，以其极简的使用方式和出色的本地部署体验著称。Ollama官方提供了预构建的Docker镜像，极大地简化了本地部署流程。

**ollama/ollama**镜像是Ollama的核心运行时，包含了Ollama服务程序和完整的模型推理能力。使用该镜像启动服务非常简单：

```bash
docker run -d \
    --name ollama \
    -p 11434:11434 \
    --gpus all \
    ollama/ollama:latest
```

服务启动后，可以通过REST API访问Ollama的各种功能。Ollama支持的模型数量众多，包括LLaMA-2、Mistral、Qwen、Yi等主流开源大模型。模型管理通过Ollama命令行工具完成：

```bash
# 进入容器
docker exec -it ollama ollama

# 拉取模型
pull qwen:7b
pull llama2:13b
pull mistral:7b

# 运行推理
run qwen:7b "请介绍一下Docker容器化技术"
```

**Ollama的优势**在于其模型管理的便捷性。Ollama会自动处理模型的下载、缓存和版本管理，避免了手动下载模型权重和配置文件的繁琐。Ollama还支持Modelfile，允许用户自定义模型的行为参数，如系统提示、温度系数、上下文窗口大小等。

**Ollama的局限**也很明显：Ollama主要面向个人用户和本地部署场景，其生产级特性（如高可用、水平扩展、详细的监控指标）相对较弱。对于需要构建高可用推理服务的企业用户，可能需要结合其他方案使用Ollama。

### HuggingFace TGI镜像

HuggingFace Text Generation Inference（TGI）是HuggingFace官方推出的推理服务框架，专为生产级大模型部署设计。TGI镜像提供了完整的推理服务能力，包括批量推理、流式输出、张量并行等高级特性。

**ghcr.io/huggingface/text-generation-inference**是TGI的官方镜像仓库。使用TGI部署Qwen-7B模型的示例：

```bash
docker run -d \
    --name tgi \
    --gpus all \
    -p 8080:80 \
    -e MODEL_ID=Qwen/Qwen-7B-Chat \
    -e QUANTIZE=bitsandbytes \
    -e MAX_INPUT_LENGTH=2048 \
    -e MAX_TOTAL_TOKENS=4096 \
    ghcr.io/huggingface/text-generation-inference:latest
```

**TGI的核心特性**包括：

**批量推理优化**允许同时处理多个推理请求，通过动态批处理（Dynamic Batching）提高GPU利用率。TGI会根据请求情况自动将多个小批量合并成大批量处理，减少GPU空闲时间。

**张量并行**支持将大模型分布到多个GPU上运行。对于单卡无法容纳的大模型（如70B参数规模），张量并行可以将模型切分到多张GPU，突破单卡显存限制。配置非常简单，只需设置`--num_shard`参数即可。

**流式输出**通过Server-Sent Events（SSE）协议实现 token by token的流式返回，显著降低首token延迟，提升用户体验。TGI的流式输出实现经过优化，能够在高并发场景下保持稳定。

**Flash Attention**是TGI默认启用的注意力机制优化。Flash Attention通过IO感知的注意力计算，显著降低显存占用同时提升计算效率，对于长上下文推理尤为重要。

### vLLM官方镜像

vLLM是UC Berkeley开发的PagedAttention推理引擎，以其高吞吐量和低延迟著称。vLLM采用分页注意力机制管理KV缓存，相同硬件条件下，吞吐量可达HuggingFace实现的好几倍。

**vllm/vllm-openai**镜像是vLLM官方提供的兼容OpenAI API的推理服务镜像。使用该镜像启动vLLM服务：

```bash
docker run -d \
    --name vllm \
    --gpus all \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e MODEL_NAME=Qwen/Qwen-7B-Chat \
    -e TOKENIZER=Qwen/Qwen-7B-Chat \
    vllm/vllm-openai:latest
```

**vLLM的性能优势**来源于其创新的PagedAttention技术。传统推理过程中，KV缓存需要连续分配显存，但生成的token数量在请求处理完成前是未知的，导致显存碎片化严重。vLLM将KV缓存组织成固定大小的"页"，像操作虚拟内存一样管理注意力缓存，显著提高显存利用率。

**Continuous Batching**是vLLM的另一项关键技术。传统批处理需要等待一个批次中所有请求完成才能开始下一个批次，造成GPU空闲。Continuous Batching允许完成的请求立即退出，新请求即时加入，实现更高的GPU利用率。

**vLLM的适用场景**包括：对延迟敏感的在线推理服务、需要处理大量并发请求的场景、显存受限但需要运行更大模型的场景。vLLM目前主要支持NVIDIA GPU，对于AMD GPU或CPU推理的支持仍在积极开发中。

### 其他值得关注的镜像

**lmz/cuda-lm-headers**镜像是国内开发者维护的轻量级CUDA开发环境，包含常用的深度学习开发工具和头文件，适合需要自定义CUDA kernel的场景。

**databricks/dolly**镜像是Databricks提供的Dolly大模型推理镜像，包含了优化过的推理服务和模型管理能力，适合企业级部署。

**deepinfra/deepstack**镜像是另一个值得关注的推理服务镜像，支持多种开源大模型，并提供了完善的OpenAI兼容API。

## Docker Compose编排多服务

### 模型服务与API网关编排

在大模型生产环境中，通常需要将推理服务封装成RESTful API对外提供服务。结合API网关可以实现负载均衡、流量控制、安全认证等功能。以下是一个完整的模型服务与API网关编排示例：

```yaml
# docker-compose.yml
version: '3.8'

services:
# vLLM推理服务
vllm-backend:
    image: vllm/vllm-openai:latest
    container_name: vllm-backend
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - TENSOR_PARALLEL_SIZE=1
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s

# Nginx反向代理
api-gateway:
    image: nginx:alpine
   container_name: api-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      vllm-backend:
        condition: service_healthy
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:80/health"]
      interval: 30s
      timeout: 10s
      retries: 3

# Prometheus监控
prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - llm-net

networks:
llm-net:
    driver: bridge

volumes:
model_cache:
prometheus_data:
```

对应的Nginx配置（nginx.conf）：

```nginx
events {
    worker_connections 1024;
}

http {
    upstream vllm_backend {
        server vllm-backend:8000;
        keepalive 64;
    }

    # 限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        listen 80;
        server_name _;

        location /health {
            return 200 'OK';
            add_header Content-Type text/plain;
        }

        location /v1/chat/completions {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # 流式响应配置
            proxy_buffering off;
            proxy_cache off;
          proxy_read_timeout 300s;
            proxy_connect_timeout 75s;

            # 处理Server-Sent Events
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        location /v1/completions {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
        }

        # 模型列表接口
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
        }

        # 静态资源（前端监控面板）
        location /dashboard {
            alias /usr/share/nginx/html;
            index index.html;
        }
    }
}
```

### 模型服务与Redis缓存编排

在大模型推理场景中，合理使用缓存可以显著降低响应延迟和减少GPU计算压力。Redis缓存主要用于存储对话历史、推理结果、模型元数据等信息。

```yaml
# docker-compose-with-cache.yml
version: '3.8'

services:
vllm-backend:
    image: vllm/vllm-openai:latest
    container_name: vllm-backend
  ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
      - KV_CACHE_FP8_QUANTIZED=true
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    depends_on:
      redis-cache:
        condition: service_healthy

# Redis缓存服务
  redis-cache:
    image: redis:7-alpine
    container_name: redis-cache
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

# 应用服务（处理缓存逻辑）
app-service:
    image: python:3.10-slim
    container_name: app-service
    ports:
      - "8080:8080"
    environment:
      - REDIS_HOST=redis-cache
      - REDIS_PORT=6379
      - VLLM_HOST=vllm-backend
      - VLLM_PORT=8000
    volumes:
      - ./app:/app
    working_dir: /app
    command: python app.py
    networks:
      - llm-net
    depends_on:
      redis-cache:
        condition: service_healthy

networks:
llm-net:
    driver: bridge

volumes:
model_cache:
redis_data:
```

应用服务的缓存逻辑示例（app.py）：

```python
import os
import redis
import hashlib
import json
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

redis_client = redis.Redis(
    host=os.getenv('REDIS_HOST', 'localhost'),
    port=int(os.getenv('REDIS_PORT', 6379)),
    decode_responses=True
)

VLLM_URL = f"http://{os.getenv('VLLM_HOST', 'localhost')}:{os.getenv('VLLM_PORT', 8000)}/v1/chat/completions"

def generate_cache_key(messages, temperature, max_tokens):
    """生成缓存键"""
    cache_data = json.dumps({
        'messages': messages,
        'temperature': temperature,
        'max_tokens': max_tokens
    }, sort_keys=True)
    return f"llm:response:{hashlib.sha256(cache_data.encode()).hexdigest()}"

def check_cache(cache_key):
    """检查缓存"""
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    return None

def save_to_cache(cache_key, response_data, ttl=3600):
    """保存到缓存"""
    redis_client.setex(cache_key, ttl, json.dumps(response_data))

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    messages = data.get('messages', [])
    temperature = data.get('temperature', 0.7)
    max_tokens = data.get('max_tokens', 2048)
    use_cache = data.get('use_cache', True)

    # 生成缓存键
    cache_key = generate_cache_key(messages, temperature, max_tokens)

    # 尝试从缓存获取
    if use_cache:
        cached_response = check_cache(cache_key)
        if cached_response:
            cached_response['cached'] = True
            return jsonify(cached_response)

    # 调用vLLM服务
    payload = {
        'model': 'Qwen/Qwen-7B-Chat',
        'messages': messages,
        'temperature': temperature,
        'max_tokens': max_tokens,
        'stream': False
    }

    try:
        response = requests.post(VLLM_URL, json=payload, timeout=120)
        response.raise_for_status()
        result = response.json()

        # 保存到缓存
        if use_cache:
            save_to_cache(cache_key, result)

        result['cached'] = False
        return jsonify(result)
    except requests.exceptions.RequestException as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

图2
docker_compose.png

### 集群模式部署配置

对于大规模推理服务，需要支持多实例部署和负载均衡。以下是一个支持水平扩展的集群部署配置：

```yaml
# docker-compose-cluster.yml
version: '3.8'

services:
# vLLM推理实例1
vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-1
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
    timeout: 10s
      retries: 3
      start_period: 180s

# vLLM推理实例2
vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-2
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8002:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

# HAProxy负载均衡
haproxy:
    image: haproxy:alpine
    container_name: haproxy
    ports:
      - "80:8080"
      - "8404:8404" # HAProxy stats page
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - vllm-1
      - vllm-2
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "haproxy", "-c", "-f", "/usr/local/etc/haproxy/haproxy.cfg"]
      interval: 30s
      timeout: 10s
      retries: 3

# Redis Session存储
redis:
    image: redis:7-alpine
    container_name: redis-session
    command: redis-server --appendonly yes --maxmemory 4gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - llm-net
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

networks:
llm-net:
    driver: bridge

volumes:
model_cache:
redis_data:
```

HAProxy配置（haproxy.cfg）：

```cfg
global
    log stdout format raw local0
    maxconn 4096
    stats socket /var/run/hapee-listen.sock mode 600 level admin
   stats timeout 2m

defaults
    log     global
    mode    http
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    timeout connect 30s
    timeout client 180s
    timeout server 300s
  timeout tunnel 600s
    timeout check   5s

# 统计页面
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /
    stats refresh 30s

# 推理服务负载均衡
frontend llm_frontend
    bind *:8080
    default_backend llm_backend

backend llm_backend
    mode http
    balance roundrobin
    option tcp-check
    tcp-check connect
    tcp-check expect string X-Request-Id

    # vLLM实例
    server vllm1 vllm-1:8000 check inter 30s rise 2 fall 3 weight 100
    server vllm2 vllm-2:8000 check inter 30s rise 2 fall 3 weight 100

    # 健康检查
    http-check expect status 200
    option httpchk GET /health
    http-check expect string OK

    # 响应超时
    timeout server 300s

    # 会话保持
    stick-table type string len 32 size 100k expire 30m
    stick on req.cookie(X-Session-Id)

# 监控端点
frontend monitoring
    bind *:8081
    mode http
    default_backend metrics_backend

backend metrics_backend
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth admin:admin
```

## 容器化性能优化

### 内存与显存配置优化

大模型推理是内存密集型任务，合理配置内存和显存参数对性能影响显著。以下是各主要推理框架的内存优化配置：

**vLLM内存配置：**

```yaml
environment:
- GPU_MEMORY_UTILIZATION=0.9        # GPU显存使用率（默认0.9）
- MAX_MODEL_LEN=8192                # 最大模型长度
- KV_CACHE_FP8_QUANTIZED=true      # KV缓存量化（需要GPU支持）
- BLOCK_SIZE=16                     # KV缓存块大小
- NUM_PREEMPTIVE_SAMPLES=4         # 抢占式采样数
```

`GPU_MEMORY_UTILIZATION`控制vLLM可以使用的GPU显存比例。默认值0.9预留10%显存用于KV缓存和其他开销。如果遇到OOM错误，可以适当降低到0.8或0.7。

**PyTorch内存配置：**

```python
import torch

# 开启梯度检查点以节省显存
model.gradient_checkpointing_enable()

# 设置PyTorch内存分配器策略
torch.cuda.set_per_process_memory_fraction(0.9)

# 启用TF32计算（Ampere架构GPU）
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

**Transformer模型内存优化：**

```python
from transformers import AutoModelForCausalLM

# 启用Flash Attention（需要安装flash-attn）
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    torch_dtype=torch.float16,
    use_flash_attention_2=True # 需要flash-attn包
)

# 4-bit量化加载
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GPU分配策略

在多GPU环境下，选择合适的GPU分配策略对整体吞吐量至关重要。

**单GPU单容器**是最简单的部署模式，适合模型可以完整加载到单张GPU的场景。配置简单，但GPU利用率可能不够高：

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

**多GPU张量并行**将单个模型的权重分布到多张GPU上，适合超大模型或需要极致吞吐量的场景：

```bash
# 使用vLLM的张量并行
docker run --gpus '"device=0,1,2,3"' \
    -e TENSOR_PARALLEL_SIZE=4 \
    vllm/vllm-openai:latest
```

**多实例负载均衡**在每张GPU上运行独立的模型实例，通过负载均衡器分发请求。这种方式的吞吐量通常高于张量并行，因为每个实例可以独立进行批量处理：

```yaml
services:
vllm-gpu0:
    # ...配置...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0

vllm-gpu1:
    # ...配置...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=1
```

### 批量推理优化

批量推理是提高GPU利用率的有效手段。合理的批处理配置可以将吞吐量提升数倍。

**动态批处理（Dynamic Batching）**由推理服务框架自动管理，自动将多个请求合并成批次处理：

```yaml
environment:
# vLLM动态批处理配置
- MAX_NUM_BATCHED_TOKENS=8192      # 单批次最大token数
- MAX_NUM_SEQUENCES=256            # 单批次最大序列数
- SERVING_NAME=Qwen-7B             # 模型名称
```

**连续批处理（Continuous Batching）**是vLLM的默认批处理策略，允许在生成过程中动态添加新请求，无需等待整个批次完成：

```python
# vLLM连续批处理示例
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")

# 批量推理请求
prompts = [
    "请介绍一下Docker: ",
    "什么是Kubernetes: ",
    "解释一下微服务架构: "
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["\n", "用户:", "assistant:"]
)

# 批量生成
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**预批处理（Prefix Caching）**对于具有相同系统提示的对话场景，可以复用已计算的KV缓存，显著降低首token延迟：

```yaml
environment:
- ENABLE_PREFIX_CACHING=true    # 启用前缀缓存
- NUM_SCHEDULER_STEPS=8          # 调度步数
```

### 推理延迟优化

在线推理场景对延迟敏感，需要综合优化以降低响应时间。

**流式输出**允许在生成过程中就开始返回token，显著降低用户的感知延迟：

```python
# 使用流式输出
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
    stream=True # 启用流式输出
)

# 流式生成
for output in llm.generate(["请介绍一下人工智能的发展历史"], sampling_params):
    # output包含已生成的token
    print(output.outputs[0].text, end="", flush=True)
```

**KV缓存量化**可以显著减少显存占用，允许更大的批量或更长的上下文：

```yaml
environment:
- KV_CACHE_DTYPE=fp8              # KV缓存数据类型
```

**CUDA Graph**通过捕获和重放GPU操作来减少内核启动开销：

```yaml
environment:
- ENABLE_CHUNKED_PREFILL=false     # 禁用分块预填充以降低延迟
```

## Docker安全加固

### 容器安全基础配置

大模型推理服务通常处理敏感数据，容器安全加固至关重要。

**以非root用户运行容器**是基本的安全最佳实践：

```dockerfile
# Dockerfile中添加
RUN groupadd -r llmuser && useradd -r -g llmuser llmuser

# ... 安装依赖 ...

USER llmuser
WORKDIR /home/llmuser

CMD ["python", "app.py"]
```

docker-compose.yml中也可以指定运行用户：

```yaml
services:
vllm-backend:
    image: vllm/vllm-openai:latest
    user: "1000:1000" # 指定UID:GID
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
```

**限制容器能力**通过`cap_drop`移除不必要的Linux能力：

```yaml
services:
vllm-backend:
    image: vllm/vllm-openai:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE # 仅添加需要的capability
```

**启用SELinux/AppArmor**为容器提供额外的访问控制：

```bash
# Docker运行时的AppArmor配置
docker run --security-opt "apparmor=docker-default" \
    --gpus all \
    vllm/vllm-openai:latest
```

### 网络安全配置

限制容器的网络访问可以减少攻击面。

**使用自定义网络**隔离容器通信：

```yaml
networks:
llm-internal:
    driver: bridge
    internal: true # 禁止容器访问外部网络

llm-public:
    driver: bridge
```

不同服务分配到不同网络层：

```yaml
services:
# 仅内部网络访问的推理服务
vllm-backend:
    networks:
      - llm-internal
    # 不暴露任何端口

# 可访问公网的API网关
api-gateway:
    networks:
      - llm-public
      - llm-internal
    ports:
      - "80:80"
```

**TLS/SSL加密**保护传输中的数据：

```yaml
services:
api-gateway:
    image: nginx:alpine
    volumes:
      - ./ssl/cert.pem:/etc/nginx/ssl/cert.pem:ro
      - ./ssl/key.pem:/etc/nginx/ssl/key.pem:ro
    command: |
      nginx -c /etc/nginx/nginx.conf -g 'daemon off;'
```

### 镜像安全扫描

构建和部署前扫描镜像漏洞是安全流程的重要环节。

**使用Trivy扫描镜像：**

```bash
# 安装Trivy
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image vllm/vllm-openai:latest

# 扫描已知漏洞
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL \
    vllm/vllm-openai:latest

# 输出漏洞报告
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./reports:/reports \
    aquasec/trivy image \
    --format json --output /reports/vuln-report.json \
    vllm/vllm-openai:latest
```

**Docker Scout**是Docker官方提供的镜像安全分析工具：

```bash
# 登录Docker Hub
docker scout login

# 分析镜像
docker scout cves vllm/vllm-openai:latest

# 比较两个镜像版本
docker scout compare \
    --from vllm/vllm-openai:0.1 \
    --to vllm/vllm-openai:0.2
```

**CI/CD集成安全扫描：**

```yaml
# .github/workflows/security-scan.yml
name: Security Scan

on: [push, pull_request]

jobs:
trivy-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'vllm/vllm-openai:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'
```

### 运行时安全监控

部署后的容器安全监控同样重要。

**Docker内置安全审计：**

```bash
# 查看容器运行状态
docker stats

# 检查容器权限
docker inspect --format '{{.HostConfig.Privileged}}' <container>

# 查看容器的Linux capabilities
docker inspect --format '{{.HostConfig.CapAdd}}' <container>
```

**Falco**是CNCF旗下的容器运行时安全监控工具：

```yaml
# docker-compose-falco.yml
version: '3.8'

services:
falco:
    image: falcosecurity/falco:latest
    container_name: falco
    privileged: true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /host:/host:ro
      - ./falco-rules.yaml:/etc/falco/falco_rules.yaml:ro
    environment:
      - FALCO菲尔斯_ENABLED=true
    network_mode: host
```

Falco规则示例（falco-rules.yaml）：

```yaml
# 检测容器内运行shell
- rule: Shell spawned in container
desc: A shell was spawned by a non-shell program in a container
condition: >
  spawned_process and
    container and
    proc.name = bash
output: >
    Shell spawn detected (user=%user.name container_id=%container.id
    container_name=%container.name shell=%proc.name parent=%proc.pname
    cmdline=%proc.cmdline)
priority: NOTICE

# 检测特权容器
- rule: Privileged container started
desc: Privileged container started
condition: >
    container and
    container.privileged = true
output: >
    Privileged container started (id=%container.id
    image=%container.image.repository
   user=%user.name)
priority: CRITICAL
```

## Kubernetes部署大模型

### kubectl基础配置

Kubernetes（K8s）是生产级容器编排的事实标准，掌握kubectl基础操作是部署大模型的必备技能。

**集群上下文配置：**

```bash
# 查看当前集群上下文
kubectl config current-context

# 列出所有上下文
kubectl config get-contexts

# 切换上下文
kubectl config use-context production-cluster

# 查看集群信息
kubectl cluster-info
```

**Namespace管理：**

```bash
# 创建命名空间
kubectl create namespace llm-production

# 切换默认命名空间
kubectl config set-context --current --namespace=llm-production

# 在指定命名空间操作
kubectl get pods -n llm-production
```

**Deployment创建：**

```yaml
# llm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: qwen-7b-deployment
namespace: llm-production
labels:
    app: qwen-7b
    version: v1
spec:
replicas: 2
selector:
    matchLabels:
      app: qwen-7b
template:
    metadata:
      labels:
        app: qwen-7b
        version: v1
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: MODEL_NAME
          value: "Qwen/Qwen-7B-Chat"
        - name: GPU_MEMORY_UTILIZATION
          value: "0.85"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: CUDA_VISIBLE_DEVICES
          value: "all"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      nodeSelector:
        gpu-type: nvidia-rtx4090
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```

**Service创建：**

```yaml
# llm-service.yaml
apiVersion: v1
kind: Service
metadata:
name: qwen-7b-service
namespace: llm-production
spec:
type: ClusterIP
selector:
    app: qwen-7b
ports:
- name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
```

**应用配置：**

```bash
kubectl apply -f llm-deployment.yaml
kubectl apply -f llm-service.yaml

# 查看部署状态
kubectl get deployments -n llm-production
kubectl get pods -n llm-production

# 查看日志
kubectl logs -n llm-production -l app=qwen-7b --tail=100

# 扩缩容
kubectl scale deployment qwen-7b-deployment --replicas=4 -n llm-production

# 更新镜像版本
kubectl set image deployment/qwen-7b-deployment \
    vllm=vllm/vllm-openai:v0.2.0 -n llm-production

# 回滚
kubectl rollout undo deployment/qwen-7b-deployment -n llm-production
```

### Helm Chart部署

Helm是Kubernetes的包管理工具，使用Helm Chart可以简化大模型服务的部署和配置管理。

**安装Helm：**

```bash
# Linux/macOS
curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz | tar -xz
sudo mv linux-amd64/helm /usr/local/bin/helm

# Windows (使用Chocolatey)
choco install kubernetes-helm
```

**创建Helm Chart：**

```bash
helm create llm-service
cd llm-service
```

**Chart目录结构：**

```
llm-service/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── _helpers.tpl
└── .helmignore
```

**Chart.yaml：**

```yaml
apiVersion: v2
name: llm-service
description: A Helm chart for LLM inference service
type: application
version: 1.0.0
appVersion: "latest"
keywords:
- llm
- vllm
- inference
maintainers:
- name: LLM Team
    email: llm-team@example.com
dependencies:
- name: common
    version: "1.x.x"
    repository: "https://charts.bitnami.com/bitnami"
```

**values.yaml：**

```yaml
replicaCount: 2

image:
repository: vllm/vllm-openai
tag: latest
pullPolicy: IfNotPresent

service:
type: ClusterIP
port: 8000

resources:
limits:
    nvidia.com/gpu: 1
    memory: 32Gi
    cpu: 8
requests:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 4

env:
MODEL_NAME: "Qwen/Qwen-7B-Chat"
GPU_MEMORY_UTILIZATION: "0.85"
MAX_MODEL_LEN: "8192"

autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80

ingress:
enabled: true
className: nginx
annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
    - host: llm.example.com
      paths:
        - path: /
          pathType: Prefix
tls:
    - secretName: llm-tls
      hosts:
        - llm.example.com

persistence:
enabled: true
storageClass: nfs-client
size: 100Gi

nodeSelector:
gpu-type: nvidia-rtx4090

tolerations:
- key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

affinity:
podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - llm-service
          topologyKey: kubernetes.io/hostname
```

**部署Helm Chart：**

```bash
# 添加依赖
helm dependency build llm-service

# 部署
helm install qwen-7b llm-service \
    --namespace llm-production \
    --create-namespace \
    --values llm-service/values.yaml

# 升级
helm upgrade qwen-7b llm-service \
    --namespace llm-production \
    --values llm-service/values-prod.yaml

# 查看部署状态
helm status qwen-7b -n llm-production

# 列出部署
helm list -n llm-production

# 删除部署
helm uninstall qwen-7b -n llm-production
```

### GPU调度与资源管理

Kubernetes原生支持GPU调度，但需要正确配置才能生效。

**检查GPU资源：**

```bash
# 查看节点GPU信息
kubectl describe nodes | grep -A 10 "nvidia.com/gpu"

# 查看GPU可用数量
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

**GPU配额管理：**

```yaml
# ResourceQuota限制命名空间GPU使用
apiVersion: v1
kind: ResourceQuota
metadata:
name: llm-gpu-quota
namespace: llm-production
spec:
hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
---
# LimitRange设置默认值
apiVersion: v1
kind: LimitRange
metadata:
name: llm-limits
namespace: llm-production
spec:
limits:
- max:
      nvidia.com/gpu: 2
    default:
      nvidia.com/gpu: 1
    defaultRequest:
      nvidia.com/gpu: 1
    type: Container
```

**时间片调度（Time Slicing）：** 在共享GPU集群中，可以让多个容器分时使用同一张GPU：

```yaml
# device-plugin配置
apiVersion: apps/v1
kind: ConfigMap
metadata:
name: nvidia-device-plugin-config
namespace: gpu-operator
data:
config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # 将1张GPU划分为4个时间片
```

### 监控与日志集成

生产环境需要完善的监控和日志系统。

**Prometheus监控：**

```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'vllm'
      static_configs:
      - targets: ['vllm-service.llm-production:8000']
      metrics_path: /metrics
```

**Grafana仪表板：**

```yaml
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-grafana-dashboard
namespace: monitoring
labels:
    grafana_dashboard: "1"
data:
vllm-dashboard.json: |
    {
      "dashboard": {
        "title": "vLLM Inference Dashboard",
        "panels": [
          {
            "title": "Requests per Second",
            "type": "graph",
          "targets": [
              {
                "expr": "rate(vllm_requests_total[5m])",
                "legendFormat": "{{model}}"
              }
            ]
          },
          {
            "title": "Average Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))",
                "legendFormat": "p95"
              }
            ]
          },
          {
            "title": "GPU Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "vllm_gpu_memory_usage_bytes / vllm_gpu_memory_total_bytes",
                "legendFormat": "Used {{gpu}}"
              }
           ]
          }
        ]
      }
    }
```

## 实战案例：docker-compose一键部署Qwen-7B推理服务

图4
deployment_flow.png

### 环境准备

在开始部署前，需要确保环境满足以下要求：

**硬件要求：** NVIDIA GPU（至少16GB显存，推荐RTX 4080或更高），8核以上CPU，32GB系统内存，100GB可用磁盘空间。

**软件要求：** Ubuntu 20.04+或CentOS 8+，NVIDIA Driver 535+，Docker 24+，NVIDIA Container Toolkit。

**环境验证：**

```bash
# 检查NVIDIA驱动
nvidia-smi
# 预期输出应显示GPU型号和驱动版本

# 检查Docker
docker --version
# 预期输出：Docker version 24.x.x

# 检查nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
# 预期输出应显示GPU信息
```

### 编写docker-compose配置

创建项目目录并编写docker-compose.yml：

```bash
mkdir -p ~/llm-deploy/qwen-7b
cd ~/llm-deploy/qwen-7b
```

```yaml
# docker-compose.yml
version: '3.8'

services:
# Qwen-7B推理服务
qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    hostname: qwen7b-inference
    ports:
      - "8000:8000"
    environment:
      # 模型配置
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      # 性能优化
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - MAX_NUM_BATCHED_TOKENS=8192
      - MAX_NUM_SEQUENCES=256
      # 推理优化
      - ENABLE_CHUNKED_PREFILL=true
      - ENABLE_PREFIX_CACHING=true
      # GPU配置
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=all
      # 日志配置
      - LOG_LEVEL=INFO
    volumes:
      # HuggingFace模型缓存（避免重复下载）
      - model_cache:/root/.cache/huggingface
      # vLLM服务日志
      - ./logs:/var/log/vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb' # 共享内存（对vLLM重要）
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - qwen-net

# API网关（Nginx）
api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      qwen7b-inference:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - qwen-net

# Prometheus监控
prometheus:
  image: prom/prometheus:latest
    container_name: qwen7b-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
    networks:
      - qwen-net

# Grafana可视化
grafana:
    image: grafana/grafana:latest
    container_name: qwen7b-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - qwen-net

networks:
qwen-net:
    driver: bridge

volumes:
model_cache:
```

### Nginx配置

```bash
mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs
```

```nginx
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type application/octet-stream;

    # 日志格式
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log warn;

    # 性能优化
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Gzip压缩
    gzip on;
   gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    # 限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {
        server qwen7b-inference:8000;
        keepalive 64;
    }

    server {
        listen 80;
        server_name _;

        # 健康检查
      location /health {
            access_log off;
            return 200 "OK";
            add_header Content-Type text/plain;
        }

        # OpenAI兼容API - Chat Completions
        location /v1/chat/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            # 代理头设置
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # 流式响应（关键配置）
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;

            # SSE必需配置
            proxy_set_header Connection '';
            chunked_transfer_encoding on;

            # 记录响应时间
            add_header X-Response-Time $upstream_response_time;
        }

        # OpenAI兼容API - Completions
        location /v1/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        # 模型列表
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # vLLM直接接口
        location /v1 {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Prometheus指标（仅内网访问）
        location /metrics {
            proxy_pass http://prometheus:9090;
            proxy_set_header Host $host;
            allow 127.0.0.1;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            deny all;
        }
    }
}
```

### Prometheus配置

```yaml
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s

scrape_configs:
- job_name: 'vllm'
    static_configs:
      - targets: ['qwen7b-inference:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
```

### 启动服务

```bash
# 进入项目目录
cd ~/llm-deploy/qwen-7b

# 拉取镜像（提前完成，节省首次启动时间）
docker-compose pull

# 启动服务（后台运行）
docker-compose up -d

# 查看服务状态
docker-compose ps

# 查看日志
docker-compose logs -f qwen7b-inference

# 等待服务就绪后测试
# 首次启动需要下载模型，可能需要几分钟
sleep 120

# 健康检查
curl http://localhost:8000/health

# 模型列表
curl http://localhost:8000/v1/models

# 测试推理
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen-7B-Chat",
    "messages": [
      {"role": "system", "content": "你是一个有帮助的AI助手。"},
      {"role": "user", "content": "请介绍一下Docker容器化技术的优势。"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
}'
```

### 性能测试

```bash
# 安装测试工具
pip install locust

# 创建负载测试脚本 locustfile.py
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between
import json

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://localhost"

    @task(3)
    def chat_completion(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "system", "content": "你是一个有帮助的AI助手。"},
                {"role": "user", "content": "请用100字介绍一下人工智能。"}
            ],
            "max_tokens": 200,
            "temperature": 0.7,
            "stream": False
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

    @task(1)
    def chat_completion_stream(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "user", "content": "解释一下什么是机器学习。"}
            ],
            "max_tokens": 256,
            "temperature": 0.7,
            "stream": True
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
EOF

# 运行负载测试（10个并发用户，持续60秒）
locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

# Web界面测试（可选）
locust -f locustfile.py --web-port 8089
```

## 容器化后的Spring Boot对接方案

### 概述

在企业级应用中，Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后，如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

### RestTemplate对接方案

最直接的对接方式是使用Spring的RestTemplate或WebClient调用大模型API：

```java
package com.example.llm.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(10000);      // 连接超时10秒
        factory.setReadTimeout(120000);         // 读取超时120秒（大模型推理可能较长）
        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);
  }
}
```

### OpenAI兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务，封装一个统一的客户端：

```java
package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.Data;
import lombok.Builder;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

import java.util.List;
import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;
    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {
        this.restTemplate = restTemplate;
        this.baseUrl = baseUrl;
    }

    /**
     * 聊天补全
     */
    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(
            url,
            HttpMethod.POST,
            entity,
            ChatCompletionResponse.class
        );

        return response.getBody();
    }

    /**
     * 流式聊天补全
     */
    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(
                url,
                HttpMethod.POST,
                entity,
                new ParameterizedTypeReference<ServerSentEvent<String>>() {}
            )
            .getBody()
            .stream()
            .map(ServerSentEvent::data);
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionRequest {
        private String model;
        private List<Message> messages;
        private Double temperature;
        @JsonProperty("max_tokens")
        private Integer maxTokens;
       @JsonProperty("top_p")
        private Double topP;
        private Boolean stream;
        private List<String> stop;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Message {
        private String role;
        private String content;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionResponse {
        private String id;
        private String object;
        private Long created;
        private String model;
        private List<Choice> choices;
        private Usage usage;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Choice {
        private Integer index;
        private Message message;
        @JsonProperty("finish_reason")
        private String finishReason;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Usage {
        @JsonProperty("prompt_tokens")
        private Integer promptTokens;
        @JsonProperty("completion_tokens")
        private Integer completionTokens;
        @JsonProperty("total_tokens")
        private Integer totalTokens;
    }
}
```

### Spring Boot服务实现

```java
package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;
import com.example.llm.client.LLMClient.ChatCompletionResponse;
import com.example.llm.client.LLMClient.LLMClient;
import com.example.llm.client.LLMClient.Message;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@RequiredArgsConstructor
@Slf4j
public class LLMService {

    private final LLMClient llmClient;

    /**
     * 同步对话
     */
    public String chat(String userMessage) {
        return chat(userMessage, null);
    }

    /**
     * 同步对话（带系统提示）
     */
    public String chat(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {
            ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {
                return response.getChoices().get(0).getMessage().getContent();
            }
        } catch (Exception e) {
            log.error("LLM调用失败", e);
            throw new RuntimeException("LLM服务调用失败: " + e.getMessage(), e);
        }

        return null;
    }

    /**
     * 流式对话
     */
    public Flux<String> chatStream(String userMessage) {
        return chatStream(userMessage, null);
    }

    /**
     * 流式对话（带系统提示）
     */
    public Flux<String> chatStream(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)
            .filter(data -> !data.startsWith("[DONE]"))
            .map(data -> {
                // 解析SSE数据
                if (data.startsWith("data:")) {
                    data = data.substring(5).trim();
                }
                return data;
            });
    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {
        List<Message> messages = Arrays.asList(
            Message.builder()
                .role("system")
                .content(systemPrompt != null ? systemPrompt : "你是一个有帮助的AI助手。")
                .build(),
            Message.builder()
                .role("user")
                .content(userMessage)
                .build()
        );

        return ChatCompletionRequest.builder()
            .model("Qwen/Qwen-7B-Chat")
            .messages(messages)
            .temperature(0.7)
            .maxTokens(2048)
            .stream(stream)
            .build();
    }

    /**
     * 异步对话
     */
    public CompletableFuture<String> chatAsync(String userMessage) {
        return CompletableFuture.supplyAsync(() -> chat(userMessage))
            .orTimeout(120, TimeUnit.SECONDS);
    }

    /**
     * 批量对话
     */
    public List<String> batchChat(List<String> messages) {
        return messages.parallelStream()
            .map(this::chat)
            .toList();
    }
}
```

### 控制器实现

```java
package com.example.llm.controller;

import com.example.llm.service.LLMService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
@RequiredArgsConstructor
@Slf4j
public class LLMController {

    private final LLMService llmService;

    /**
     * 同步对话接口
     */
    @PostMapping("/chat")
    public Map<String, Object> chat(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();
        result.put("code", 200);
        result.put("message", "success");
        result.put("data", response);

        return result;
    }

    /**
     * 流式对话接口
     */
    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)
            .map(data -> ServerSentEvent.<String>builder()
                .data(data)
                .build()
            )
            .onErrorResume(e -> {
                log.error("流式对话异常", e);
                return Mono.just(ServerSentEvent.<String>builder()
                    .data("[ERROR]" + e.getMessage())
                    .build());
            })
            .concatWith(Mono.just(ServerSentEvent.<String>builder()
                .data("[DONE]")
                .build()));
    }

    /**
     * 异步对话接口
     */
    @PostMapping("/chat/async")
    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");

        log.info("收到异步聊天请求: {}", userMessage);

        // 立即返回，后续异步处理
        llmService.chatAsync(userMessage)
            .thenAccept(response -> {
                log.info("异步任务完成: {}", response);
                // 这里可以添加回调逻辑，如发送WebSocket消息
            })
            .exceptionally(e -> {
                log.error("异步任务失败", e);
                return null;
            });

        Map<String, Object> result = new HashMap<>();
        result.put("code", 202);
        result.put("message", "请求已接受，正在异步处理");

        return result;
    }
}
```

### 熔断与重试配置

```java
package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
@Slf4j
public class ResilienceConfig {

    @Bean
    public CircuitBreaker llmCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    // 失败率阈值50%
            .slowCallRateThreshold(80)                    // 慢调用率阈值80%
            .slowCallDurationThreshold(Duration.ofSeconds(30)) // 慢调用阈值30秒
            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟
            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                        // 滑动窗口大小
            .minimumNumberOfCalls(5)                      // 最小调用次数
            .build();

        return CircuitBreaker.of("llmService", config);
    }

    @Bean
    public Retry llmRetry() {
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)                               // 最大重试次数
            .waitDuration(Duration.ofSeconds(2))          // 重试间隔
            .retryExceptions(Exception.class)             // 重试的异常类型
            .ignoreExceptions(IllegalArgumentException.class) // 忽略的异常
            .build();

        return Retry.of("llmService", config);
    }
}
```

### 应用配置

```yaml
# application.yml
spring:
application:
    name: llm-spring-boot-client

server:
port: 8080

llm:
# 大模型服务地址
base-url: http://localhost:80
# 连接超时（毫秒）
connect-timeout: 10000
# 读取超时（毫秒）
read-timeout: 120000

# 熔断器配置
resilience4j:
circuitbreaker:
    configs:
      default:
        failureRateThreshold: 50
        waitDurationInOpenState: 60s
    instances:
      llmService:
        registerHealthIndicator: true
        slidingWindowSize: 10

# 日志配置
logging:
level:
    com.example.llm: DEBUG
    org.springframework.web.client: DEBUG
```

### Docker化Spring Boot应用

如果需要将Spring Boot应用也容器化部署，可以编写Dockerfile：

```dockerfile
# Multi-stage build
FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

# 复制Maven配置
COPY pom.xml .
COPY src ./src

# 下载依赖（利用缓存）
RUN ./mvnw dependency:go-offline -B

# 构建
RUN ./mvnw package -DskipTests -B

# 运行阶段
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

# 安全配置
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
RUN chown -R appuser:appgroup /app

# 复制jar包
COPY --from=builder /app/target/*.jar app.jar

# 复制模型缓存（可选，共享HuggingFace缓存）
VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]
```

相应的docker-compose配置：

```yaml
version: '3.8'

services:
spring-boot-app:
    build: ./spring-boot-app
    container_name: spring-boot-llm-client
    ports:
      - "8080:8080"
    environment:
      - LLM_BASE_URL=http://qwen7b-gateway:80
      - SPRING_PROFILES_ACTIVE=production
    depends_on:
      - api-gateway
    networks:
      - qwen-net

api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - qwen7b-inference
    networks:
      - qwen-net

qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'
    networks:
      - qwen-net

networks:
qwen-net:
    driver: bridge

volumes:
model_cache:
```

## 总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍，我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾：**

1. **环境配置**：正确安装nvidia-container-toolkit是GPU容器化的前提，CUDA版本兼容性需要仔细规划。

2. **镜像制作**：选择合适的Base Image，合理利用多阶段构建减小镜像体积，通过层缓存优化构建速度。

3. **服务编排**：Docker Compose适合单主机多容器编排，Kubernetes适合大规模生产部署。

4. **性能优化**：合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**：以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**：通过封装统一的API客户端，Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展，大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

▲ 图3

大模型Docker镜像制作

Base Image选择策略

选择合适的基础镜像是构建大模型Docker镜像的第一步。NVIDIA官方维护的CUDA基础镜像是最佳选择，它们经过严格测试，包含了经过验证的驱动和库组合。

**Ubuntu基础镜像 vs CentOS基础镜像：** Ubuntu是目前主流的Docker镜像操作系统，拥有更广泛的软件包支持和更快的安全更新。对于需要频繁更新安全补丁的生产环境，Ubuntu是更好的选择。CentOS则以其稳定性著称，适合对系统一致性要求极高的企业级场景。

**-devel镜像 vs runtime镜像：** NVIDIA CUDA镜像分为`devel`和`runtime`两个版本。`devel`镜像包含完整的编译工具链（gcc、nvcc等），体积较大（约十几GB），适合需要编译CUDA程序或构建依赖CUDA的程序场景。`runtime`镜像只包含运行时库，体积较小（约5GB），适合直接运行已编译好的CUDA程序。对于大模型推理服务，推荐使用`runtime`镜像以减小镜像体积和攻击面。

** alpine基础镜像的局限性：** 部分大模型官方镜像使用alpine作为基础系统以减小体积。但alpine的libc为musl而非glibc，可能导致某些Python扩展（如numpy、scipy的底层C库）编译或运行问题。如果遇到奇怪的动态链接错误，可能需要切换到Ubuntu或CentOS基础镜像。

CUDA版本与依赖环境配置

构建大模型Docker镜像时，需要在Dockerfile中显式指定CUDA版本和必要的依赖环境。以下是一个典型的Dockerfile结构：

### 集群模式部署配置

对于大规模推理服务，需要支持多实例部署和负载均衡。以下是一个支持水平扩展的集群部署配置：

```yaml
# docker-compose-cluster.yml
version: '3.8'

services:
# vLLM推理实例1
vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-1
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

# vLLM推理实例2
vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-2
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8002:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

# HAProxy负载均衡
haproxy:
    image: haproxy:alpine
    container_name: haproxy
    ports:
      - "80:8080"
      - "8404:8404" # HAProxy stats page
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - vllm-1
      - vllm-2
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "haproxy", "-c", "-f", "/usr/local/etc/haproxy/haproxy.cfg"]
      interval: 30s
      timeout: 10s
      retries: 3

# Redis Session存储
redis:
    image: redis:7-alpine
    container_name: redis-session
    command: redis-server --appendonly yes --maxmemory 4gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - llm-net
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

networks:
llm-net:
    driver: bridge

volumes:
model_cache:
redis_data:
```

HAProxy配置（haproxy.cfg）：

```cfg
global
    log stdout format raw local0
    maxconn 4096
    stats socket /var/run/hapee-listen.sock mode 600 level admin
    stats timeout 2m

defaults
    log     global
    mode    http
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    timeout connect 30s
    timeout client 180s
    timeout server 300s
    timeout tunnel 600s
    timeout check   5s

# 统计页面
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /
    stats refresh 30s

# 推理服务负载均衡
frontend llm_frontend
    bind *:8080
    default_backend llm_backend

backend llm_backend
    mode http
    balance roundrobin
    option tcp-check
    tcp-check connect
    tcp-check expect string X-Request-Id

    # vLLM实例
    server vllm1 vllm-1:8000 check inter 30s rise 2 fall 3 weight 100
    server vllm2 vllm-2:8000 check inter 30s rise 2 fall 3 weight 100

    # 健康检查
    http-check expect status 200
    option httpchk GET /health
    http-check expect string OK

    # 响应超时
    timeout server 300s

    # 会话保持
    stick-table type string len 32 size 100k expire 30m
    stick on req.cookie(X-Session-Id)

# 监控端点
frontend monitoring
    bind *:8081
    mode http
    default_backend metrics_backend

backend metrics_backend
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth admin:admin
```

## 容器化性能优化

### 内存与显存配置优化

大模型推理是内存密集型任务，合理配置内存和显存参数对性能影响显著。以下是各主要推理框架的内存优化配置：

**vLLM内存配置：**

```yaml
environment:
- GPU_MEMORY_UTILIZATION=0.9       # GPU显存使用率（默认0.9）
- MAX_MODEL_LEN=8192                # 最大模型长度
- KV_CACHE_FP8_QUANTIZED=true      # KV缓存量化（需要GPU支持）
- BLOCK_SIZE=16                     # KV缓存块大小
- NUM_PREEMPTIVE_SAMPLES=4         # 抢占式采样数
```

`GPU_MEMORY_UTILIZATION`控制vLLM可以使用的GPU显存比例。默认值0.9预留10%显存用于KV缓存和其他开销。如果遇到OOM错误，可以适当降低到0.8或0.7。

**PyTorch内存配置：**

```python
import torch

# 开启梯度检查点以节省显存
model.gradient_checkpointing_enable()

# 设置PyTorch内存分配器策略
torch.cuda.set_per_process_memory_fraction(0.9)

# 启用TF32计算（Ampere架构GPU）
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

**Transformer模型内存优化：**

```python
from transformers import AutoModelForCausalLM

# 启用Flash Attention（需要安装flash-attn）
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    torch_dtype=torch.float16,
    use_flash_attention_2=True # 需要flash-attn包
)

# 4-bit量化加载
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GPU分配策略

在多GPU环境下，选择合适的GPU分配策略对整体吞吐量至关重要。

**单GPU单容器**是最简单的部署模式，适合模型可以完整加载到单张GPU的场景。配置简单，但GPU利用率可能不够高：

```yaml
deploy:
resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

**多GPU张量并行**将单个模型的权重分布到多张GPU上，适合超大模型或需要极致吞吐量的场景：

```bash
# 使用vLLM的张量并行
docker run --gpus '"device=0,1,2,3"' \
    -e TENSOR_PARALLEL_SIZE=4 \
    vllm/vllm-openai:latest
```

**多实例负载均衡**在每张GPU上运行独立的模型实例，通过负载均衡器分发请求。这种方式的吞吐量通常高于张量并行，因为每个实例可以独立进行批量处理：

```yaml
services:
vllm-gpu0:
    # ...配置...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0

vllm-gpu1:
    # ...配置...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=1
```

### 批量推理优化

批量推理是提高GPU利用率的有效手段。合理的批处理配置可以将吞吐量提升数倍。

**动态批处理（Dynamic Batching）**由推理服务框架自动管理，自动将多个请求合并成批次处理：

```yaml
environment:
# vLLM动态批处理配置
- MAX_NUM_BATCHED_TOKENS=8192      # 单批次最大token数
- MAX_NUM_SEQUENCES=256            # 单批次最大序列数
- SERVING_NAME=Qwen-7B             # 模型名称
```

**连续批处理（Continuous Batching）**是vLLM的默认批处理策略，允许在生成过程中动态添加新请求，无需等待整个批次完成：

```python
# vLLM连续批处理示例
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")

# 批量推理请求
prompts = [
    "请介绍一下Docker: ",
    "什么是Kubernetes: ",
    "解释一下微服务架构: "
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["\n", "用户:", "assistant:"]
)

# 批量生成
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**预批处理（Prefix Caching）**对于具有相同系统提示的对话场景，可以复用已计算的KV缓存，显著降低首token延迟：

```yaml
environment:
- ENABLE_PREFIX_CACHING=true    # 启用前缀缓存
- NUM_SCHEDULER_STEPS=8          # 调度步数
```

### 推理延迟优化

在线推理场景对延迟敏感，需要综合优化以降低响应时间。

**流式输出**允许在生成过程中就开始返回token，显著降低用户的感知延迟：

```python
# 使用流式输出
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
    stream=True # 启用流式输出
)

# 流式生成
for output in llm.generate(["请介绍一下人工智能的发展历史"], sampling_params):
    # output包含已生成的token
    print(output.outputs[0].text, end="", flush=True)
```

**KV缓存量化**可以显著减少显存占用，允许更大的批量或更长的上下文：

```yaml
environment:
- KV_CACHE_DTYPE=fp8              # KV缓存数据类型
```

**CUDA Graph**通过捕获和重放GPU操作来减少内核启动开销：

```yaml
environment:
- ENABLE_CHUNKED_PREFILL=false     # 禁用分块预填充以降低延迟
```

## Docker安全加固

### 容器安全基础配置

大模型推理服务通常处理敏感数据，容器安全加固至关重要。

**以非root用户运行容器**是基本的安全最佳实践：

```dockerfile
# Dockerfile中添加
RUN groupadd -r llmuser && useradd -r -g llmuser llmuser

# ... 安装依赖 ...

USER llmuser
WORKDIR /home/llmuser

CMD ["python", "app.py"]
```

docker-compose.yml中也可以指定运行用户：

```yaml
services:
vllm-backend:
    image: vllm/vllm-openai:latest
    user: "1000:1000" # 指定UID:GID
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
```

**限制容器能力**通过`cap_drop`移除不必要的Linux能力：

```yaml
services:
vllm-backend:
    image: vllm/vllm-openai:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE # 仅添加需要的capability
```

**启用SELinux/AppArmor**为容器提供额外的访问控制：

```bash
# Docker运行时的AppArmor配置
docker run --security-opt "apparmor=docker-default" \
    --gpus all \
    vllm/vllm-openai:latest
```

### 网络安全配置

限制容器的网络访问可以减少攻击面。

**使用自定义网络**隔离容器通信：

```yaml
networks:
llm-internal:
    driver: bridge
    internal: true # 禁止容器访问外部网络

llm-public:
    driver: bridge
```

不同服务分配到不同网络层：

```yaml
services:
# 仅内部网络访问的推理服务
vllm-backend:
    networks:
      - llm-internal
    # 不暴露任何端口

# 可访问公网的API网关
api-gateway:
    networks:
      - llm-public
      - llm-internal
    ports:
      - "80:80"
```

**TLS/SSL加密**保护传输中的数据：

```yaml
services:
api-gateway:
    image: nginx:alpine
    volumes:
      - ./ssl/cert.pem:/etc/nginx/ssl/cert.pem:ro
      - ./ssl/key.pem:/etc/nginx/ssl/key.pem:ro
    command: |
      nginx -c /etc/nginx/nginx.conf -g 'daemon off;'
```

### 镜像安全扫描

构建和部署前扫描镜像漏洞是安全流程的重要环节。

**使用Trivy扫描镜像：**

```bash
# 安装Trivy
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image vllm/vllm-openai:latest

# 扫描已知漏洞
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL \
    vllm/vllm-openai:latest

# 输出漏洞报告
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./reports:/reports \
    aquasec/trivy image \
    --format json --output /reports/vuln-report.json \
    vllm/vllm-openai:latest
```

**Docker Scout**是Docker官方提供的镜像安全分析工具：

```bash
# 登录Docker Hub
docker scout login

# 分析镜像
docker scout cves vllm/vllm-openai:latest

# 比较两个镜像版本
docker scout compare \
    --from vllm/vllm-openai:0.1 \
    --to vllm/vllm-openai:0.2
```

**CI/CD集成安全扫描：**

```yaml
# .github/workflows/security-scan.yml
name: Security Scan

on: [push, pull_request]

jobs:
trivy-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'vllm/vllm-openai:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
       with:
          sarif_file: 'trivy-results.sarif'
```

### 运行时安全监控

部署后的容器安全监控同样重要。

**Docker内置安全审计：**

```bash
# 查看容器运行状态
docker stats

# 检查容器权限
docker inspect --format '{{.HostConfig.Privileged}}' <container>

# 查看容器的Linux capabilities
docker inspect --format '{{.HostConfig.CapAdd}}' <container>
```

**Falco**是CNCF旗下的容器运行时安全监控工具：

```yaml
# docker-compose-falco.yml
version: '3.8'

services:
falco:
    image: falcosecurity/falco:latest
    container_name: falco
    privileged: true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /host:/host:ro
      - ./falco-rules.yaml:/etc/falco/falco_rules.yaml:ro
    environment:
      - FALCO菲尔斯_ENABLED=true
    network_mode: host
```

Falco规则示例（falco-rules.yaml）：

```yaml
# 检测容器内运行shell
- rule: Shell spawned in container
desc: A shell was spawned by a non-shell program in a container
condition: >
    spawned_process and
    container and
    proc.name = bash
output: >
    Shell spawn detected (user=%user.name container_id=%container.id
    container_name=%container.name shell=%proc.name parent=%proc.pname
    cmdline=%proc.cmdline)
priority: NOTICE

# 检测特权容器
- rule: Privileged container started
desc: Privileged container started
condition: >
    container and
    container.privileged = true
output: >
    Privileged container started (id=%container.id
    image=%container.image.repository
    user=%user.name)
priority: CRITICAL
```

## Kubernetes部署大模型

### kubectl基础配置

Kubernetes（K8s）是生产级容器编排的事实标准，掌握kubectl基础操作是部署大模型的必备技能。

**集群上下文配置：**

```bash
# 查看当前集群上下文
kubectl config current-context

# 列出所有上下文
kubectl config get-contexts

# 切换上下文
kubectl config use-context production-cluster

# 查看集群信息
kubectl cluster-info
```

**Namespace管理：**

```bash
# 创建命名空间
kubectl create namespace llm-production

# 切换默认命名空间
kubectl config set-context --current --namespace=llm-production

# 在指定命名空间操作
kubectl get pods -n llm-production
```

**Deployment创建：**

```yaml
# llm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: qwen-7b-deployment
namespace: llm-production
labels:
    app: qwen-7b
    version: v1
spec:
replicas: 2
selector:
    matchLabels:
      app: qwen-7b
template:
    metadata:
      labels:
        app: qwen-7b
        version: v1
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: MODEL_NAME
          value: "Qwen/Qwen-7B-Chat"
        - name: GPU_MEMORY_UTILIZATION
          value: "0.85"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: CUDA_VISIBLE_DEVICES
          value: "all"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
       readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      nodeSelector:
        gpu-type: nvidia-rtx4090
      tolerations:
      - key: "nvidia.com/gpu"
       operator: "Exists"
        effect: "NoSchedule"
```

**Service创建：**

```yaml
# llm-service.yaml
apiVersion: v1
kind: Service
metadata:
name: qwen-7b-service
namespace: llm-production
spec:
type: ClusterIP
selector:
    app: qwen-7b
ports:
- name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
```

**应用配置：**

```bash
kubectl apply -f llm-deployment.yaml
kubectl apply -f llm-service.yaml

# 查看部署状态
kubectl get deployments -n llm-production
kubectl get pods -n llm-production

# 查看日志
kubectl logs -n llm-production -l app=qwen-7b --tail=100

# 扩缩容
kubectl scale deployment qwen-7b-deployment --replicas=4 -n llm-production

# 更新镜像版本
kubectl set image deployment/qwen-7b-deployment \
    vllm=vllm/vllm-openai:v0.2.0 -n llm-production

# 回滚
kubectl rollout undo deployment/qwen-7b-deployment -n llm-production
```

### Helm Chart部署

Helm是Kubernetes的包管理工具，使用Helm Chart可以简化大模型服务的部署和配置管理。

**安装Helm：**

```bash
# Linux/macOS
curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz | tar -xz
sudo mv linux-amd64/helm /usr/local/bin/helm

# Windows (使用Chocolatey)
choco install kubernetes-helm
```

**创建Helm Chart：**

```bash
helm create llm-service
cd llm-service
```

**Chart目录结构：**

```
llm-service/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── _helpers.tpl
└── .helmignore
```

**Chart.yaml：**

```yaml
apiVersion: v2
name: llm-service
description: A Helm chart for LLM inference service
type: application
version: 1.0.0
appVersion: "latest"
keywords:
- llm
- vllm
- inference
maintainers:
- name: LLM Team
    email: llm-team@example.com
dependencies:
- name: common
    version: "1.x.x"
    repository: "https://charts.bitnami.com/bitnami"
```

**values.yaml：**

```yaml
replicaCount: 2

image:
repository: vllm/vllm-openai
tag: latest
pullPolicy: IfNotPresent

service:
type: ClusterIP
port: 8000

resources:
limits:
    nvidia.com/gpu: 1
    memory: 32Gi
    cpu: 8
requests:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 4

env:
MODEL_NAME: "Qwen/Qwen-7B-Chat"
GPU_MEMORY_UTILIZATION: "0.85"
MAX_MODEL_LEN: "8192"

autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80

ingress:
enabled: true
className: nginx
annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
    - host: llm.example.com
      paths:
        - path: /
          pathType: Prefix
tls:
    - secretName: llm-tls
      hosts:
        - llm.example.com

persistence:
enabled: true
storageClass: nfs-client
size: 100Gi

nodeSelector:
gpu-type: nvidia-rtx4090

tolerations:
- key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

affinity:
podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - llm-service
          topologyKey: kubernetes.io/hostname
```

**部署Helm Chart：**

```bash
# 添加依赖
helm dependency build llm-service

# 部署
helm install qwen-7b llm-service \
    --namespace llm-production \
    --create-namespace \
    --values llm-service/values.yaml

# 升级
helm upgrade qwen-7b llm-service \
    --namespace llm-production \
    --values llm-service/values-prod.yaml

# 查看部署状态
helm status qwen-7b -n llm-production

# 列出部署
helm list -n llm-production

# 删除部署
helm uninstall qwen-7b -n llm-production
```

### GPU调度与资源管理

Kubernetes原生支持GPU调度，但需要正确配置才能生效。

**检查GPU资源：**

```bash
# 查看节点GPU信息
kubectl describe nodes | grep -A 10 "nvidia.com/gpu"

# 查看GPU可用数量
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

**GPU配额管理：**

```yaml
# ResourceQuota限制命名空间GPU使用
apiVersion: v1
kind: ResourceQuota
metadata:
name: llm-gpu-quota
namespace: llm-production
spec:
hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
---
# LimitRange设置默认值
apiVersion: v1
kind: LimitRange
metadata:
name: llm-limits
namespace: llm-production
spec:
limits:
- max:
      nvidia.com/gpu: 2
    default:
      nvidia.com/gpu: 1
    defaultRequest:
      nvidia.com/gpu: 1
    type: Container
```

**时间片调度（Time Slicing）：** 在共享GPU集群中，可以让多个容器分时使用同一张GPU：

```yaml
# device-plugin配置
apiVersion: apps/v1
kind: ConfigMap
metadata:
name: nvidia-device-plugin-config
namespace: gpu-operator
data:
config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # 将1张GPU划分为4个时间片
```

### 监控与日志集成

生产环境需要完善的监控和日志系统。

**Prometheus监控：**

```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'vllm'
      static_configs:
      - targets: ['vllm-service.llm-production:8000']
      metrics_path: /metrics
```

**Grafana仪表板：**

```yaml
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-grafana-dashboard
namespace: monitoring
labels:
    grafana_dashboard: "1"
data:
vllm-dashboard.json: |
    {
      "dashboard": {
        "title": "vLLM Inference Dashboard",
        "panels": [
          {
            "title": "Requests per Second",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(vllm_requests_total[5m])",
                "legendFormat": "{{model}}"
              }
            ]
          },
          {
            "title": "Average Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))",
                "legendFormat": "p95"
              }
            ]
          },
          {
            "title": "GPU Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "vllm_gpu_memory_usage_bytes / vllm_gpu_memory_total_bytes",
                "legendFormat": "Used {{gpu}}"
              }
            ]
          }
        ]
      }
    }
```

## 实战案例：docker-compose一键部署Qwen-7B推理服务

图4
deployment_flow.png

### 环境准备

在开始部署前，需要确保环境满足以下要求：

**硬件要求：** NVIDIA GPU（至少16GB显存，推荐RTX 4080或更高），8核以上CPU，32GB系统内存，100GB可用磁盘空间。

**软件要求：** Ubuntu 20.04+或CentOS 8+，NVIDIA Driver 535+，Docker 24+，NVIDIA Container Toolkit。

**环境验证：**

```bash
# 检查NVIDIA驱动
nvidia-smi
# 预期输出应显示GPU型号和驱动版本

# 检查Docker
docker --version
# 预期输出：Docker version 24.x.x

# 检查nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
# 预期输出应显示GPU信息
```

### 编写docker-compose配置

创建项目目录并编写docker-compose.yml：

```bash
mkdir -p ~/llm-deploy/qwen-7b
cd ~/llm-deploy/qwen-7b
```

```yaml
# docker-compose.yml
version: '3.8'

services:
# Qwen-7B推理服务
qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    hostname: qwen7b-inference
    ports:
      - "8000:8000"
    environment:
      # 模型配置
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      # 性能优化
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - MAX_NUM_BATCHED_TOKENS=8192
      - MAX_NUM_SEQUENCES=256
      # 推理优化
      - ENABLE_CHUNKED_PREFILL=true
      - ENABLE_PREFIX_CACHING=true
      # GPU配置
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=all
      # 日志配置
      - LOG_LEVEL=INFO
    volumes:
      # HuggingFace模型缓存（避免重复下载）
      - model_cache:/root/.cache/huggingface
      # vLLM服务日志
      - ./logs:/var/log/vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb' # 共享内存（对vLLM重要）
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - qwen-net

# API网关（Nginx）
api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      qwen7b-inference:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - qwen-net

# Prometheus监控
prometheus:
    image: prom/prometheus:latest
    container_name: qwen7b-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
    networks:
      - qwen-net

# Grafana可视化
grafana:
    image: grafana/grafana:latest
    container_name: qwen7b-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - qwen-net

networks:
qwen-net:
    driver: bridge

volumes:
model_cache:
```

### Nginx配置

```bash
mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs
```

```nginx
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type application/octet-stream;

    # 日志格式
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log warn;

    # 性能优化
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Gzip压缩
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    # 限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {
        server qwen7b-inference:8000;
        keepalive 64;
    }

    server {
        listen 80;
        server_name _;

        # 健康检查
        location /health {
            access_log off;
            return 200 "OK";
            add_header Content-Type text/plain;
        }

      # OpenAI兼容API - Chat Completions
        location /v1/chat/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

          # 代理头设置
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # 流式响应（关键配置）
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;

            # SSE必需配置
            proxy_set_header Connection '';
            chunked_transfer_encoding on;

            # 记录响应时间
            add_header X-Response-Time $upstream_response_time;
        }

        # OpenAI兼容API - Completions
        location /v1/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        # 模型列表
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # vLLM直接接口
        location /v1 {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Prometheus指标（仅内网访问）
        location /metrics {
            proxy_pass http://prometheus:9090;
            proxy_set_header Host $host;
          allow 127.0.0.1;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            deny all;
        }
    }
}
```

### Prometheus配置

```yaml
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s

scrape_configs:
- job_name: 'vllm'
    static_configs:
      - targets: ['qwen7b-inference:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
```

### 启动服务

```bash
# 进入项目目录
cd ~/llm-deploy/qwen-7b

# 拉取镜像（提前完成，节省首次启动时间）
docker-compose pull

# 启动服务（后台运行）
docker-compose up -d

# 查看服务状态
docker-compose ps

# 查看日志
docker-compose logs -f qwen7b-inference

# 等待服务就绪后测试
# 首次启动需要下载模型，可能需要几分钟
sleep 120

# 健康检查
curl http://localhost:8000/health

# 模型列表
curl http://localhost:8000/v1/models

# 测试推理
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen-7B-Chat",
    "messages": [
      {"role": "system", "content": "你是一个有帮助的AI助手。"},
      {"role": "user", "content": "请介绍一下Docker容器化技术的优势。"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
}'
```

### 性能测试

```bash
# 安装测试工具
pip install locust

# 创建负载测试脚本 locustfile.py
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between
import json

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://localhost"

    @task(3)
    def chat_completion(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "system", "content": "你是一个有帮助的AI助手。"},
                {"role": "user", "content": "请用100字介绍一下人工智能。"}
            ],
            "max_tokens": 200,
            "temperature": 0.7,
            "stream": False
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

    @task(1)
    def chat_completion_stream(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "user", "content": "解释一下什么是机器学习。"}
            ],
            "max_tokens": 256,
            "temperature": 0.7,
            "stream": True
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
EOF

# 运行负载测试（10个并发用户，持续60秒）
locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

# Web界面测试（可选）
locust -f locustfile.py --web-port 8089
```

## 容器化后的Spring Boot对接方案

### 概述

在企业级应用中，Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后，如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

### RestTemplate对接方案

最直接的对接方式是使用Spring的RestTemplate或WebClient调用大模型API：

```java
package com.example.llm.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(10000);      // 连接超时10秒
        factory.setReadTimeout(120000);        // 读取超时120秒（大模型推理可能较长）
        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);
    }
}
```

### OpenAI兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务，封装一个统一的客户端：

```java
package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.Data;
import lombok.Builder;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

import java.util.List;
import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;
    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {
        this.restTemplate = restTemplate;
        this.baseUrl = baseUrl;
    }

    /**
     * 聊天补全
     */
    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(
            url,
            HttpMethod.POST,
            entity,
            ChatCompletionResponse.class
        );

        return response.getBody();
    }

    /**
     * 流式聊天补全
     */
    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(
                url,
                HttpMethod.POST,
                entity,
                new ParameterizedTypeReference<ServerSentEvent<String>>() {}
            )
            .getBody()
            .stream()
            .map(ServerSentEvent::data);
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionRequest {
        private String model;
        private List<Message> messages;
        private Double temperature;
        @JsonProperty("max_tokens")
        private Integer maxTokens;
        @JsonProperty("top_p")
        private Double topP;
        private Boolean stream;
        private List<String> stop;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Message {
        private String role;
        private String content;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionResponse {
        private String id;
        private String object;
        private Long created;
        private String model;
        private List<Choice> choices;
        private Usage usage;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Choice {
        private Integer index;
        private Message message;
        @JsonProperty("finish_reason")
        private String finishReason;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Usage {
        @JsonProperty("prompt_tokens")
        private Integer promptTokens;
        @JsonProperty("completion_tokens")
        private Integer completionTokens;
        @JsonProperty("total_tokens")
        private Integer totalTokens;
    }
}
```

### Spring Boot服务实现

```java
package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;
import com.example.llm.client.LLMClient.ChatCompletionResponse;
import com.example.llm.client.LLMClient.LLMClient;
import com.example.llm.client.LLMClient.Message;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@RequiredArgsConstructor
@Slf4j
public class LLMService {

    private final LLMClient llmClient;

    /**
     * 同步对话
     */
    public String chat(String userMessage) {
        return chat(userMessage, null);
    }

    /**
     * 同步对话（带系统提示）
     */
    public String chat(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {
          ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {
                return response.getChoices().get(0).getMessage().getContent();
            }
        } catch (Exception e) {
            log.error("LLM调用失败", e);
            throw new RuntimeException("LLM服务调用失败: " + e.getMessage(), e);
        }

        return null;
    }

    /**
     * 流式对话
     */
    public Flux<String> chatStream(String userMessage) {
        return chatStream(userMessage, null);
    }

    /**
     * 流式对话（带系统提示）
     */
    public Flux<String> chatStream(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)
            .filter(data -> !data.startsWith("[DONE]"))
            .map(data -> {
                // 解析SSE数据
                if (data.startsWith("data:")) {
                    data = data.substring(5).trim();
                }
                return data;
            });
    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {
        List<Message> messages = Arrays.asList(
            Message.builder()
                .role("system")
                .content(systemPrompt != null ? systemPrompt : "你是一个有帮助的AI助手。")
                .build(),
            Message.builder()
                .role("user")
                .content(userMessage)
                .build()
        );

        return ChatCompletionRequest.builder()
            .model("Qwen/Qwen-7B-Chat")
            .messages(messages)
            .temperature(0.7)
            .maxTokens(2048)
            .stream(stream)
            .build();
    }

    /**
     * 异步对话
     */
    public CompletableFuture<String> chatAsync(String userMessage) {
        return CompletableFuture.supplyAsync(() -> chat(userMessage))
            .orTimeout(120, TimeUnit.SECONDS);
    }

    /**
     * 批量对话
     */
    public List<String> batchChat(List<String> messages) {
        return messages.parallelStream()
            .map(this::chat)
            .toList();
    }
}
```

### 控制器实现

```java
package com.example.llm.controller;

import com.example.llm.service.LLMService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
@RequiredArgsConstructor
@Slf4j
public class LLMController {

    private final LLMService llmService;

    /**
     * 同步对话接口
     */
    @PostMapping("/chat")
    public Map<String, Object> chat(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();
        result.put("code", 200);
        result.put("message", "success");
        result.put("data", response);

        return result;
    }

    /**
     * 流式对话接口
     */
    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)
            .map(data -> ServerSentEvent.<String>builder()
                .data(data)
                .build()
            )
            .onErrorResume(e -> {
                log.error("流式对话异常", e);
                return Mono.just(ServerSentEvent.<String>builder()
                    .data("[ERROR]" + e.getMessage())
                    .build());
            })
            .concatWith(Mono.just(ServerSentEvent.<String>builder()
                .data("[DONE]")
                .build()));
    }

    /**
     * 异步对话接口
     */
    @PostMapping("/chat/async")
    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");

        log.info("收到异步聊天请求: {}", userMessage);

        // 立即返回，后续异步处理
        llmService.chatAsync(userMessage)
            .thenAccept(response -> {
                log.info("异步任务完成: {}", response);
                // 这里可以添加回调逻辑，如发送WebSocket消息
            })
            .exceptionally(e -> {
                log.error("异步任务失败", e);
                return null;
            });

        Map<String, Object> result = new HashMap<>();
        result.put("code", 202);
        result.put("message", "请求已接受，正在异步处理");

        return result;
    }
}
```

### 熔断与重试配置

```java
package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
@Slf4j
public class ResilienceConfig {

    @Bean
    public CircuitBreaker llmCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    // 失败率阈值50%
            .slowCallRateThreshold(80)                    // 慢调用率阈值80%
            .slowCallDurationThreshold(Duration.ofSeconds(30)) // 慢调用阈值30秒
            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟
            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                        // 滑动窗口大小
            .minimumNumberOfCalls(5)                      // 最小调用次数
            .build();

        return CircuitBreaker.of("llmService", config);
    }

    @Bean
    public Retry llmRetry() {
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)                               // 最大重试次数
            .waitDuration(Duration.ofSeconds(2))          // 重试间隔
            .retryExceptions(Exception.class)             // 重试的异常类型
            .ignoreExceptions(IllegalArgumentException.class) // 忽略的异常
            .build();

        return Retry.of("llmService", config);
    }
}
```

### 应用配置

```yaml
# application.yml
spring:
application:
    name: llm-spring-boot-client

server:
port: 8080

llm:
# 大模型服务地址
base-url: http://localhost:80
# 连接超时（毫秒）
connect-timeout: 10000
# 读取超时（毫秒）
read-timeout: 120000

# 熔断器配置
resilience4j:
circuitbreaker:
    configs:
      default:
        failureRateThreshold: 50
        waitDurationInOpenState: 60s
    instances:
      llmService:
        registerHealthIndicator: true
        slidingWindowSize: 10

# 日志配置
logging:
level:
    com.example.llm: DEBUG
   org.springframework.web.client: DEBUG
```

### Docker化Spring Boot应用

如果需要将Spring Boot应用也容器化部署，可以编写Dockerfile：

```dockerfile
# Multi-stage build
FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

# 复制Maven配置
COPY pom.xml .
COPY src ./src

# 下载依赖（利用缓存）
RUN ./mvnw dependency:go-offline -B

# 构建
RUN ./mvnw package -DskipTests -B

# 运行阶段
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

# 安全配置
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
RUN chown -R appuser:appgroup /app

# 复制jar包
COPY --from=builder /app/target/*.jar app.jar

# 复制模型缓存（可选，共享HuggingFace缓存）
VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]
```

相应的docker-compose配置：

```yaml
version: '3.8'

services:
spring-boot-app:
    build: ./spring-boot-app
    container_name: spring-boot-llm-client
    ports:
      - "8080:8080"
    environment:
      - LLM_BASE_URL=http://qwen7b-gateway:80
      - SPRING_PROFILES_ACTIVE=production
    depends_on:
      - api-gateway
    networks:
      - qwen-net

api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - qwen7b-inference
    networks:
      - qwen-net

qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'
    networks:
      - qwen-net

networks:
qwen-net:
    driver: bridge

volumes:
model_cache:
```

## 总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍，我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾：**

1. **环境配置**：正确安装nvidia-container-toolkit是GPU容器化的前提，CUDA版本兼容性需要仔细规划。

2. **镜像制作**：选择合适的Base Image，合理利用多阶段构建减小镜像体积，通过层缓存优化构建速度。

3. **服务编排**：Docker Compose适合单主机多容器编排，Kubernetes适合大规模生产部署。

4. **性能优化**：合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**：以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**：通过封装统一的API客户端，Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展，大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

▲ 图2

集群模式部署配置

对于大规模推理服务，需要支持多实例部署和负载均衡。以下是一个支持水平扩展的集群部署配置：

### 环境准备

在开始部署前，需要确保环境满足以下要求：

**硬件要求：** NVIDIA GPU（至少16GB显存，推荐RTX 4080或更高），8核以上CPU，32GB系统内存，100GB可用磁盘空间。

**软件要求：** Ubuntu 20.04+或CentOS 8+，NVIDIA Driver 535+，Docker 24+，NVIDIA Container Toolkit。

**环境验证：**

```bash
# 检查NVIDIA驱动
nvidia-smi
# 预期输出应显示GPU型号和驱动版本

# 检查Docker
docker --version
# 预期输出：Docker version 24.x.x

# 检查nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
# 预期输出应显示GPU信息
```

### 编写docker-compose配置

创建项目目录并编写docker-compose.yml：

```bash
mkdir -p ~/llm-deploy/qwen-7b
cd ~/llm-deploy/qwen-7b
```

```yaml
# docker-compose.yml
version: '3.8'

services:
# Qwen-7B推理服务
qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    hostname: qwen7b-inference
    ports:
      - "8000:8000"
    environment:
      # 模型配置
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      # 性能优化
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - MAX_NUM_BATCHED_TOKENS=8192
      - MAX_NUM_SEQUENCES=256
      # 推理优化
      - ENABLE_CHUNKED_PREFILL=true
      - ENABLE_PREFIX_CACHING=true
      # GPU配置
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=all
      # 日志配置
      - LOG_LEVEL=INFO
    volumes:
      # HuggingFace模型缓存（避免重复下载）
      - model_cache:/root/.cache/huggingface
      # vLLM服务日志
      - ./logs:/var/log/vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb' # 共享内存（对vLLM重要）
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - qwen-net

# API网关（Nginx）
api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      qwen7b-inference:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - qwen-net

# Prometheus监控
prometheus:
    image: prom/prometheus:latest
    container_name: qwen7b-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/prometheus_data:/prometheus
    command:
    - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
    networks:
      - qwen-net

# Grafana可视化
grafana:
    image: grafana/grafana:latest
    container_name: qwen7b-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - qwen-net

networks:
qwen-net:
    driver: bridge

volumes:
model_cache:
```

### Nginx配置

```bash
mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs
```

```nginx
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type application/octet-stream;

    # 日志格式
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log warn;

    # 性能优化
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Gzip压缩
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    # 限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {
        server qwen7b-inference:8000;
        keepalive 64;
    }

    server {
        listen 80;
        server_name _;

        # 健康检查
        location /health {
            access_log off;
            return 200 "OK";
            add_header Content-Type text/plain;
        }

        # OpenAI兼容API - Chat Completions
        location /v1/chat/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            # 代理头设置
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # 流式响应（关键配置）
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;

            # SSE必需配置
            proxy_set_header Connection '';
            chunked_transfer_encoding on;

            # 记录响应时间
            add_header X-Response-Time $upstream_response_time;
        }

        # OpenAI兼容API - Completions
        location /v1/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

           proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        # 模型列表
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # vLLM直接接口
        location /v1 {
            proxy_pass http://vllm_backend;
           proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Prometheus指标（仅内网访问）
        location /metrics {
            proxy_pass http://prometheus:9090;
            proxy_set_header Host $host;
            allow 127.0.0.1;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            deny all;
        }
    }
}
```

### Prometheus配置

```yaml
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s

scrape_configs:
- job_name: 'vllm'
    static_configs:
      - targets: ['qwen7b-inference:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
```

### 启动服务

```bash
# 进入项目目录
cd ~/llm-deploy/qwen-7b

# 拉取镜像（提前完成，节省首次启动时间）
docker-compose pull

# 启动服务（后台运行）
docker-compose up -d

# 查看服务状态
docker-compose ps

# 查看日志
docker-compose logs -f qwen7b-inference

# 等待服务就绪后测试
# 首次启动需要下载模型，可能需要几分钟
sleep 120

# 健康检查
curl http://localhost:8000/health

# 模型列表
curl http://localhost:8000/v1/models

# 测试推理
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen-7B-Chat",
    "messages": [
      {"role": "system", "content": "你是一个有帮助的AI助手。"},
      {"role": "user", "content": "请介绍一下Docker容器化技术的优势。"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
}'
```

### 性能测试

```bash
# 安装测试工具
pip install locust

# 创建负载测试脚本 locustfile.py
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between
import json

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://localhost"

    @task(3)
    def chat_completion(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "system", "content": "你是一个有帮助的AI助手。"},
                {"role": "user", "content": "请用100字介绍一下人工智能。"}
            ],
            "max_tokens": 200,
            "temperature": 0.7,
            "stream": False
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

    @task(1)
    def chat_completion_stream(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "user", "content": "解释一下什么是机器学习。"}
            ],
            "max_tokens": 256,
            "temperature": 0.7,
            "stream": True
        }
        with self.client.post(
            "/v1/chat/completions",
          json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
EOF

# 运行负载测试（10个并发用户，持续60秒）
locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

# Web界面测试（可选）
locust -f locustfile.py --web-port 8089
```

## 容器化后的Spring Boot对接方案

### 概述

在企业级应用中，Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后，如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

### RestTemplate对接方案

最直接的对接方式是使用Spring的RestTemplate或WebClient调用大模型API：

```java
package com.example.llm.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(10000);      // 连接超时10秒
        factory.setReadTimeout(120000);         // 读取超时120秒（大模型推理可能较长）
        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);
    }
}
```

### OpenAI兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务，封装一个统一的客户端：

```java
package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.Data;
import lombok.Builder;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

import java.util.List;
import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;
    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {
        this.restTemplate = restTemplate;
        this.baseUrl = baseUrl;
    }

    /**
     * 聊天补全
     */
    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(
            url,
            HttpMethod.POST,
            entity,
            ChatCompletionResponse.class
        );

        return response.getBody();
    }

    /**
     * 流式聊天补全
     */
    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(
                url,
                HttpMethod.POST,
                entity,
                new ParameterizedTypeReference<ServerSentEvent<String>>() {}
            )
            .getBody()
            .stream()
            .map(ServerSentEvent::data);
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionRequest {
        private String model;
        private List<Message> messages;
        private Double temperature;
        @JsonProperty("max_tokens")
        private Integer maxTokens;
        @JsonProperty("top_p")
        private Double topP;
        private Boolean stream;
        private List<String> stop;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Message {
        private String role;
        private String content;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionResponse {
        private String id;
        private String object;
        private Long created;
        private String model;
        private List<Choice> choices;
        private Usage usage;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Choice {
        private Integer index;
        private Message message;
        @JsonProperty("finish_reason")
        private String finishReason;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Usage {
        @JsonProperty("prompt_tokens")
        private Integer promptTokens;
        @JsonProperty("completion_tokens")
        private Integer completionTokens;
        @JsonProperty("total_tokens")
        private Integer totalTokens;
    }
}
```

### Spring Boot服务实现

```java
package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;
import com.example.llm.client.LLMClient.ChatCompletionResponse;
import com.example.llm.client.LLMClient.LLMClient;
import com.example.llm.client.LLMClient.Message;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@RequiredArgsConstructor
@Slf4j
public class LLMService {

    private final LLMClient llmClient;

    /**
     * 同步对话
     */
    public String chat(String userMessage) {
        return chat(userMessage, null);
    }

    /**
     * 同步对话（带系统提示）
     */
    public String chat(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {
            ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {
                return response.getChoices().get(0).getMessage().getContent();
            }
        } catch (Exception e) {
            log.error("LLM调用失败", e);
            throw new RuntimeException("LLM服务调用失败: " + e.getMessage(), e);
        }

        return null;
    }

    /**
     * 流式对话
     */
    public Flux<String> chatStream(String userMessage) {
        return chatStream(userMessage, null);
    }

    /**
     * 流式对话（带系统提示）
     */
    public Flux<String> chatStream(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)
            .filter(data -> !data.startsWith("[DONE]"))
            .map(data -> {
                // 解析SSE数据
                if (data.startsWith("data:")) {
                    data = data.substring(5).trim();
                }
                return data;
            });
    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {
        List<Message> messages = Arrays.asList(
            Message.builder()
                .role("system")
                .content(systemPrompt != null ? systemPrompt : "你是一个有帮助的AI助手。")
                .build(),
            Message.builder()
                .role("user")
                .content(userMessage)
                .build()
        );

        return ChatCompletionRequest.builder()
            .model("Qwen/Qwen-7B-Chat")
            .messages(messages)
            .temperature(0.7)
            .maxTokens(2048)
            .stream(stream)
            .build();
    }

    /**
     * 异步对话
     */
    public CompletableFuture<String> chatAsync(String userMessage) {
        return CompletableFuture.supplyAsync(() -> chat(userMessage))
            .orTimeout(120, TimeUnit.SECONDS);
    }

    /**
     * 批量对话
     */
    public List<String> batchChat(List<String> messages) {
        return messages.parallelStream()
            .map(this::chat)
            .toList();
    }
}
```

### 控制器实现

```java
package com.example.llm.controller;

import com.example.llm.service.LLMService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
@RequiredArgsConstructor
@Slf4j
public class LLMController {

    private final LLMService llmService;

    /**
     * 同步对话接口
     */
    @PostMapping("/chat")
    public Map<String, Object> chat(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();
        result.put("code", 200);
        result.put("message", "success");
        result.put("data", response);

        return result;
    }

    /**
     * 流式对话接口
     */
    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)
            .map(data -> ServerSentEvent.<String>builder()
                .data(data)
                .build()
            )
            .onErrorResume(e -> {
                log.error("流式对话异常", e);
                return Mono.just(ServerSentEvent.<String>builder()
                    .data("[ERROR]" + e.getMessage())
                    .build());
            })
            .concatWith(Mono.just(ServerSentEvent.<String>builder()
                .data("[DONE]")
                .build()));
    }

    /**
     * 异步对话接口
     */
    @PostMapping("/chat/async")
    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");

        log.info("收到异步聊天请求: {}", userMessage);

        // 立即返回，后续异步处理
        llmService.chatAsync(userMessage)
            .thenAccept(response -> {
                log.info("异步任务完成: {}", response);
                // 这里可以添加回调逻辑，如发送WebSocket消息
            })
            .exceptionally(e -> {
                log.error("异步任务失败", e);
                return null;
            });

        Map<String, Object> result = new HashMap<>();
        result.put("code", 202);
        result.put("message", "请求已接受，正在异步处理");

        return result;
    }
}
```

### 熔断与重试配置

```java
package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
@Slf4j
public class ResilienceConfig {

    @Bean
    public CircuitBreaker llmCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    // 失败率阈值50%
            .slowCallRateThreshold(80)                    // 慢调用率阈值80%
            .slowCallDurationThreshold(Duration.ofSeconds(30)) // 慢调用阈值30秒
            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟
            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                        // 滑动窗口大小
            .minimumNumberOfCalls(5)                      // 最小调用次数
            .build();

        return CircuitBreaker.of("llmService", config);
    }

    @Bean
    public Retry llmRetry() {
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)                               // 最大重试次数
            .waitDuration(Duration.ofSeconds(2))          // 重试间隔
            .retryExceptions(Exception.class)             // 重试的异常类型
            .ignoreExceptions(IllegalArgumentException.class) // 忽略的异常
            .build();

        return Retry.of("llmService", config);
    }
}
```

### 应用配置

```yaml
# application.yml
spring:
application:
    name: llm-spring-boot-client

server:
port: 8080

llm:
# 大模型服务地址
base-url: http://localhost:80
# 连接超时（毫秒）
connect-timeout: 10000
# 读取超时（毫秒）
read-timeout: 120000

# 熔断器配置
resilience4j:
circuitbreaker:
    configs:
      default:
        failureRateThreshold: 50
        waitDurationInOpenState: 60s
    instances:
      llmService:
        registerHealthIndicator: true
        slidingWindowSize: 10

# 日志配置
logging:
level:
    com.example.llm: DEBUG
    org.springframework.web.client: DEBUG
```

### Docker化Spring Boot应用

如果需要将Spring Boot应用也容器化部署，可以编写Dockerfile：

```dockerfile
# Multi-stage build
FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

# 复制Maven配置
COPY pom.xml .
COPY src ./src

# 下载依赖（利用缓存）
RUN ./mvnw dependency:go-offline -B

# 构建
RUN ./mvnw package -DskipTests -B

# 运行阶段
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

# 安全配置
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
RUN chown -R appuser:appgroup /app

# 复制jar包
COPY --from=builder /app/target/*.jar app.jar

# 复制模型缓存（可选，共享HuggingFace缓存）
VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]
```

相应的docker-compose配置：

```yaml
version: '3.8'

services:
spring-boot-app:
    build: ./spring-boot-app
    container_name: spring-boot-llm-client
    ports:
      - "8080:8080"
    environment:
      - LLM_BASE_URL=http://qwen7b-gateway:80
      - SPRING_PROFILES_ACTIVE=production
    depends_on:
      - api-gateway
    networks:
      - qwen-net

api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - qwen7b-inference
    networks:
      - qwen-net

qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'
    networks:
      - qwen-net

networks:
qwen-net:
    driver: bridge

volumes:
model_cache:
```

## 总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍，我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾：**

1. **环境配置**：正确安装nvidia-container-toolkit是GPU容器化的前提，CUDA版本兼容性需要仔细规划。

2. **镜像制作**：选择合适的Base Image，合理利用多阶段构建减小镜像体积，通过层缓存优化构建速度。

3. **服务编排**：Docker Compose适合单主机多容器编排，Kubernetes适合大规模生产部署。

4. **性能优化**：合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**：以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**：通过封装统一的API客户端，Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展，大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

▲ 图4

环境准备

在开始部署前，需要确保环境满足以下要求：

**硬件要求：** NVIDIA GPU（至少16GB显存，推荐RTX 4080或更高），8核以上CPU，32GB系统内存，100GB可用磁盘空间。

**软件要求：** Ubuntu 20.04+或CentOS 8+，NVIDIA Driver 535+，Docker 24+，NVIDIA Container Toolkit。

**环境验证：**

```bash

检查NVIDIA驱动

nvidia-smi

预期输出应显示GPU型号和驱动版本

检查Docker

docker --version

预期输出：Docker version 24.x.x

检查nvidia-container-toolkit

docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

预期输出应显示GPU信息

```

编写docker-compose配置

创建项目目录并编写docker-compose.yml：

```bash

mkdir -p ~/llm-deploy/qwen-7b

cd ~/llm-deploy/qwen-7b

```

```yaml

docker-compose.yml

version: '3.8'

services:

# Qwen-7B推理服务

qwen7b-inference:

image: vllm/vllm-openai:latest

container_name: qwen7b-inference

hostname: qwen7b-inference

ports:

- "8000:8000"

environment:

# 模型配置

- MODEL_NAME=Qwen/Qwen-7B-Chat

- TOKENIZER=Qwen/Qwen-7B-Chat

# 性能优化

- GPU_MEMORY_UTILIZATION=0.9

- MAX_MODEL_LEN=8192

- MAX_NUM_BATCHED_TOKENS=8192

- MAX_NUM_SEQUENCES=256

# 推理优化

- ENABLE_CHUNKED_PREFILL=true

- ENABLE_PREFIX_CACHING=true

# GPU配置

- NVIDIA_VISIBLE_DEVICES=all

- CUDA_VISIBLE_DEVICES=all

# 日志配置

- LOG_LEVEL=INFO

volumes:

# HuggingFace模型缓存（避免重复下载）

- model_cache:/root/.cache/huggingface

# vLLM服务日志

- ./logs:/var/log/vllm

deploy:

resources:

reservations:

devices:

- driver: nvidia

capabilities: [gpu]

shm_size: '16gb' # 共享内存（对vLLM重要）

restart: unless-stopped

healthcheck:

test: ["CMD", "curl", "-f", "http://localhost:8000/health"]

interval: 30s

timeout: 10s

retries: 3

start_period: 300s

networks:

- qwen-net

# API网关（Nginx）

api-gateway:

image: nginx:alpine

container_name: qwen7b-gateway

ports:

- "80:80"

- "443:443"

volumes:

- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro

- ./nginx/ssl:/etc/nginx/ssl:ro

depends_on:

qwen7b-inference:

condition: service_healthy

restart: unless-stopped

networks:

- qwen-net

# Prometheus监控

prometheus:

image: prom/prometheus:latest

container_name: qwen7b-prometheus

ports:

- "9090:9090"

volumes:

- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro

- ./prometheus/prometheus_data:/prometheus

command:

- '--config.file=/etc/prometheus/prometheus.yml'

- '--storage.tsdb.path=/prometheus'

- '--web.console.libraries=/etc/prometheus/console_libraries'

- '--web.console.templates=/etc/prometheus/consoles'

restart: unless-stopped

networks:

- qwen-net

# Grafana可视化

grafana:

image: grafana/grafana:latest

container_name: qwen7b-grafana

ports:

- "3000:3000"

environment:

- GF_SECURITY_ADMIN_USER=admin

- GF_SECURITY_ADMIN_PASSWORD=admin123

- GF_USERS_ALLOW_SIGN_UP=false

volumes:

- ./grafana/provisioning:/etc/grafana/provisioning:ro

- ./grafana/dashboards:/var/lib/grafana/dashboards

- ./grafana/data:/var/lib/grafana

depends_on:

- prometheus

restart: unless-stopped

networks:

- qwen-net

networks:

qwen-net:

driver: bridge

volumes:

model_cache:

```

Nginx配置

```bash

mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs

```

```nginx

nginx/nginx.conf

events {

worker_connections 1024;

}

http {

include /etc/nginx/mime.types;

default_type application/octet-stream;

# 日志格式

log_format main '$remote_addr - $remote_user [$time_local] "$request" '

'$status $body_bytes_sent "$http_referer" '

'"$http_user_agent" "$http_x_forwarded_for" '

'rt=$request_time uct="$upstream_connect_time" '

'uht="$upstream_header_time" urt="$upstream_response_time"';

access_log /var/log/nginx/access.log main;

error_log /var/log/nginx/error.log warn;

# 性能优化

sendfile on;

tcp_nopush on;

tcp_nodelay on;

keepalive_timeout 65;

types_hash_max_size 2048;

# Gzip压缩

gzip on;

gzip_vary on;

gzip_proxied any;

gzip_comp_level 6;

gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

# 限流配置

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;

limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

upstream vllm_backend {

server qwen7b-inference:8000;

keepalive 64;

}

server {

listen 80;

server_name _;

# 健康检查

location /health {

access_log off;

return 200 "OK";

add_header Content-Type text/plain;

}

# OpenAI兼容API - Chat Completions

location /v1/chat/completions {

limit_req zone=api_limit burst=50 nodelay;

limit_conn conn_limit 10;

proxy_pass http://vllm_backend;

proxy_http_version 1.1;

# 代理头设置

proxy_set_header Host $host;

proxy_set_header X-Real-IP $remote_addr;

proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

proxy_set_header X-Forwarded-Proto $scheme;

# 流式响应（关键配置）

proxy_buffering off;

proxy_cache off;

proxy_read_timeout 300s;

proxy_connect_timeout 60s;

proxy_send_timeout 300s;

# SSE必需配置

proxy_set_header Connection '';

chunked_transfer_encoding on;

# 记录响应时间

add_header X-Response-Time $upstream_response_time;

}

# OpenAI兼容API - Completions

location /v1/completions {

limit_req zone=api_limit burst=50 nodelay;

limit_conn conn_limit 10;

proxy_pass http://vllm_backend;

proxy_http_version 1.1;

proxy_set_header Host $host;

proxy_set_header X-Real-IP $remote_addr;

proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

proxy_set_header X-Forwarded-Proto $scheme;

proxy_buffering off;

proxy_cache off;

proxy_read_timeout 300s;

proxy_set_header Connection '';

chunked_transfer_encoding on;

}

# 模型列表

location /v1/models {

proxy_pass http://vllm_backend;

proxy_http_version 1.1;

proxy_set_header Host $host;

proxy_set_header X-Real-IP $remote_addr;

}

# vLLM直接接口

location /v1 {

proxy_pass http://vllm_backend;

proxy_http_version 1.1;

proxy_set_header Host $host;

proxy_set_header X-Real-IP $remote_addr;

proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

proxy_set_header X-Forwarded-Proto $scheme;

}

# Prometheus指标（仅内网访问）

location /metrics {

proxy_pass http://prometheus:9090;

proxy_set_header Host $host;

allow 127.0.0.1;

allow 10.0.0.0/8;

allow 172.16.0.0/12;

deny all;

}

```

Prometheus配置

```yaml

prometheus/prometheus.yml

global:

scrape_interval: 15s

evaluation_interval: 15s

scrape_configs:

- job_name: 'vllm'

static_configs:

- targets: ['qwen7b-inference:8000']

metrics_path: '/metrics'

scrape_interval: 10s

```

启动服务

```bash

进入项目目录

cd ~/llm-deploy/qwen-7b

拉取镜像（提前完成，节省首次启动时间）

docker-compose pull

启动服务（后台运行）

docker-compose up -d

查看服务状态

docker-compose ps

查看日志

docker-compose logs -f qwen7b-inference

等待服务就绪后测试

首次启动需要下载模型，可能需要几分钟

sleep 120

健康检查

curl http://localhost:8000/health

模型列表

curl http://localhost:8000/v1/models

测试推理

curl -X POST http://localhost:8000/v1/chat/completions \

-H "Content-Type: application/json" \

-d '{

"model": "Qwen/Qwen-7B-Chat",

"messages": [

{"role": "system", "content": "你是一个有帮助的AI助手。"},

{"role": "user", "content": "请介绍一下Docker容器化技术的优势。"}

"max_tokens": 512,

"temperature": 0.7

```

性能测试

```bash

安装测试工具

pip install locust

创建负载测试脚本 locustfile.py

cat > locustfile.py << 'EOF'

from locust import HttpUser, task, between

import json

class LLMUser(HttpUser):

wait_time = between(1, 3)

host = "http://localhost"

@task(3)

def chat_completion(self):

payload = {

"model": "Qwen/Qwen-7B-Chat",

"messages": [

{"role": "system", "content": "你是一个有帮助的AI助手。"},

{"role": "user", "content": "请用100字介绍一下人工智能。"}

"max_tokens": 200,

"temperature": 0.7,

"stream": False

}

with self.client.post(

"/v1/chat/completions",

json=payload,

catch_response=True

) as response:

if response.status_code == 200:

response.success()

else:

response.failure(f"Failed with status {response.status_code}")

@task(1)

def chat_completion_stream(self):

payload = {

"model": "Qwen/Qwen-7B-Chat",

"messages": [

{"role": "user", "content": "解释一下什么是机器学习。"}

"max_tokens": 256,

"temperature": 0.7,

"stream": True

}

with self.client.post(

"/v1/chat/completions",

json=payload,

catch_response=True

) as response:

if response.status_code == 200:

response.success()

else:

response.failure(f"Failed with status {response.status_code}")

EOF

运行负载测试（10个并发用户，持续60秒）

locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

Web界面测试（可选）

locust -f locustfile.py --web-port 8089

```

容器化后的Spring Boot对接方案

概述

在企业级应用中，Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后，如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

RestTemplate对接方案

最直接的对接方式是使用Spring的RestTemplate或WebClient调用大模型API：

```java

package com.example.llm.config;

import org.springframework.context.annotation.Bean;

import org.springframework.context.annotation.Configuration;

import org.springframework.http.client.SimpleClientHttpRequestFactory;

import org.springframework.web.client.RestTemplate;

@Configuration

public class RestTemplateConfig {

@Bean

public RestTemplate restTemplate() {

SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();

factory.setConnectTimeout(10000); // 连接超时10秒

factory.setReadTimeout(120000); // 读取超时120秒（大模型推理可能较长）

factory.setBufferRequestBody(true);

return new RestTemplate(factory);

}

```

OpenAI兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务，封装一个统一的客户端：

```java

package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;

import lombok.Data;

import lombok.Builder;

import lombok.AllArgsConstructor;

import lombok.NoArgsConstructor;

import java.util.List;

import java.util.Map;

public class LLMClient {

private final RestTemplate restTemplate;

private final String baseUrl;

public LLMClient(RestTemplate restTemplate, String baseUrl) {

this.restTemplate = restTemplate;

this.baseUrl = baseUrl;

}

/**

* 聊天补全

public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {

String url = baseUrl + "/v1/chat/completions";

HttpHeaders headers = new HttpHeaders();

headers.setContentType(MediaType.APPLICATION_JSON);

HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(

url,

HttpMethod.POST,

entity,

ChatCompletionResponse.class

);

return response.getBody();

}

/**

* 流式聊天补全

public Flux<String> chatCompletionStream(ChatCompletionRequest request) {

String url = baseUrl + "/v1/chat/completions";

HttpHeaders headers = new HttpHeaders();

headers.setContentType(MediaType.APPLICATION_JSON);

headers.setAccept(MediaType.TEXT_EVENT_STREAM);

HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

return restTemplate.exchange(

url,

HttpMethod.POST,

entity,

new ParameterizedTypeReference<ServerSentEvent<String>>() {}

)

.getBody()

.stream()

.map(ServerSentEvent::data);

}

@Data

@Builder

@NoArgsConstructor

@AllArgsConstructor

public static class ChatCompletionRequest {

private String model;

private List<Message> messages;

private Double temperature;

@JsonProperty("max_tokens")

private Integer maxTokens;

@JsonProperty("top_p")

private Double topP;

private Boolean stream;

private List<String> stop;

}

@Data

@Builder

@NoArgsConstructor

@AllArgsConstructor

public static class Message {

private String role;

private String content;

}

@Data

@Builder

@NoArgsConstructor

@AllArgsConstructor

public static class ChatCompletionResponse {

private String id;

private String object;

private Long created;

private String model;

private List<Choice> choices;

private Usage usage;

}

@Data

@Builder

@NoArgsConstructor

@AllArgsConstructor

public static class Choice {

private Integer index;

private Message message;

@JsonProperty("finish_reason")

private String finishReason;

}

@Data

@Builder

@NoArgsConstructor

@AllArgsConstructor

public static class Usage {

@JsonProperty("prompt_tokens")

private Integer promptTokens;

@JsonProperty("completion_tokens")

private Integer completionTokens;

@JsonProperty("total_tokens")

private Integer totalTokens;

}

```

Spring Boot服务实现

```java

package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;

import com.example.llm.client.LLMClient.ChatCompletionResponse;

import com.example.llm.client.LLMClient.LLMClient;

import com.example.llm.client.LLMClient.Message;

import lombok.RequiredArgsConstructor;

import lombok.extern.slf4j.Slf4j;

import org.springframework.stereotype.Service;

import java.util.Arrays;

import java.util.List;

import java.util.concurrent.CompletableFuture;

import java.util.concurrent.TimeUnit;

@Service

@RequiredArgsConstructor

@Slf4j

public class LLMService {

private final LLMClient llmClient;

/**

* 同步对话

public String chat(String userMessage) {

return chat(userMessage, null);

}

/**

* 同步对话（带系统提示）

public String chat(String userMessage, String systemPrompt) {

ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

try {

ChatCompletionResponse response = llmClient.chatCompletion(request);

if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {

return response.getChoices().get(0).getMessage().getContent();

}

} catch (Exception e) {

log.error("LLM调用失败", e);

throw new RuntimeException("LLM服务调用失败: " + e.getMessage(), e);

}

return null;

}

/**

* 流式对话

public Flux<String> chatStream(String userMessage) {

return chatStream(userMessage, null);

}

/**

* 流式对话（带系统提示）

public Flux<String> chatStream(String userMessage, String systemPrompt) {

ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

return llmClient.chatCompletionStream(request)

.filter(data -> !data.startsWith("[DONE]"))

.map(data -> {

// 解析SSE数据

if (data.startsWith("data:")) {

data = data.substring(5).trim();

}

return data;

});

}

private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {

List<Message> messages = Arrays.asList(

Message.builder()

.role("system")

.content(systemPrompt != null ? systemPrompt : "你是一个有帮助的AI助手。")

.build(),

Message.builder()

.role("user")

.content(userMessage)

.build()

);

return ChatCompletionRequest.builder()

.model("Qwen/Qwen-7B-Chat")

.messages(messages)

.temperature(0.7)

.maxTokens(2048)

.stream(stream)

.build();

}

/**

* 异步对话

public CompletableFuture<String> chatAsync(String userMessage) {

return CompletableFuture.supplyAsync(() -> chat(userMessage))

.orTimeout(120, TimeUnit.SECONDS);

}

/**

* 批量对话

public List<String> batchChat(List<String> messages) {

return messages.parallelStream()

.map(this::chat)

.toList();

}

```

控制器实现

```java

package com.example.llm.controller;

import com.example.llm.service.LLMService;

import lombok.RequiredArgsConstructor;

import lombok.extern.slf4j.Slf4j;

import org.springframework.http.MediaType;

import org.springframework.http.codec.ServerSentEvent;

import org.springframework.web.bind.annotation.*;

import reactor.core.publisher.Flux;

import reactor.core.publisher.Mono;

import java.util.HashMap;

import java.util.Map;

@RestController

@RequestMapping("/api/llm")

@RequiredArgsConstructor

@Slf4j

public class LLMController {

private final LLMService llmService;

/**

* 同步对话接口

@PostMapping("/chat")

public Map<String, Object> chat(@RequestBody Map<String, String> request) {

String userMessage = request.get("message");

String systemPrompt = request.get("systemPrompt");

log.info("收到聊天请求: {}", userMessage);

String response = llmService.chat(userMessage, systemPrompt);

Map<String, Object> result = new HashMap<>();

result.put("code", 200);

result.put("message", "success");

result.put("data", response);

return result;

}

/**

* 流式对话接口

@PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)

public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {

String userMessage = request.get("message");

String systemPrompt = request.get("systemPrompt");

log.info("收到流式聊天请求: {}", userMessage);

return llmService.chatStream(userMessage, systemPrompt)

.map(data -> ServerSentEvent.<String>builder()

.data(data)

.build()

)

.onErrorResume(e -> {

log.error("流式对话异常", e);

return Mono.just(ServerSentEvent.<String>builder()

.data("[ERROR]" + e.getMessage())

.build());

})

.concatWith(Mono.just(ServerSentEvent.<String>builder()

.data("[DONE]")

.build()));

}

/**

* 异步对话接口

@PostMapping("/chat/async")

public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {

String userMessage = request.get("message");

log.info("收到异步聊天请求: {}", userMessage);

// 立即返回，后续异步处理

llmService.chatAsync(userMessage)

.thenAccept(response -> {

log.info("异步任务完成: {}", response);

// 这里可以添加回调逻辑，如发送WebSocket消息

})

.exceptionally(e -> {

log.error("异步任务失败", e);

return null;

});

Map<String, Object> result = new HashMap<>();

result.put("code", 202);

result.put("message", "请求已接受，正在异步处理");

return result;

}

```

熔断与重试配置

```java

package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import io.github.resilience4j.retry.Retry;

import io.github.resilience4j.retry.RetryConfig;

import lombok.extern.slf4j.Slf4j;

import org.springframework.context.annotation.Bean;

import org.springframework.context.annotation.Configuration;

import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration

@Slf4j

public class ResilienceConfig {

@Bean

public CircuitBreaker llmCircuitBreaker() {

CircuitBreakerConfig config = CircuitBreakerConfig.custom()

.failureRateThreshold(50) // 失败率阈值50%

.slowCallRateThreshold(80) // 慢调用率阈值80%

.slowCallDurationThreshold(Duration.ofSeconds(30)) // 慢调用阈值30秒

.waitDurationInOpenState(Duration.ofMinutes(1)) // 熔断器开放时间1分钟

.permittedNumberOfCallsInHalfOpenState(3) // 半开状态允许调用次数

.slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)

.slidingWindowSize(10) // 滑动窗口大小

.minimumNumberOfCalls(5) // 最小调用次数

.build();

return CircuitBreaker.of("llmService", config);

}

@Bean

public Retry llmRetry() {

RetryConfig config = RetryConfig.custom()

.maxAttempts(3) // 最大重试次数

.waitDuration(Duration.ofSeconds(2)) // 重试间隔

.retryExceptions(Exception.class) // 重试的异常类型

.ignoreExceptions(IllegalArgumentException.class) // 忽略的异常

.build();

return Retry.of("llmService", config);

}

```

应用配置

```yaml

application.yml

spring:

application:

server:

port: 8080

llm:

# 大模型服务地址

base-url: http://localhost:80

# 连接超时（毫秒）

connect-timeout: 10000

# 读取超时（毫秒）

read-timeout: 120000

熔断器配置

resilience4j:

circuitbreaker:

configs:

default:

failureRateThreshold: 50

waitDurationInOpenState: 60s

instances:

llmService:

registerHealthIndicator: true

slidingWindowSize: 10

日志配置

logging:

level:

com.example.llm: DEBUG

org.springframework.web.client: DEBUG

```

Docker化Spring Boot应用

如果需要将Spring Boot应用也容器化部署，可以编写Dockerfile：

```dockerfile

Multi-stage build

FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

复制Maven配置

COPY pom.xml .

COPY src ./src

下载依赖（利用缓存）

RUN ./mvnw dependency:go-offline -B

构建

RUN ./mvnw package -DskipTests -B

运行阶段

FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

安全配置

RUN addgroup -S appgroup && adduser -S appuser -G appgroup

RUN chown -R appuser:appgroup /app

复制jar包

COPY --from=builder /app/target/*.jar app.jar

复制模型缓存（可选，共享HuggingFace缓存）

VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]

```

相应的docker-compose配置：

```yaml

version: '3.8'

services:

spring-boot-app:

build: ./spring-boot-app

container_name: spring-boot-llm-client

ports:

- "8080:8080"

environment:

- LLM_BASE_URL=http://qwen7b-gateway:80

- SPRING_PROFILES_ACTIVE=production

depends_on:

- api-gateway

networks:

- qwen-net

api-gateway:

image: nginx:alpine

container_name: qwen7b-gateway

ports:

- "80:80"

volumes:

- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro

depends_on:

- qwen7b-inference

networks:

- qwen-net

qwen7b-inference:

image: vllm/vllm-openai:latest

container_name: qwen7b-inference

environment:

- MODEL_NAME=Qwen/Qwen-7B-Chat

- GPU_MEMORY_UTILIZATION=0.9

volumes:

- model_cache:/root/.cache/huggingface

deploy:

resources:

reservations:

devices:

- driver: nvidia

capabilities: [gpu]

shm_size: '16gb'

networks:

- qwen-net

networks:

qwen-net:

driver: bridge

volumes:

model_cache:

```

总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍，我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾：**

1. **环境配置**：正确安装nvidia-container-toolkit是GPU容器化的前提，CUDA版本兼容性需要仔细规划。

2. **镜像制作**：选择合适的Base Image，合理利用多阶段构建减小镜像体积，通过层缓存优化构建速度。

3. **服务编排**：Docker Compose适合单主机多容器编排，Kubernetes适合大规模生产部署。

4. **性能优化**：合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**：以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**：通过封装统一的API客户端，Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展，大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

AtomGit开源社区

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念，把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起，为开发者提供从开发、训练到部署的一站式体验。

更多推荐

鸿蒙 Flutter 实战：video_compress 3.1.4 适配 3.27-ohos 全流程

AtomGit开源社区

Java 内存模型（JMM）深度解析

在一个线程内，书写在前面的操作先行发生于书写在后面的操作。虽然 CPU 会为了性能进行指令重排，但 JMM 承诺"单线程执行结果的正确性"（即 as-if-serial 语义）。

AtomGit开源社区

DALL-E 系列模型详解

DALL-E 是 OpenAI 开发的一系列文本到图像生成模型，能够根据自然语言描述生成高质量、多样化的图像。DALL-E 的名字来源于：发展历程版本发布时间核心技术分辨率主要突破DALL-E 12021.1dVAE + Transformer256×256首次大规模文本到图像生成DALL-E 22022.4CLIP + Diffusion1024×1024照片级真实感DALL-E 32023.9

AtomGit开源社区

所有评论(0)

查看更多评论

布吉岛的石头

@fuleigang

已为社区贡献47条内容

Java 程序员第 30 阶段：Docker容器化部署大模型：标准化环境一键上线

布吉岛的石头

Docker容器化大模型的优势

NVIDIA Docker配置与GPU支持

大模型Docker镜像制作

检查NVIDIA驱动

预期输出应显示GPU型号和驱动版本

检查Docker

预期输出：Docker version 24.x.x

检查nvidia-container-toolkit

预期输出应显示GPU信息

docker-compose.yml

nginx/nginx.conf

prometheus/prometheus.yml

进入项目目录

拉取镜像（提前完成，节省首次启动时间）

启动服务（后台运行）

查看服务状态

查看日志

等待服务就绪后测试

首次启动需要下载模型，可能需要几分钟

健康检查

模型列表

测试推理

安装测试工具

创建负载测试脚本 locustfile.py

运行负载测试（10个并发用户，持续60秒）

Web界面测试（可选）

容器化后的Spring Boot对接方案

application.yml

熔断器配置

日志配置

Multi-stage build

复制Maven配置

下载依赖（利用缓存）

构建

运行阶段

安全配置

复制jar包

复制模型缓存（可选，共享HuggingFace缓存）

总结

所有评论(0)

温馨提示：您尚未绑定手机号

布吉岛的石头