作者:洛水石  |  适用:Java大模型面试

在人工智能飞速发展的今天,大语言模型(Large Language Model,LLM)已经成为企业智能化转型的核心基础设施。然而,大模型的部署却面临诸多挑战:复杂的依赖环境、CUDA版本兼容性、GPU资源调度、多模型协作等,每一个环节都可能成为生产落地的拦路虎。Docker容器化技术的出现,为这些问题提供了优雅的解决方案。

本文将深入探讨如何使用Docker容器化技术实现大模型的标准化部署,从环境配置、镜像制作、服务编排到生产级优化,形成一套完整的大模型容器化部署实践指南。

Docker容器化大模型的优势

环境标准化与一致性

传统的机器学习部署往往面临"在我机器上能跑"(It Works on My Machine)的困境。不同服务器的操作系统版本、CUDA驱动、Python库版本等细微差异,都可能导致模型推理行为不一致。Docker容器通过将应用及其所有依赖打包成镜像,确保了从开发到生产的完整环境一致性。

当企业需要部署多个大模型服务时,容器化优势更加明显。每个模型服务可以运行在独立的容器中,拥有自己的CUDA版本和依赖环境,彻底消除模型之间的依赖冲突。例如,可以使用一个容器运行基于CUDA 11.8的LLaMA-2-13B,另一个容器运行基于CUDA 12.1的Qwen-7B,它们之间完全隔离,互不影响。

资源隔离与高效利用

大模型推理是资源密集型任务,对GPU显存、内存、CPU都有较高要求。Docker容器提供了精细的资源隔离机制,可以为每个容器设置GPU数量、内存上限、CPU配额等参数。这种隔离不仅保证了服务质量,还提高了集群资源的整体利用率。

在共享GPU集群场景下,容器化允许将一张GPU卡分配给多个容器使用,每个容器只占用其中一部分显存。通过`nvidia-container-toolkit`提供的GPU虚拟化能力,可以实现GPU时间的时分复用,让小模型推理任务也能高效利用昂贵的GPU资源。

快速部署与弹性伸缩

Docker镜像是自包含的部署单元,只需一次构建,即可在任何支持Docker的环境中运行。这大大简化了大模型的分发和部署流程。对于需要频繁更新模型版本的场景,容器化可以实现蓝绿部署或金丝雀发布,将版本切换的停机时间压缩到秒级。

结合Kubernetes等容器编排平台,大模型服务可以根据负载情况自动伸缩。当请求量增加时,自动启动更多推理实例;当负载降低时,回收多余实例。这种弹性伸缩能力对于处理突发流量尤为重要,比如电商大促或热点事件引发的推理请求峰值。

简化运维与故障恢复

容器化架构使得大模型服务的运维工作变得简单可控。每个容器的文件系统、进程空间、网络栈都是独立的,容器内的故障不会影响宿主机和其他容器。当服务出现问题时,可以快速定位到具体容器,并利用容器快照进行故障恢复或问题复现。

Docker的日志驱动机制将容器日志统一收集到标准输出,配合ELK(Elasticsearch、Logstash、Kibana)或Loki等日志系统,可以实现全量日志的聚合查询和实时分析。这种集中化的日志管理对于排查生产环境问题至关重要。

1

NVIDIA Docker配置与GPU支持

nvidia-container-toolkit工作原理

在Docker容器中使用GPU,需要依赖nvidia-container-toolkit(原nvidia-docker2)提供的容器运行时支持。其核心原理是在容器启动时,将宿主机的NVIDIA驱动和GPU设备映射到容器内部,使容器内的CUDA程序能够直接访问物理GPU。

nvidia-container-toolkit通过Linux的设备映射(device mapper)和挂载命名空间(mount namespace)机制实现GPU穿透。当容器启动时,nvidia-container-runtime会拦截容器创建请求,将`/dev/nvidia*`设备节点和NVIDIA驱动库挂载到容器内部。这个过程对用户透明,开发者只需在`docker run`命令中指定`--gpus`参数即可。

技术实现上,nvidia-container-toolkit包含三个核心组件:`nvidia-container-runtime`负责与Docker通信,执行GPU挂载操作;`libnvidia-container`提供了GPU挂载的具体实现逻辑;`nvidia-container-toolkit`是命令行工具,用于配置容器运行时行为。这三者的协作使得GPU容器化成为可能。

CUDA版本兼容性矩阵

选择正确的CUDA版本是构建大模型Docker镜像的关键决策。CUDA版本的选择需要考虑多个因素:GPU驱动版本、深度学习框架兼容性、模型运行时要求等。以下是选择CUDA版本时需要遵循的核心原则:

**第一,容器内CUDA版本必须小于或等于主机驱动支持的最高CUDA版本。** 可以通过`nvidia-smi`命令查看主机驱动的CUDA支持版本。如果主机驱动为NVIDIA 535版本,那么容器内最高只能使用CUDA 12.1(12.x系列的最新稳定版)。强行使用更高版本的CUDA将导致运行时错误。

**第二,深度学习框架对CUDA版本有明确要求。** 以PyTorch为例,PyTorch 2.0+推荐使用CUDA 11.8或CUDA 12.1;TensorFlow 2.13+则推荐CUDA 12.1。使用不兼容的版本组合可能导致模型加载失败或推理结果异常。

**第三,cuDNN版本需要与CUDA版本匹配。** cuDNN(CUDA Deep Neural Network Library)是CUDA的深度学习加速库,其版本号遵循`major.minor.patch`格式。安装cuDNN时需要确认其与CUDA版本的兼容性,可以通过NVIDIA官方文档查询版本对应关系。

在实际项目中,推荐使用NVIDIA官方提供的CUDA基础镜像,如`nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04`,这类镜像已经过严格测试,包含了经过验证的CUDA和cuDNN版本组合。

GPU资源配额配置

在生产环境中,通常需要在单个GPU上运行多个容器,或者限制单个容器可以使用的GPU资源。Docker提供了多种GPU资源配额配置方式。

**GPU数量限制**通过`--gpus`参数实现,可以指定容器可以使用的GPU数量。例如,`--gpus '"device=0"'`限制容器只能使用GPU 0,`--gpus '"device=0,1"'`则允许容器使用GPU 0和1。对于需要GPU数量感知的模型推理,可以使用`--gpus all`授予容器使用所有可用GPU。

**显存限制**需要通过环境变量配置,不同的推理引擎有不同的显存配置方式。vLLM使用`NVIDIA_VISIBLE_DEVICES`环境变量配合`gpu_memory_utilization`参数;TensorRT则使用`TF_FORCE_GPU_ALLOW_GROWTH=true`允许显存动态增长。显尔的显存预留可以通过`NVIDIA_DRIVER_CAPABILITIES`设置容器的GPU功能集。

**GPU算力限制**在某些场景下可能需要限制容器使用GPU的算力比例。这可以通过NVIDIA的GPU分区工具(mig-manager)实现,将GPU划分为多个GPU实例(GPI),每个实例拥有独立的算力和显存资源。不过,GPU分区会引入额外的开销,对于延迟敏感的推理任务通常不推荐使用。

验证GPU容器配置

配置完成后,需要验证GPU是否可以正常访问。可以通过以下命令检查容器内的GPU环境:


## 大模型Docker镜像制作

### Base Image选择策略

选择合适的基础镜像是构建大模型Docker镜像的第一步。NVIDIA官方维护的CUDA基础镜像是最佳选择,它们经过严格测试,包含了经过验证的驱动和库组合。

**Ubuntu基础镜像 vs CentOS基础镜像:** Ubuntu是目前主流的Docker镜像操作系统,拥有更广泛的软件包支持和更快的安全更新。对于需要频繁更新安全补丁的生产环境,Ubuntu是更好的选择。CentOS则以其稳定性著称,适合对系统一致性要求极高的企业级场景。

**-devel镜像 vs runtime镜像:** NVIDIA CUDA镜像分为`devel``runtime`两个版本。`devel`镜像包含完整的编译工具链(gccnvcc等),体积较大(约十几GB),适合需要编译CUDA程序或构建依赖CUDA的程序场景。`runtime`镜像只包含运行时库,体积较小(约5GB),适合直接运行已编译好的CUDA程序。对于大模型推理服务,推荐使用`runtime`镜像以减小镜像体积和攻击面。

** alpine基础镜像的局限性:** 部分大模型官方镜像使用alpine作为基础系统以减小体积。但alpinelibcmusl而非glibc,可能导致某些Python扩展(如numpyscipy的底层C库)编译或运行问题。如果遇到奇怪的动态链接错误,可能需要切换到UbuntuCentOS基础镜像。

### CUDA版本与依赖环境配置

构建大模型Docker镜像时,需要在Dockerfile中显式指定CUDA版本和必要的依赖环境。以下是一个典型的Dockerfile结构:

```dockerfile
#
阶段一:构建阶段
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 AS builder

#
设置环境变量
ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_VERSION=12.1.1
ENV PYTHON_VERSION=3.10

#
安装构建工具和Python
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*

#
设置Python符号链接
RUN ln -sf /usr/bin/python3.10 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip

#
安装PyTorch(支持CUDA 12.1
RUN pip install --no-cache-dir \
    torch==2.1.0 \
    torchvision==0.16.0 \
    torchaudio==2.1.0 \
    --index-url https://download.pytorch.org/whl/cu121

#
安装推理框架
RUN pip install --no-cache-dir \
    vllm==0.2.0 \
    transformers==4.35.0 \
    accelerate==0.25.0

#
阶段二:运行时阶段
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

#
复制Python环境
COPY --from=builder /usr/bin/python /usr/bin/python
COPY --from=builder /usr/bin/pip /usr/bin/pip
COPY --from=builder /usr/lib/python3.10 /usr/lib/python3.10

#
复制已安装的Python
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages

#
安装运行时依赖
RUN apt-get update && apt-get install -y \
    libgomp1 \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

#
设置工作目录
WORKDIR /app

#
复制应用代码
COPY app.py /app/
COPY requirements.txt /app/

#
暴露端口
EXPOSE 8000

#
启动命令
CMD ["python", "app.py"]
```

###
镜像构建优化技巧

大模型Docker镜像通常体积较大(动辄十几GB),优化构建过程可以显著缩短构建时间和减少镜像体积。

**多阶段构建**是减小镜像体积的有效方法。如上例所示,第一阶段使用`devel`镜像安装编译工具和依赖,生成最终的可执行文件;第二阶段使用`runtime`镜像,只复制运行所需的文件。Python`pip wheel`安装方式可以将已编译的包缓存起来,避免在运行时重新编译。

**层缓存优化**利用Docker的层缓存机制加速构建。将不经常变化的指令(如系统包安装、Python环境配置)放在Dockerfile前面,将频繁变化的指令(如应用代码复制)放在后面。这样,当应用代码变化时,可以复用前面不变的层,只重新构建后面的层。

**合并RUN指令**可以减少Docker镜像的层数。将多个`RUN`指令合并为一个,可以减小镜像体积并加快容器启动速度。但需要注意,合并后的RUN指令不宜过长,否则会降低Dockerfile的可读性和可维护性。

**使用.dockerignore文件**排除不需要的文件进入构建上下文,避免将大文件(如训练数据、缓存的模型权重)复制到镜像中,导致镜像体积膨胀。

### 构建参数与环境变量

为了提高Dockerfile的灵活性,可以使用`ARG``ENV`指令定义构建参数和环境变量:

```dockerfile
#
构建参数(可以在docker build时传入)
ARG CUDA_VERSION=12.1.1
ARG PYTHON_VERSION=3.10
ARG MODEL_NAME=Qwen-7B-Chat

#
环境变量(容器运行时可见)
ENV MODEL_NAME=${MODEL_NAME}
ENV CUDA_VISIBLE_DEVICES=0
ENV TRANSFORMERS_CACHE=/models/.cache
ENV HF_HOME=/models/.cache

#
使用构建参数动态选择基础镜像
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-runtime-ubuntu22.04
```


构建命令示例:

```bash
docker build \
    --build-arg CUDA_VERSION=12.1.1 \
    --build-arg MODEL_NAME=Qwen-7B-Chat \
    -t my-llm-service:latest \
    -f Dockerfile .

```

##
主流大模型Docker镜像

### Ollama官方镜像

Ollama是近年来备受关注的大模型推理框架,以其极简的使用方式和出色的本地部署体验著称。Ollama官方提供了预构建的Docker镜像,极大地简化了本地部署流程。

**ollama/ollama**镜像是Ollama的核心运行时,包含了Ollama服务程序和完整的模型推理能力。使用该镜像启动服务非常简单:

```bash
docker run -d \
    --name ollama \
    -p 11434:11434 \
    --gpus all \
    ollama/ollama:latest
```


服务启动后,可以通过REST API访问Ollama的各种功能。Ollama支持的模型数量众多,包括LLaMA-2MistralQwenYi等主流开源大模型。模型管理通过Ollama命令行工具完成:

```bash
#
进入容器
docker exec -it ollama ollama

#
拉取模型
pull qwen:7b
pull llama2:13b
pull mistral:7b

#
运行推理
run qwen:7b "请介绍一下Docker容器化技术"
```

**Ollama
的优势**在于其模型管理的便捷性。Ollama会自动处理模型的下载、缓存和版本管理,避免了手动下载模型权重和配置文件的繁琐。Ollama还支持Modelfile,允许用户自定义模型的行为参数,如系统提示、温度系数、上下文窗口大小等。

**Ollama的局限**也很明显:Ollama主要面向个人用户和本地部署场景,其生产级特性(如高可用、水平扩展、详细的监控指标)相对较弱。对于需要构建高可用推理服务的企业用户,可能需要结合其他方案使用Ollama

### HuggingFace TGI镜像

HuggingFace Text Generation InferenceTGI)是HuggingFace官方推出的推理服务框架,专为生产级大模型部署设计。TGI镜像提供了完整的推理服务能力,包括批量推理、流式输出、张量并行等高级特性。

**ghcr.io/huggingface/text-generation-inference**TGI的官方镜像仓库。使用TGI部署Qwen-7B模型的示例:

```bash
docker run -d \
    --name tgi \
    --gpus all \
    -p 8080:80 \
    -e MODEL_ID=Qwen/Qwen-7B-Chat \
    -e QUANTIZE=bitsandbytes \
    -e MAX_INPUT_LENGTH=2048 \
    -e MAX_TOTAL_TOKENS=4096 \
    ghcr.io/huggingface/text-generation-inference:latest
```

**TGI
的核心特性**包括:

**批量推理优化**允许同时处理多个推理请求,通过动态批处理(Dynamic Batching)提高GPU利用率。TGI会根据请求情况自动将多个小批量合并成大批量处理,减少GPU空闲时间。

**张量并行**支持将大模型分布到多个GPU上运行。对于单卡无法容纳的大模型(如70B参数规模),张量并行可以将模型切分到多张GPU,突破单卡显存限制。配置非常简单,只需设置`--num_shard`参数即可。

**流式输出**通过Server-Sent EventsSSE)协议实现 token by token的流式返回,显著降低首token延迟,提升用户体验。TGI的流式输出实现经过优化,能够在高并发场景下保持稳定。

**Flash Attention**TGI默认启用的注意力机制优化。Flash Attention通过IO感知的注意力计算,显著降低显存占用同时提升计算效率,对于长上下文推理尤为重要。

### vLLM官方镜像

vLLMUC Berkeley开发的PagedAttention推理引擎,以其高吞吐量和低延迟著称。vLLM采用分页注意力机制管理KV缓存,相同硬件条件下,吞吐量可达HuggingFace实现的好几倍。

**vllm/vllm-openai**镜像是vLLM官方提供的兼容OpenAI API的推理服务镜像。使用该镜像启动vLLM服务:

```bash
docker run -d \
    --name vllm \
    --gpus all \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e MODEL_NAME=Qwen/Qwen-7B-Chat \
    -e TOKENIZER=Qwen/Qwen-7B-Chat \
    vllm/vllm-openai:latest
```

**vLLM
的性能优势**来源于其创新的PagedAttention技术。传统推理过程中,KV缓存需要连续分配显存,但生成的token数量在请求处理完成前是未知的,导致显存碎片化严重。vLLMKV缓存组织成固定大小的"",像操作虚拟内存一样管理注意力缓存,显著提高显存利用率。

**Continuous Batching**vLLM的另一项关键技术。传统批处理需要等待一个批次中所有请求完成才能开始下一个批次,造成GPU空闲。Continuous Batching允许完成的请求立即退出,新请求即时加入,实现更高的GPU利用率。

**vLLM的适用场景**包括:对延迟敏感的在线推理服务、需要处理大量并发请求的场景、显存受限但需要运行更大模型的场景。vLLM目前主要支持NVIDIA GPU,对于AMD GPUCPU推理的支持仍在积极开发中。

### 其他值得关注的镜像

**lmz/cuda-lm-headers**镜像是国内开发者维护的轻量级CUDA开发环境,包含常用的深度学习开发工具和头文件,适合需要自定义CUDA kernel的场景。

**databricks/dolly**镜像是Databricks提供的Dolly大模型推理镜像,包含了优化过的推理服务和模型管理能力,适合企业级部署。

**deepinfra/deepstack**镜像是另一个值得关注的推理服务镜像,支持多种开源大模型,并提供了完善的OpenAI兼容API

## Docker Compose编排多服务

### 模型服务与API网关编排

在大模型生产环境中,通常需要将推理服务封装成RESTful API对外提供服务。结合API网关可以实现负载均衡、流量控制、安全认证等功能。以下是一个完整的模型服务与API网关编排示例:

```yaml
# docker-compose.yml
version: '3.8'

services:
  # vLLM
推理服务
  vllm-backend:
    image: vllm/vllm-openai:latest
    container_name: vllm-backend
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - TENSOR_PARALLEL_SIZE=1
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s

  # Nginx
反向代理
  api-gateway:
    image: nginx:alpine
    container_name: api-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
    depends_on:
      vllm-backend:
        condition: service_healthy
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:80/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Prometheus
监控
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - llm-net

networks:
  llm-net:
    driver: bridge

volumes:
  model_cache:
  prometheus_data:
```


对应的Nginx配置(nginx.conf):

```nginx
events {
    worker_connections 1024;
}

http {
    upstream vllm_backend {
        server vllm-backend:8000;
        keepalive 64;
    }

    #
限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        listen 80;
        server_name _;

        location /health {
            return 200 'OK';
            add_header Content-Type text/plain;
        }

        location /v1/chat/completions {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            #
流式响应配置
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 75s;

            #
处理Server-Sent Events
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        location /v1/completions {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
           
proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
        }

        #
模型列表接口
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
        }

        #
静态资源(前端监控面板)
        location /dashboard {
            alias /usr/share/nginx/html;
            index index.html;
        }
    }
}
```

###
模型服务与Redis缓存编排

在大模型推理场景中,合理使用缓存可以显著降低响应延迟和减少GPU计算压力。Redis缓存主要用于存储对话历史、推理结果、模型元数据等信息。

```yaml
# docker-compose-with-cache.yml
version: '3.8'

services:
  vllm-backend:
    image: vllm/vllm-openai:latest
    container_name: vllm-backend
    ports:
      - "8000:8000"
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
      - KV_CACHE_FP8_QUANTIZED=true
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    depends_on:
      redis-cache:
        condition: service_healthy

  # Redis
缓存服务
  redis-cache:
    image: redis:7-alpine
    container_name: redis-cache
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

  #
应用服务(处理缓存逻辑)
  app-service:
    image: python:3.10-slim
    container_name: app-service
    ports:
      - "8080:8080"
    environment:
      - REDIS_HOST=redis-cache
      - REDIS_PORT=6379
      - VLLM_HOST=vllm-backend
      - VLLM_PORT=8000
    volumes:
      - ./app:/app
    working_dir: /app
    command: python app.py
    networks:
      - llm-net
    depends_on:
      redis-cache:
        condition: service_healthy

networks:
  llm-net:
    driver: bridge

volumes:
  model_cache:
  redis_data:
```


应用服务的缓存逻辑示例(app.py):

```python
import os
import redis
import hashlib
import json
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

redis_client = redis.Redis(
    host=os.getenv('REDIS_HOST', 'localhost'),
    port=int(os.getenv('REDIS_PORT', 6379)),
    decode_responses=True
)

VLLM_URL = f"http://{os.getenv('VLLM_HOST', 'localhost')}:{os.getenv('VLLM_PORT', 8000)}/v1/chat/completions"

def generate_cache_key(messages, temperature, max_tokens):
    """
生成缓存键"""
    cache_data = json.dumps({
        'messages': messages,
        'temperature': temperature,
        'max_tokens': max_tokens
    }, sort_keys=True)
    return f"llm:response:{hashlib.sha256(cache_data.encode()).hexdigest()}"

def check_cache(cache_key):
    """
检查缓存"""
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    return None

def save_to_cache(cache_key, response_data, ttl=3600):
    """
保存到缓存"""
    redis_client.setex(cache_key, ttl, json.dumps(response_data))

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    messages = data.get('messages', [])
    temperature = data.get('temperature', 0.7)
    max_tokens = data.get('max_tokens', 2048)
    use_cache = data.get('use_cache', True)

    #
生成缓存键
    cache_key = generate_cache_key(messages, temperature, max_tokens)

    #
尝试从缓存获取
    if use_cache:
        cached_response = check_cache(cache_key)
        if cached_response:
            cached_response['cached'] = True
            return jsonify(cached_response)

    #
调用vLLM服务
    payload = {
        'model': 'Qwen/Qwen-7B-Chat',
        'messages': messages,
        'temperature': temperature,
        'max_tokens': max_tokens,
        'stream': False
    }

    try:
        response = requests.post(VLLM_URL, json=payload, timeout=120)
        response.raise_for_status()
        result = response.json()

        #
保存到缓存
        if use_cache:
            save_to_cache(cache_key, result)

        result['cached'] = False
        return jsonify(result)
    except requests.exceptions.RequestException as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```


2
docker_compose.png

###
集群模式部署配置

对于大规模推理服务,需要支持多实例部署和负载均衡。以下是一个支持水平扩展的集群部署配置:

```yaml
# docker-compose-cluster.yml
version: '3.8'

services:
  # vLLM
推理实例1
  vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-1
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

  # vLLM
推理实例2
  vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-2
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8002:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

  # HAProxy
负载均衡
  haproxy:
    image: haproxy:alpine
    container_name: haproxy
    ports:
      - "80:8080"
      - "8404:8404"  # HAProxy stats page
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - vllm-1
      - vllm-2
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "haproxy", "-c", "-f", "/usr/local/etc/haproxy/haproxy.cfg"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Redis Session
存储
  redis:
    image: redis:7-alpine
    container_name: redis-session
    command: redis-server --appendonly yes --maxmemory 4gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - llm-net
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

networks:
  llm-net:
    driver: bridge

volumes:
  model_cache:
  redis_data:
```

HAProxy
配置(haproxy.cfg):

```cfg
global
    log stdout format raw local0
    maxconn 4096
    stats socket /var/run/hapee-listen.sock mode 600 level admin
    stats timeout 2m

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  http-server-close
    option  forwardfor except 127.0.0.0/8
    timeout connect 30s
    timeout client  180s
    timeout server  300s
    timeout tunnel  600s
    timeout check   5s

#
统计页面
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /
    stats refresh 30s

#
推理服务负载均衡
frontend llm_frontend
    bind *:8080
    default_backend llm_backend

backend llm_backend
    mode http
    balance roundrobin
    option tcp-check
    tcp-check connect
    tcp-check expect string X-Request-Id

    # vLLM
实例
    server vllm1 vllm-1:8000 check inter 30s rise 2 fall 3 weight 100
    server vllm2 vllm-2:8000 check inter 30s rise 2 fall 3 weight 100

    #
健康检查
    http-check expect status 200
    option httpchk GET /health
    http-check expect string OK

    #
响应超时
    timeout server 300s

    #
会话保持
    stick-table type string len 32 size 100k expire 30m
    stick on req.cookie(X-Session-Id)

#
监控端点
frontend monitoring
    bind *:8081
    mode http
    default_backend metrics_backend

backend metrics_backend
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth admin:admin
```

##
容器化性能优化

### 内存与显存配置优化

大模型推理是内存密集型任务,合理配置内存和显存参数对性能影响显著。以下是各主要推理框架的内存优化配置:

**vLLM内存配置:**

```yaml
environment:
  - GPU_MEMORY_UTILIZATION=0.9        # GPU
显存使用率(默认0.9
  - MAX_MODEL_LEN=8192                # 最大模型长度
  - KV_CACHE_FP8_QUANTIZED=true      # KV缓存量化(需要GPU支持)
  - BLOCK_SIZE=16                     # KV缓存块大小
  - NUM_PREEMPTIVE_SAMPLES=4         # 抢占式采样数
```

`GPU_MEMORY_UTILIZATION`
控制vLLM可以使用的GPU显存比例。默认值0.9预留10%显存用于KV缓存和其他开销。如果遇到OOM错误,可以适当降低到0.80.7

**PyTorch内存配置:**

```python
import torch

#
开启梯度检查点以节省显存
model.gradient_checkpointing_enable()

#
设置PyTorch内存分配器策略
torch.cuda.set_per_process_memory_fraction(0.9)

#
启用TF32计算(Ampere架构GPU
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

**Transformer
模型内存优化:**

```python
from transformers import AutoModelForCausalLM

#
启用Flash Attention(需要安装flash-attn
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    torch_dtype=torch.float16,
    use_flash_attention_2=True  #
需要flash-attn
)

# 4-bit
量化加载
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GPU
分配策略

在多GPU环境下,选择合适的GPU分配策略对整体吞吐量至关重要。

**GPU单容器**是最简单的部署模式,适合模型可以完整加载到单张GPU的场景。配置简单,但GPU利用率可能不够高:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

**
GPU张量并行**将单个模型的权重分布到多张GPU上,适合超大模型或需要极致吞吐量的场景:

```bash
#
使用vLLM的张量并行
docker run --gpus '"device=0,1,2,3"' \
    -e TENSOR_PARALLEL_SIZE=4 \
    vllm/vllm-openai:latest
```

**
多实例负载均衡**在每张GPU上运行独立的模型实例,通过负载均衡器分发请求。这种方式的吞吐量通常高于张量并行,因为每个实例可以独立进行批量处理:

```yaml
services:
  vllm-gpu0:
    # ...
配置...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0

  vllm-gpu1:
    # ...
配置...
   
deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=1
```

###
批量推理优化

批量推理是提高GPU利用率的有效手段。合理的批处理配置可以将吞吐量提升数倍。

**动态批处理(Dynamic Batching**由推理服务框架自动管理,自动将多个请求合并成批次处理:

```yaml
environment:
  # vLLM
动态批处理配置
  - MAX_NUM_BATCHED_TOKENS=8192      # 单批次最大token
  - MAX_NUM_SEQUENCES=256            # 单批次最大序列数
  - SERVING_NAME=Qwen-7B             # 模型名称
```

**
连续批处理(Continuous Batching**vLLM的默认批处理策略,允许在生成过程中动态添加新请求,无需等待整个批次完成:

```python
# vLLM
连续批处理示例
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")

#
批量推理请求
prompts = [
    "
请介绍一下Docker: ",
    "
什么是Kubernetes: ",
    "
解释一下微服务架构: "
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["\n", "
用户:", "assistant:"]
)

#
批量生成
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**
预批处理(Prefix Caching**对于具有相同系统提示的对话场景,可以复用已计算的KV缓存,显著降低首token延迟:

```yaml
environment:
  - ENABLE_PREFIX_CACHING=true    #
启用前缀缓存
  - NUM_SCHEDULER_STEPS=8          # 调度步数
```

###
推理延迟优化

在线推理场景对延迟敏感,需要综合优化以降低响应时间。

**流式输出**允许在生成过程中就开始返回token,显著降低用户的感知延迟:

```python
#
使用流式输出
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
    stream=True  #
启用流式输出
)

#
流式生成
for output in llm.generate(["请介绍一下人工智能的发展历史"], sampling_params):
    # output
包含已生成的token
    print(output.outputs[0].text, end="", flush=True)
```

**KV
缓存量化**可以显著减少显存占用,允许更大的批量或更长的上下文:

```yaml
environment:
  - KV_CACHE_DTYPE=fp8              # KV
缓存数据类型
```

**CUDA Graph**
通过捕获和重放GPU操作来减少内核启动开销:

```yaml
environment:
  - ENABLE_CHUNKED_PREFILL=false     #
禁用分块预填充以降低延迟
```

## Docker
安全加固

### 容器安全基础配置

大模型推理服务通常处理敏感数据,容器安全加固至关重要。

**以非root用户运行容器**是基本的安全最佳实践:

```dockerfile
# Dockerfile
中添加
RUN groupadd -r llmuser && useradd -r -g llmuser llmuser

# ...
安装依赖 ...

USER llmuser
WORKDIR /home/llmuser

CMD ["python", "app.py"]
```

docker-compose.yml
中也可以指定运行用户:

```yaml
services:
  vllm-backend:
    image: vllm/vllm-openai:latest
    user: "1000:1000"  #
指定UID:GID
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
```

**
限制容器能力**通过`cap_drop`移除不必要的Linux能力:

```yaml
services:
  vllm-backend:
    image: vllm/vllm-openai:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  #
仅添加需要的capability
```

**
启用SELinux/AppArmor**为容器提供额外的访问控制:

```bash
# Docker
运行时的AppArmor配置
docker run --security-opt "apparmor=docker-default" \
    --gpus all \
    vllm/vllm-openai:latest
```

###
网络安全配置

限制容器的网络访问可以减少攻击面。

**使用自定义网络**隔离容器通信:

```yaml
networks:
  llm-internal:
    driver: bridge
    internal: true  #
禁止容器访问外部网络

  llm-public:
    driver: bridge
```


不同服务分配到不同网络层:

```yaml
services:
  #
仅内部网络访问的推理服务
  vllm-backend:
    networks:
      - llm-internal
    #
不暴露任何端口

  # 可访问公网的API网关
  api-gateway:
    networks:
      - llm-public
      - llm-internal
    ports:
      - "80:80"
```

**TLS/SSL
加密**保护传输中的数据:

```yaml
services:
  api-gateway:
    image: nginx:alpine
    volumes:
      - ./ssl/cert.pem:/etc/nginx/ssl/cert.pem:ro
      - ./ssl/key.pem:/etc/nginx/ssl/key.pem:ro
    command: |
      nginx -c /etc/nginx/nginx.conf -g 'daemon off;'
```

###
镜像安全扫描

构建和部署前扫描镜像漏洞是安全流程的重要环节。

**使用Trivy扫描镜像:**

```bash
#
安装Trivy
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image vllm/vllm-openai:latest

#
扫描已知漏洞
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL \
    vllm/vllm-openai:latest

#
输出漏洞报告
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./reports:/reports \
    aquasec/trivy image \
    --format json --output /reports/vuln-report.json \
    vllm/vllm-openai:latest
```

**Docker Scout**
Docker官方提供的镜像安全分析工具:

```bash
#
登录Docker Hub
docker scout login

#
分析镜像
docker scout cves vllm/vllm-openai:latest

#
比较两个镜像版本
docker scout compare \
    --from vllm/vllm-openai:0.1 \
    --to vllm/vllm-openai:0.2
```

**CI/CD
集成安全扫描:**

```yaml
# .github/workflows/security-scan.yml
name: Security Scan

on: [push, pull_request]

jobs:
  trivy-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'vllm/vllm-openai:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'
```

###
运行时安全监控

部署后的容器安全监控同样重要。

**Docker内置安全审计:**

```bash
#
查看容器运行状态
docker stats

#
检查容器权限
docker inspect --format '{{.HostConfig.Privileged}}' <container>

#
查看容器的Linux capabilities
docker inspect --format '{{.HostConfig.CapAdd}}' <container>
```

**Falco**
CNCF旗下的容器运行时安全监控工具:

```yaml
# docker-compose-falco.yml
version: '3.8'

services:
  falco:
    image: falcosecurity/falco:latest
    container_name: falco
    privileged: true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /host:/host:ro
      - ./falco-rules.yaml:/etc/falco/falco_rules.yaml:ro
    environment:
      - FALCO
菲尔斯_ENABLED=true
    network_mode: host
```

Falco
规则示例(falco-rules.yaml):

```yaml
#
检测容器内运行shell
- rule: Shell spawned in container
  desc: A shell was spawned by a non-shell program in a container
  condition: >
    spawned_process and
    container and
    proc.name = bash
  output: >
    Shell spawn detected (user=%user.name container_id=%container.id
    container_name=%container.name shell=%proc.name parent=%proc.pname
    cmdline=%proc.cmdline)
  priority: NOTICE

#
检测特权容器
- rule: Privileged container started
  desc: Privileged container started
  condition: >
    container and
    container.privileged = true
  output: >
    Privileged container started (id=%container.id
    image=%container.image.repository
    user=%user.name)
  priority: CRITICAL
```

## Kubernetes
部署大模型

### kubectl基础配置

KubernetesK8s)是生产级容器编排的事实标准,掌握kubectl基础操作是部署大模型的必备技能。

**集群上下文配置:**

```bash
#
查看当前集群上下文
kubectl config current-context

#
列出所有上下文
kubectl config get-contexts

#
切换上下文
kubectl config use-context production-cluster

#
查看集群信息
kubectl cluster-info
```

**Namespace
管理:**

```bash
#
创建命名空间
kubectl create namespace llm-production

#
切换默认命名空间
kubectl config set-context --current --namespace=llm-production

#
在指定命名空间操作
kubectl get pods -n llm-production
```

**Deployment
创建:**

```yaml
# llm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-7b-deployment
  namespace: llm-production
  labels:
    app: qwen-7b
    version: v1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-7b
  template:
    metadata:
      labels:
        app: qwen-7b
        version: v1
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: MODEL_NAME
          value: "Qwen/Qwen-7B-Chat"
        - name: GPU_MEMORY_UTILIZATION
          value: "0.85"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: CUDA_VISIBLE_DEVICES
          value: "all"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      nodeSelector:
        gpu-type: nvidia-rtx4090
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```

**Service
创建:**

```yaml
# llm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: qwen-7b-service
  namespace: llm-production
spec:
  type: ClusterIP
  selector:
    app: qwen-7b
  ports:
  - name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
```

**
应用配置:**

```bash
kubectl apply -f llm-deployment.yaml
kubectl apply -f llm-service.yaml

#
查看部署状态
kubectl get deployments -n llm-production
kubectl get pods -n llm-production

#
查看日志
kubectl logs -n llm-production -l app=qwen-7b --tail=100

#
扩缩容
kubectl scale deployment qwen-7b-deployment --replicas=4 -n llm-production

#
更新镜像版本
kubectl set image deployment/qwen-7b-deployment \
    vllm=vllm/vllm-openai:v0.2.0 -n llm-production

#
回滚
kubectl rollout undo deployment/qwen-7b-deployment -n llm-production
```

### Helm Chart
部署

HelmKubernetes的包管理工具,使用Helm Chart可以简化大模型服务的部署和配置管理。

**安装Helm**

```bash
# Linux/macOS
curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz | tar -xz
sudo mv linux-amd64/helm /usr/local/bin/helm

# Windows (
使用Chocolatey)
choco install kubernetes-helm
```

**
创建Helm Chart**

```bash
helm create llm-service
cd llm-service
```

**Chart
目录结构:**

```
llm-service/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── _helpers.tpl
└── .helmignore
```

**Chart.yaml
**

```yaml
apiVersion: v2
name: llm-service
description: A Helm chart for LLM inference service
type: application
version: 1.0.0
appVersion: "latest"
keywords:
  - llm
  - vllm
  - inference
maintainers:
  - name: LLM Team
    email: llm-team@example.com
dependencies:
  - name: common
    version: "1.x.x"
    repository: "https://charts.bitnami.com/bitnami"
```

**values.yaml
**

```yaml
replicaCount: 2

image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8000

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 32Gi
    cpu: 8
  requests:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 4

env:
  MODEL_NAME: "Qwen/Qwen-7B-Chat"
  GPU_MEMORY_UTILIZATION: "0.85"
  MAX_MODEL_LEN: "8192"

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: llm.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: llm-tls
      hosts:
        - llm.example.com

persistence:
  enabled: true
  storageClass: nfs-client
  size: 100Gi

nodeSelector:
  gpu-type: nvidia-rtx4090

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - llm-service
          topologyKey: kubernetes.io/hostname
```

**
部署Helm Chart**

```bash
#
添加依赖
helm dependency build llm-service

#
部署
helm install qwen-7b llm-service \
    --namespace llm-production \
    --create-namespace \
    --values llm-service/values.yaml

#
升级
helm upgrade qwen-7b llm-service \
    --namespace llm-production \
    --values llm-service/values-prod.yaml

#
查看部署状态
helm status qwen-7b -n llm-production

#
列出部署
helm list -n llm-production

#
删除部署
helm uninstall qwen-7b -n llm-production
```

### GPU
调度与资源管理

Kubernetes原生支持GPU调度,但需要正确配置才能生效。

**检查GPU资源:**

```bash
#
查看节点GPU信息
kubectl describe nodes | grep -A 10 "nvidia.com/gpu"

#
查看GPU可用数量
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

**GPU
配额管理:**

```yaml
# ResourceQuota
限制命名空间GPU使用
apiVersion: v1
kind: ResourceQuota
metadata:
  name: llm-gpu-quota
  namespace: llm-production
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
---
# LimitRange
设置默认值
apiVersion: v1
kind: LimitRange
metadata:
  name: llm-limits
  namespace: llm-production
spec:
  limits:
  - max:
      nvidia.com/gpu: 2
    default:
      nvidia.com/gpu: 1
    defaultRequest:
      nvidia.com/gpu: 1
    type: Container
```

**
时间片调度(Time Slicing):** 在共享GPU集群中,可以让多个容器分时使用同一张GPU

```yaml
# device-plugin
配置
apiVersion: apps/v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4  #
1GPU划分为4个时间片
```

###
监控与日志集成

生产环境需要完善的监控和日志系统。

**Prometheus监控:**

```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'vllm'
      static_configs:
      - targets: ['vllm-service.llm-production:8000']
      metrics_path: /metrics
```

**Grafana
仪表板:**

```yaml
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  vllm-dashboard.json: |
    {
      "dashboard": {
        "title": "vLLM Inference Dashboard",
        "panels": [
          {
            "title": "Requests per Second",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(vllm_requests_total[5m])",
                "legendFormat": "{{model}}"
              }
            ]
          },
          {
            "title": "Average Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))",
                "legendFormat": "p95"
              }
            ]
          },
          {
            "title": "GPU Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "vllm_gpu_memory_usage_bytes / vllm_gpu_memory_total_bytes",
                "legendFormat": "Used {{gpu}}"
              }
            ]
          }
        ]
      }
    }
```

##
实战案例:docker-compose一键部署Qwen-7B推理服务

4
deployment_flow.png

###
环境准备

在开始部署前,需要确保环境满足以下要求:

**硬件要求:** NVIDIA GPU(至少16GB显存,推荐RTX 4080或更高),8核以上CPU32GB系统内存,100GB可用磁盘空间。

**软件要求:** Ubuntu 20.04+CentOS 8+NVIDIA Driver 535+Docker 24+NVIDIA Container Toolkit

**环境验证:**

```bash
#
检查NVIDIA驱动
nvidia-smi
#
预期输出应显示GPU型号和驱动版本

# 检查Docker
docker --version
#
预期输出:Docker version 24.x.x

#
检查nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
#
预期输出应显示GPU信息
```

###
编写docker-compose配置

创建项目目录并编写docker-compose.yml

```bash
mkdir -p ~/llm-deploy/qwen-7b
cd ~/llm-deploy/qwen-7b
```

```yaml
# docker-compose.yml
version: '3.8'

services:
  # Qwen-7B
推理服务
  qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    hostname: qwen7b-inference
    ports:
      - "8000:8000"
    environment:
      #
模型配置
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      #
性能优化
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - MAX_NUM_BATCHED_TOKENS=8192
      - MAX_NUM_SEQUENCES=256
      #
推理优化
      - ENABLE_CHUNKED_PREFILL=true
      - ENABLE_PREFIX_CACHING=true
      # GPU
配置
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=all
      #
日志配置
      - LOG_LEVEL=INFO
    volumes:
      # HuggingFace
模型缓存(避免重复下载)
      - model_cache:/root/.cache/huggingface
      # vLLM
服务日志
      - ./logs:/var/log/vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'  #
共享内存(对vLLM重要)
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - qwen-net

  # API
网关(Nginx
  api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      qwen7b-inference:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - qwen-net

  # Prometheus
监控
  prometheus:
    image: prom/prometheus:latest
    container_name: qwen7b-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
    networks:
      - qwen-net

  # Grafana
可视化
  grafana:
    image: grafana/grafana:latest
    container_name: qwen7b-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - qwen-net

networks:
  qwen-net:
    driver: bridge

volumes:
  model_cache:
```

### Nginx
配置

```bash
mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs
```

```nginx
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    #
日志格式
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log warn;

    #
性能优化
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Gzip
压缩
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    #
限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {
        server qwen7b-inference:8000;
        keepalive 64;
    }

    server {
        listen 80;
        server_name _;

        #
健康检查
        location /health {
            access_log off;
            return 200 "OK";
            add_header Content-Type text/plain;
        }

        # OpenAI
兼容API - Chat Completions
        location /v1/chat/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            #
代理头设置
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            #
流式响应(关键配置)
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;

            # SSE
必需配置
            proxy_set_header Connection '';
            chunked_transfer_encoding on;

            #
记录响应时间
            add_header X-Response-Time $upstream_response_time;
        }

        # OpenAI
兼容API - Completions
        location /v1/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        #
模型列表
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # vLLM
直接接口
        location /v1 {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Prometheus
指标(仅内网访问)
        location /metrics {
            proxy_pass http://prometheus:9090;
            proxy_set_header Host $host;
            allow 127.0.0.1;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            deny all;
        }
    }
}
```

### Prometheus
配置

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['qwen7b-inference:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
```

###
启动服务

```bash
#
进入项目目录
cd ~/llm-deploy/qwen-7b

#
拉取镜像(提前完成,节省首次启动时间)
docker-compose pull

#
启动服务(后台运行)
docker-compose up -d

#
查看服务状态
docker-compose ps

#
查看日志
docker-compose logs -f qwen7b-inference

#
等待服务就绪后测试
# 首次启动需要下载模型,可能需要几分钟
sleep 120

#
健康检查
curl http://localhost:8000/health

#
模型列表
curl http://localhost:8000/v1/models

#
测试推理
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-7B-Chat",
    "messages": [
      {"role": "system", "content": "
你是一个有帮助的AI助手。"},
      {"role": "user", "content": "
请介绍一下Docker容器化技术的优势。"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

###
性能测试

```bash
#
安装测试工具
pip install locust

#
创建负载测试脚本 locustfile.py
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between
import json

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://localhost"

    @task(3)
    def chat_completion(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "system", "content": "
你是一个有帮助的AI助手。"},
                {"role": "user", "content": "
请用100字介绍一下人工智能。"}
            ],
            "max_tokens": 200,
            "temperature": 0.7,
            "stream": False
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

    @task(1)
    def chat_completion_stream(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "user", "content": "
解释一下什么是机器学习。"}
            ],
            "max_tokens": 256,
            "temperature": 0.7,
            "stream": True
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
EOF

#
运行负载测试(10个并发用户,持续60秒)
locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

# Web
界面测试(可选)
locust -f locustfile.py --web-port 8089
```

##
容器化后的Spring Boot对接方案

### 概述

在企业级应用中,Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后,如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

### RestTemplate对接方案

最直接的对接方式是使用SpringRestTemplateWebClient调用大模型API

```java
package com.example.llm.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(10000);      //
连接超时10
        factory.setReadTimeout(120000);         // 读取超时120秒(大模型推理可能较长)
        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);
    }
}
```

### OpenAI
兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务,封装一个统一的客户端:

```java
package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.Data;
import lombok.Builder;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

import java.util.List;
import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;
    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {
        this.restTemplate = restTemplate;
        this.baseUrl = baseUrl;
    }

    /**
     *
聊天补全
     */
    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(
            url,
            HttpMethod.POST,
            entity,
            ChatCompletionResponse.class
        );

        return response.getBody();
    }

    /**
     *
流式聊天补全
     */
    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(
                url,
                HttpMethod.POST,
                entity,
                new ParameterizedTypeReference<ServerSentEvent<String>>() {}
            )
            .getBody()
            .stream()
            .map(ServerSentEvent::data);
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionRequest {
        private String model;
        private List<Message> messages;
        private Double temperature;
        @JsonProperty("max_tokens")
        private Integer maxTokens;
        @JsonProperty("top_p")
        private Double topP;
        private Boolean stream;
        private List<String> stop;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Message {
        private String role;
        private String content;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionResponse {
        private String id;
        private String object;
        private Long created;
        private String model;
        private List<Choice> choices;
        private Usage usage;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Choice {
        private Integer index;
        private Message message;
        @JsonProperty("finish_reason")
        private String finishReason;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Usage {
        @JsonProperty("prompt_tokens")
        private Integer promptTokens;
        @JsonProperty("completion_tokens")
        private Integer completionTokens;
        @JsonProperty("total_tokens")
        private Integer totalTokens;
    }
}
```

### Spring Boot
服务实现

```java
package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;
import com.example.llm.client.LLMClient.ChatCompletionResponse;
import com.example.llm.client.LLMClient.LLMClient;
import com.example.llm.client.LLMClient.Message;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@RequiredArgsConstructor
@Slf4j
public class LLMService {

    private final LLMClient llmClient;

    /**
     *
同步对话
     */
    public String chat(String userMessage) {
        return chat(userMessage, null);
    }

    /**
     *
同步对话(带系统提示)
     */
    public String chat(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {
            ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {
                return response.getChoices().get(0).getMessage().getContent();
            }
        } catch (Exception e) {
            log.error("LLM
调用失败", e);
            throw new RuntimeException("LLM
服务调用失败: " + e.getMessage(), e);
        }

        return null;
    }

    /**
     *
流式对话
     */
    public Flux<String> chatStream(String userMessage) {
        return chatStream(userMessage, null);
    }

    /**
     *
流式对话(带系统提示)
     */
    public Flux<String> chatStream(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)
            .filter(data -> !data.startsWith("[DONE]"))
            .map(data -> {
                //
解析SSE数据
                if (data.startsWith("data:")) {
                    data = data.substring(5).trim();
                }
                return data;
            });
    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {
        List<Message> messages = Arrays.asList(
            Message.builder()
                .role("system")
                .content(systemPrompt != null ? systemPrompt : "
你是一个有帮助的AI助手。")
                .build(),
            Message.builder()
                .role("user")
                .content(userMessage)
                .build()
        );

        return ChatCompletionRequest.builder()
            .model("Qwen/Qwen-7B-Chat")
            .messages(messages)
            .temperature(0.7)
            .maxTokens(2048)
            .stream(stream)
            .build();
    }

    /**
     *
异步对话
     */
    public CompletableFuture<String> chatAsync(String userMessage) {
        return CompletableFuture.supplyAsync(() -> chat(userMessage))
            .orTimeout(120, TimeUnit.SECONDS);
    }

    /**
     *
批量对话
     */
    public List<String> batchChat(List<String> messages) {
        return messages.parallelStream()
            .map(this::chat)
            .toList();
    }
}
```

###
控制器实现

```java
package com.example.llm.controller;

import com.example.llm.service.LLMService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
@RequiredArgsConstructor
@Slf4j
public class LLMController {

    private final LLMService llmService;

    /**
     *
同步对话接口
     */
    @PostMapping("/chat")
    public Map<String, Object> chat(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("
收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();
        result.put("code", 200);
        result.put("message", "success");
        result.put("data", response);

        return result;
    }

    /**
     *
流式对话接口
     */
    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("
收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)
            .map(data -> ServerSentEvent.<String>builder()
                .data(data)
                .build()
            )
            .onErrorResume(e -> {
                log.error("
流式对话异常", e);
                return Mono.just(ServerSentEvent.<String>builder()
                    .data("[ERROR]" + e.getMessage())
                    .build());
            })
            .concatWith(Mono.just(ServerSentEvent.<String>builder()
                .data("[DONE]")
                .build()));
    }

    /**
     *
异步对话接口
     */
    @PostMapping("/chat/async")
    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");

        log.info("
收到异步聊天请求: {}", userMessage);

        //
立即返回,后续异步处理
        llmService.chatAsync(userMessage)
            .thenAccept(response -> {
                log.info("
异步任务完成: {}", response);
                //
这里可以添加回调逻辑,如发送WebSocket消息
            })
            .exceptionally(e -> {
                log.error("
异步任务失败", e);
                return null;
            });

        Map<String, Object> result = new HashMap<>();
        result.put("code", 202);
        result.put("message", "
请求已接受,正在异步处理");

        return result;
    }
}
```

###
熔断与重试配置

```java
package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
@Slf4j
public class ResilienceConfig {

    @Bean
    public CircuitBreaker llmCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    //
失败率阈值50%
            .slowCallRateThreshold(80)                    //
慢调用率阈值80%
            .slowCallDurationThreshold(Duration.ofSeconds(30))  //
慢调用阈值30
            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟
            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                        //
滑动窗口大小
            .minimumNumberOfCalls(5)                      // 最小调用次数
            .build();

        return CircuitBreaker.of("llmService", config);
    }

    @Bean
    public Retry llmRetry() {
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)                               //
最大重试次数
            .waitDuration(Duration.ofSeconds(2))          // 重试间隔
            .retryExceptions(Exception.class)             // 重试的异常类型
            .ignoreExceptions(IllegalArgumentException.class)  // 忽略的异常
            .build();

        return Retry.of("llmService", config);
    }
}
```

###
应用配置

```yaml
# application.yml
spring:
  application:
    name: llm-spring-boot-client

server:
  port: 8080

llm:
  #
大模型服务地址
  base-url: http://localhost:80
  #
连接超时(毫秒)
  connect-timeout: 10000
  #
读取超时(毫秒)
  read-timeout: 120000

#
熔断器配置
resilience4j:
  circuitbreaker:
    configs:
      default:
        failureRateThreshold: 50
        waitDurationInOpenState: 60s
    instances:
      llmService:
        registerHealthIndicator: true
        slidingWindowSize: 10

#
日志配置
logging:
  level:
    com.example.llm: DEBUG
    org.springframework.web.client: DEBUG
```

### Docker
Spring Boot应用

如果需要将Spring Boot应用也容器化部署,可以编写Dockerfile

```dockerfile
# Multi-stage build
FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

#
复制Maven配置
COPY pom.xml .
COPY src ./src

#
下载依赖(利用缓存)
RUN ./mvnw dependency:go-offline -B

#
构建
RUN ./mvnw package -DskipTests -B

#
运行阶段
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

#
安全配置
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
RUN chown -R appuser:appgroup /app

#
复制jar
COPY --from=builder /app/target/*.jar app.jar

#
复制模型缓存(可选,共享HuggingFace缓存)
VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]
```


相应的docker-compose配置:

```yaml
version: '3.8'

services:
  spring-boot-app:
    build: ./spring-boot-app
    container_name: spring-boot-llm-client
    ports:
      - "8080:8080"
    environment:
      - LLM_BASE_URL=http://qwen7b-gateway:80
      - SPRING_PROFILES_ACTIVE=production
    depends_on:
      - api-gateway
    networks:
      - qwen-net

  api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - qwen7b-inference
    networks:
      - qwen-net

  qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'
    networks:
      - qwen-net

networks:
  qwen-net:
    driver: bridge

volumes:
  model_cache:
```

##
总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍,我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾:**

1. **
环境配置**:正确安装nvidia-container-toolkitGPU容器化的前提,CUDA版本兼容性需要仔细规划。

2. **镜像制作**:选择合适的Base Image,合理利用多阶段构建减小镜像体积,通过层缓存优化构建速度。

3. **服务编排**Docker Compose适合单主机多容器编排,Kubernetes适合大规模生产部署。

4. **性能优化**:合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**:以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**:通过封装统一的API客户端,Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展,大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

3

大模型Docker镜像制作

Base Image选择策略

选择合适的基础镜像是构建大模型Docker镜像的第一步。NVIDIA官方维护的CUDA基础镜像是最佳选择,它们经过严格测试,包含了经过验证的驱动和库组合。

**Ubuntu基础镜像 vs CentOS基础镜像:** Ubuntu是目前主流的Docker镜像操作系统,拥有更广泛的软件包支持和更快的安全更新。对于需要频繁更新安全补丁的生产环境,Ubuntu是更好的选择。CentOS则以其稳定性著称,适合对系统一致性要求极高的企业级场景。

**-devel镜像 vs runtime镜像:** NVIDIA CUDA镜像分为`devel`和`runtime`两个版本。`devel`镜像包含完整的编译工具链(gcc、nvcc等),体积较大(约十几GB),适合需要编译CUDA程序或构建依赖CUDA的程序场景。`runtime`镜像只包含运行时库,体积较小(约5GB),适合直接运行已编译好的CUDA程序。对于大模型推理服务,推荐使用`runtime`镜像以减小镜像体积和攻击面。

** alpine基础镜像的局限性:** 部分大模型官方镜像使用alpine作为基础系统以减小体积。但alpine的libc为musl而非glibc,可能导致某些Python扩展(如numpy、scipy的底层C库)编译或运行问题。如果遇到奇怪的动态链接错误,可能需要切换到Ubuntu或CentOS基础镜像。

CUDA版本与依赖环境配置

构建大模型Docker镜像时,需要在Dockerfile中显式指定CUDA版本和必要的依赖环境。以下是一个典型的Dockerfile结构:


### 集群模式部署配置

对于大规模推理服务,需要支持多实例部署和负载均衡。以下是一个支持水平扩展的集群部署配置:

```yaml
# docker-compose-cluster.yml
version: '3.8'

services:
  # vLLM
推理实例1
  vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-1
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8001:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

  # vLLM
推理实例2
  vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-2
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.85
      - MAX_MODEL_LEN=8192
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-net
    ports:
      - "8002:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 180s

  # HAProxy
负载均衡
  haproxy:
    image: haproxy:alpine
    container_name: haproxy
    ports:
      - "80:8080"
      - "8404:8404"  # HAProxy stats page
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    depends_on:
      - vllm-1
      - vllm-2
    networks:
      - llm-net
    healthcheck:
      test: ["CMD", "haproxy", "-c", "-f", "/usr/local/etc/haproxy/haproxy.cfg"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Redis Session
存储
  redis:
    image: redis:7-alpine
    container_name: redis-session
    command: redis-server --appendonly yes --maxmemory 4gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - llm-net
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

networks:
  llm-net:
    driver: bridge

volumes:
  model_cache:
  redis_data:
```

HAProxy
配置(haproxy.cfg):

```cfg
global
    log stdout format raw local0
    maxconn 4096
    stats socket /var/run/hapee-listen.sock mode 600 level admin
    stats timeout 2m

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  http-server-close
    option  forwardfor except 127.0.0.0/8
    timeout connect 30s
    timeout client  180s
    timeout server  300s
    timeout tunnel  600s
    timeout check   5s

#
统计页面
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /
    stats refresh 30s

#
推理服务负载均衡
frontend llm_frontend
    bind *:8080
    default_backend llm_backend

backend ll
m_backend
    mode http
    balance roundrobin
    option tcp-check
    tcp-check connect
    tcp-check expect string X-Request-Id

    # vLLM
实例
    server vllm1 vllm-1:8000 check inter 30s rise 2 fall 3 weight 100
    server vllm2 vllm-2:8000 check inter 30s rise 2 fall 3 weight 100

    #
健康检查
    http-check expect status 200
    option httpchk GET /health
    http-check expect string OK

    #
响应超时
    timeout server 300s

    #
会话保持
    stick-table type string len 32 size 100k expire 30m
    stick on req.cookie(X-Session-Id)

#
监控端点
frontend monitoring
    bind *:8081
    mode http
    default_backend metrics_backend

backend metrics_backend
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics
    stats auth admin:admin
```

##
容器化性能优化

### 内存与显存配置优化

大模型推理是内存密集型任务,合理配置内存和显存参数对性能影响显著。以下是各主要推理框架的内存优化配置:

**vLLM内存配置:**

```yaml
environment:
  - GPU_MEMORY_UTILIZATION=0.9        # GPU
显存使用率(默认0.9
  - MAX_MODEL_LEN=8192                # 最大模型长度
  - KV_CACHE_FP8_QUANTIZED=true      # KV缓存量化(需要GPU支持)
  - BLOCK_SIZE=16                     # KV缓存块大小
  - NUM_PREEMPTIVE_SAMPLES=4         # 抢占式采样数
```

`GPU_MEMORY_UTILIZATION`
控制vLLM可以使用的GPU显存比例。默认值0.9预留10%显存用于KV缓存和其他开销。如果遇到OOM错误,可以适当降低到0.80.7

**PyTorch内存配置:**

```python
import torch

#
开启梯度检查点以节省显存
model.gradient_checkpointing_enable()

#
设置PyTorch内存分配器策略
torch.cuda.set_per_process_memory_fraction(0.9)

#
启用TF32计算(Ampere架构GPU
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

**Transformer
模型内存优化:**

```python
from transformers import AutoModelForCausalLM

#
启用Flash Attention(需要安装flash-attn
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    torch_dtype=torch.float16,
    use_flash_attention_2=True  #
需要flash-attn
)

# 4-bit
量化加载
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    quantization_config=quantization_config,
    device_map="auto"
)
```

### GPU
分配策略

在多GPU环境下,选择合适的GPU分配策略对整体吞吐量至关重要。

**GPU单容器**是最简单的部署模式,适合模型可以完整加载到单张GPU的场景。配置简单,但GPU利用率可能不够高:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

**
GPU张量并行**将单个模型的权重分布到多张GPU上,适合超大模型或需要极致吞吐量的场景:

```bash
#
使用vLLM的张量并行
docker run --gpus '"device=0,1,2,3"' \
    -e TENSOR_PARALLEL_SIZE=4 \
    vllm/vllm-openai:latest
```

**
多实例负载均衡**在每张GPU上运行独立的模型实例,通过负载均衡器分发请求。这种方式的吞吐量通常高于张量并行,因为每个实例可以独立进行批量处理:

```yaml
services:
  vllm-gpu0:
    # ...
配置...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0

  vllm-gpu1:
    # ...
配置...
   
deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=1
```

###
批量推理优化

批量推理是提高GPU利用率的有效手段。合理的批处理配置可以将吞吐量提升数倍。

**动态批处理(Dynamic Batching**由推理服务框架自动管理,自动将多个请求合并成批次处理:

```yaml
environment:
  # vLLM
动态批处理配置
  - MAX_NUM_BATCHED_TOKENS=8192      # 单批次最大token
  - MAX_NUM_SEQUENCES=256            # 单批次最大序列数
  - SERVING_NAME=Qwen-7B             # 模型名称
```

**
连续批处理(Continuous Batching**vLLM的默认批处理策略,允许在生成过程中动态添加新请求,无需等待整个批次完成:

```python
# vLLM
连续批处理示例
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")

#
批量推理请求
prompts = [
    "
请介绍一下Docker: ",
    "
什么是Kubernetes: ",
    "
解释一下微服务架构: "
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["\n", "
用户:", "assistant:"]
)

#
批量生成
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**
预批处理(Prefix Caching**对于具有相同系统提示的对话场景,可以复用已计算的KV缓存,显著降低首token延迟:

```yaml
environment:
  - ENABLE_PREFIX_CACHING=true    #
启用前缀缓存
  - NUM_SCHEDULER_STEPS=8          # 调度步数
```

###
推理延迟优化

在线推理场景对延迟敏感,需要综合优化以降低响应时间。

**流式输出**允许在生成过程中就开始返回token,显著降低用户的感知延迟:

```python
#
使用流式输出
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
    stream=True  #
启用流式输出
)

#
流式生成
for output in llm.generate(["请介绍一下人工智能的发展历史"], sampling_params):
    # output
包含已生成的token
    print(output.outputs[0].text, end="", flush=True)
```

**KV
缓存量化**可以显著减少显存占用,允许更大的批量或更长的上下文:

```yaml
environment:
  - KV_CACHE_DTYPE=fp8              # KV
缓存数据类型
```

**CUDA Graph**
通过捕获和重放GPU操作来减少内核启动开销:

```yaml
environment:
  - ENABLE_CHUNKED_PREFILL=false     #
禁用分块预填充以降低延迟
```

## Docker
安全加固

### 容器安全基础配置

大模型推理服务通常处理敏感数据,容器安全加固至关重要。

**以非root用户运行容器**是基本的安全最佳实践:

```dockerfile
# Dockerfile
中添加
RUN groupadd -r llmuser && useradd -r -g llmuser llmuser

# ...
安装依赖 ...

USER llmuser
WORKDIR /home/llmuser

CMD ["python", "app.py"]
```

docker-compose.yml
中也可以指定运行用户:

```yaml
services:
  vllm-backend:
    image: vllm/vllm-openai:latest
    user: "1000:1000"  #
指定UID:GID
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
```

**
限制容器能力**通过`cap_drop`移除不必要的Linux能力:

```yaml
services:
  vllm-backend:
    image: vllm/vllm-openai:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  #
仅添加需要的capability
```

**
启用SELinux/AppArmor**为容器提供额外的访问控制:

```bash
# Docker
运行时的AppArmor配置
docker run --security-opt "apparmor=docker-default" \
    --gpus all \
    vllm/vllm-openai:latest
```

###
网络安全配置

限制容器的网络访问可以减少攻击面。

**使用自定义网络**隔离容器通信:

```yaml
networks:
  llm-internal:
    driver: bridge
    internal: true  #
禁止容器访问外部网络

  llm-public:
    driver: bridge
```


不同服务分配到不同网络层:

```yaml
services:
  #
仅内部网络访问的推理服务
  vllm-backend:
    networks:
      - llm-internal
    #
不暴露任何端口

  # 可访问公网的API网关
  api-gateway:
    networks:
      - llm-public
      - llm-internal
    ports:
      - "80:80"
```

**TLS/SSL
加密**保护传输中的数据:

```yaml
services:
  api-gateway:
    image: nginx:alpine
    volumes:
      - ./ssl/cert.pem:/etc/nginx/ssl/cert.pem:ro
      - ./ssl/key.pem:/etc/nginx/ssl/key.pem:ro
    command: |
      nginx -c /etc/nginx/nginx.conf -g 'daemon off;'
```

###
镜像安全扫描

构建和部署前扫描镜像漏洞是安全流程的重要环节。

**使用Trivy扫描镜像:**

```bash
#
安装Trivy
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image vllm/vllm-openai:latest

#
扫描已知漏洞
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL \
    vllm/vllm-openai:latest

#
输出漏洞报告
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./reports:/reports \
    aquasec/trivy image \
    --format json --output /reports/vuln-report.json \
    vllm/vllm-openai:latest
```

**Docker Scout**
Docker官方提供的镜像安全分析工具:

```bash
#
登录Docker Hub
docker scout login

#
分析镜像
docker scout cves vllm/vllm-openai:latest

#
比较两个镜像版本
docker scout compare \
    --from vllm/vllm-openai:0.1 \
    --to vllm/vllm-openai:0.2
```

**CI/CD
集成安全扫描:**

```yaml
# .github/workflows/security-scan.yml
name: Security Scan

on: [push, pull_request]

jobs:
  trivy-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'vllm/vllm-openai:latest'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'
```

###
运行时安全监控

部署后的容器安全监控同样重要。

**Docker内置安全审计:**

```bash
#
查看容器运行状态
docker stats

#
检查容器权限
docker inspect --format '{{.HostConfig.Privileged}}' <container>

#
查看容器的Linux capabilities
docker inspect --format '{{.HostConfig.CapAdd}}' <container>
```

**Falco**
CNCF旗下的容器运行时安全监控工具:

```yaml
# docker-compose-falco.yml
version: '3.8'

services:
  falco:
    image: falcosecurity/falco:latest
    container_name: falco
    privileged: true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /host:/host:ro
      - ./falco-rules.yaml:/etc/falco/falco_rules.yaml:ro
    environment:
      - FALCO
菲尔斯_ENABLED=true
    network_mode: host
```

Falco
规则示例(falco-rules.yaml):

```yaml
#
检测容器内运行shell
- rule: Shell spawned in container
  desc: A shell was spawned by a non-shell program in a container
  condition: >
    spawned_process and
    container and
    proc.name = bash
  output: >
    Shell spawn detected (user=%user.name container_id=%container.id
    container_name=%container.name shell=%proc.name parent=%proc.pname
    cmdline=%proc.cmdline)
  priority: NOTICE

#
检测特权容器
- rule: Privileged container started
  desc: Privileged container started
  condition: >
    container and
    container.privileged = true
  output: >
    Privileged container started (id=%container.id
    image=%container.image.repository
    user=%user.name)
  priority: CRITICAL
```

## Kubernetes
部署大模型

### kubectl基础配置

KubernetesK8s)是生产级容器编排的事实标准,掌握kubectl基础操作是部署大模型的必备技能。

**集群上下文配置:**

```bash
#
查看当前集群上下文
kubectl config current-context

#
列出所有上下文
kubectl config get-contexts

#
切换上下文
kubectl config use-context production-cluster

#
查看集群信息
kubectl cluster-info
```

**Namespace
管理:**

```bash
#
创建命名空间
kubectl create namespace llm-production

#
切换默认命名空间
kubectl config set-context --current --namespace=llm-production

#
在指定命名空间操作
kubectl get pods -n llm-production
```

**Deployment
创建:**

```yaml
# llm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-7b-deployment
  namespace: llm-production
  labels:
    app: qwen-7b
    version: v1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-7b
  template:
    metadata:
      labels:
        app: qwen-7b
        version: v1
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: MODEL_NAME
          value: "Qwen/Qwen-7B-Chat"
        - name: GPU_MEMORY_UTILIZATION
          value: "0.85"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: CUDA_VISIBLE_DEVICES
          value: "all"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      nodeSelector:
        gpu-type: nvidia-rtx4090
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```

**Service
创建:**

```yaml
# llm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: qwen-7b-service
  namespace: llm-production
spec:
  type: ClusterIP
  selector:
    app: qwen-7b
  ports:
  - name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
```

**
应用配置:**

```bash
kubectl apply -f llm-deployment.yaml
kubectl apply -f llm-service.yaml

#
查看部署状态
kubectl get deployments -n llm-production
kubectl get pods -n llm-production

#
查看日志
kubectl logs -n llm-production -l app=qwen-7b --tail=100

#
扩缩容
kubectl scale deployment qwen-7b-deployment --replicas=4 -n llm-production

#
更新镜像版本
kubectl set image deployment/qwen-7b-deployment \
    vllm=vllm/vllm-openai:v0.2.0 -n llm-production

#
回滚
kubectl rollout undo deployment/qwen-7b-deployment -n llm-production
```

### Helm Chart
部署

HelmKubernetes的包管理工具,使用Helm Chart可以简化大模型服务的部署和配置管理。

**安装Helm**

```bash
# Linux/macOS
curl -fsSL https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz | tar -xz
sudo mv linux-amd64/helm /usr/local/bin/helm

# Windows (
使用Chocolatey)
choco install kubernetes-helm
```

**
创建Helm Chart**

```bash
helm create llm-service
cd llm-service
```

**Chart
目录结构:**

```
llm-service/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── _helpers.tpl
└── .helmignore
```

**Chart.yaml
**

```yaml
apiVersion: v2
name: llm-service
description: A Helm chart for LLM inference service
type: application
version: 1.0.0
appVersion: "latest"
keywords:
  - llm
  - vllm
  - inference
maintainers:
  - name: LLM Team
    email: llm-team@example.com
dependencies:
  - name: common
    version: "1.x.x"
    repository: "https://charts.bitnami.com/bitnami"
```

**values.yaml
**

```yaml
replicaCount: 2

image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8000

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 32Gi
    cpu: 8
  requests:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 4

env:
  MODEL_NAME: "Qwen/Qwen-7B-Chat"
  GPU_MEMORY_UTILIZATION: "0.85"
  MAX_MODEL_LEN: "8192"

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: llm.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: llm-tls
      hosts:
        - llm.example.com

persistence:
  enabled: true
  storageClass: nfs-client
  size: 100Gi

nodeSelector:
  gpu-type: nvidia-rtx4090

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - llm-service
          topologyKey: kubernetes.io/hostname
```

**
部署Helm Chart**

```bash
#
添加依赖
helm dependency build llm-service

#
部署
helm install qwen-7b llm-service \
    --namespace llm-production \
    --create-namespace \
    --values llm-service/values.yaml

#
升级
helm upgrade qwen-7b llm-service \
    --namespace llm-production \
    --values llm-service/values-prod.yaml

#
查看部署状态
helm status qwen-7b -n llm-production

#
列出部署
helm list -n llm-production

#
删除部署
helm uninstall qwen-7b -n llm-production
```

### GPU
调度与资源管理

Kubernetes原生支持GPU调度,但需要正确配置才能生效。

**检查GPU资源:**

```bash
#
查看节点GPU信息
kubectl describe nodes | grep -A 10 "nvidia.com/gpu"

#
查看GPU可用数量
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

**GPU
配额管理:**

```yaml
# ResourceQuota
限制命名空间GPU使用
apiVersion: v1
kind: ResourceQuota
metadata:
  name: llm-gpu-quota
  namespace: llm-production
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
---
# LimitRange
设置默认值
apiVersion: v1
kind: LimitRange
metadata:
  name: llm-limits
  namespace: llm-production
spec:
  limits:
  - max:
      nvidia.com/gpu: 2
    default:
      nvidia.com/gpu: 1
    defaultRequest:
      nvidia.com/gpu: 1
    type: Container
```

**
时间片调度(Time Slicing):** 在共享GPU集群中,可以让多个容器分时使用同一张GPU

```yaml
# device-plugin
配置
apiVersion: apps/v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4  #
1GPU划分为4个时间片
```

###
监控与日志集成

生产环境需要完善的监控和日志系统。

**Prometheus监控:**

```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'vllm'
      static_configs:
      - targets: ['vllm-service.llm-production:8000']
      metrics_path: /metrics
```

**Grafana
仪表板:**

```yaml
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  vllm-dashboard.json: |
    {
      "dashboard": {
        "title": "vLLM Inference Dashboard",
        "panels": [
          {
            "title": "Requests per Second",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(vllm_requests_total[5m])",
                "legendFormat": "{{model}}"
              }
            ]
          },
          {
            "title": "Average Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, rate(vllm_request_duration_seconds_bucket[5m]))",
                "legendFormat": "p95"
              }
            ]
          },
          {
            "title": "GPU Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "vllm_gpu_memory_usage_bytes / vllm_gpu_memory_total_bytes",
                "legendFormat": "Used {{gpu}}"
              }
            ]
          }
        ]
      }
    }
```

##
实战案例:docker-compose一键部署Qwen-7B推理服务

4
deployment_flow.png

###
环境准备

在开始部署前,需要确保环境满足以下要求:

**硬件要求:** NVIDIA GPU(至少16GB显存,推荐RTX 4080或更高),8核以上CPU32GB系统内存,100GB可用磁盘空间。

**软件要求:** Ubuntu 20.04+CentOS 8+NVIDIA Driver 535+Docker 24+NVIDIA Container Toolkit

**环境验证:**

```bash
#
检查NVIDIA驱动
nvidia-smi
#
预期输出应显示GPU型号和驱动版本

# 检查Docker
docker --version
#
预期输出:Docker version 24.x.x

#
检查nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
#
预期输出应显示GPU信息
```

###
编写docker-compose配置

创建项目目录并编写docker-compose.yml

```bash
mkdir -p ~/llm-deploy/qwen-7b
cd ~/llm-deploy/qwen-7b
```

```yaml
# docker-compose.yml
version: '3.8'

services:
  # Qwen-7B
推理服务
  qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    hostname: qwen7b-inference
    ports:
      - "8000:8000"
    environment:
      #
模型配置
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      #
性能优化
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - MAX_NUM_BATCHED_TOKENS=8192
      - MAX_NUM_SEQUENCES=256
      #
推理优化
      - ENABLE_CHUNKED_PREFILL=true
      - ENABLE_PREFIX_CACHING=true
      # GPU
配置
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=all
      #
日志配置
      - LOG_LEVEL=INFO
    volumes:
      # HuggingFace
模型缓存(避免重复下载)
      - model_cache:/root/.cache/huggingface
      # vLLM
服务日志
      - ./logs:/var/log/vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'  #
共享内存(对vLLM重要)
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - qwen-net

  # API
网关(Nginx
  api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      qwen7b-inference:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - qwen-net

  # Prometheus
监控
  prometheus:
    image: prom/prometheus:latest
    container_name: qwen7b-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
    networks:
      - qwen-net

  # Grafana
可视化
  grafana:
    image: grafana/grafana:latest
    container_name: qwen7b-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - qwen-net

networks:
  qwen-net:
    driver: bridge

volumes:
  model_cache:
```

### Nginx
配置

```bash
mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs
```

```nginx
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    #
日志格式
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log warn;

    #
性能优化
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Gzip
压缩
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    #
限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {
        server qwen7b-inference:8000;
        keepalive 64;
    }

    server {
        listen 80;
        server_name _;

        #
健康检查
        location /health {
            access_log off;
            return 200 "OK";
            add_header Content-Type text/plain;
        }

        # OpenAI
兼容API - Chat Completions
        location /v1/chat/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            #
代理头设置
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            #
流式响应(关键配置)
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;

            # SSE
必需配置
            proxy_set_header Connection '';
            chunked_transfer_encoding on;

            #
记录响应时间
            add_header X-Response-Time $upstream_response_time;
        }

        # OpenAI
兼容API - Completions
        location /v1/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        #
模型列表
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # vLLM
直接接口
        location /v1 {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Prometheus
指标(仅内网访问)
        location /metrics {
            proxy_pass http://prometheus:9090;
            proxy_set_header Host $host;
            allow 127.0.0.1;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            deny all;
        }
    }
}
```

### Prometheus
配置

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['qwen7b-inference:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
```

###
启动服务

```bash
#
进入项目目录
cd ~/llm-deploy/qwen-7b

#
拉取镜像(提前完成,节省首次启动时间)
docker-compose pull

#
启动服务(后台运行)
docker-compose up -d

#
查看服务状态
docker-compose ps

#
查看日志
docker-compose logs -f qwen7b-inference

#
等待服务就绪后测试
# 首次启动需要下载模型,可能需要几分钟
sleep 120

#
健康检查
curl http://localhost:8000/health

#
模型列表
curl http://localhost:8000/v1/models

#
测试推理
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-7B-Chat",
    "messages": [
      {"role": "system", "content": "
你是一个有帮助的AI助手。"},
      {"role": "user", "content": "
请介绍一下Docker容器化技术的优势。"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

###
性能测试

```bash
#
安装测试工具
pip install locust

#
创建负载测试脚本 locustfile.py
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between
import json

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://localhost"

    @task(3)
    def chat_completion(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "system", "content": "
你是一个有帮助的AI助手。"},
                {"role": "user", "content": "
请用100字介绍一下人工智能。"}
            ],
            "max_tokens": 200,
            "temperature": 0.7,
            "stream": False
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

    @task(1)
    def chat_completion_stream(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "user", "content": "
解释一下什么是机器学习。"}
            ],
            "max_tokens": 256,
            "temperature": 0.7,
            "stream": True
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
EOF

#
运行负载测试(10个并发用户,持续60秒)
locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

# Web
界面测试(可选)
locust -f locustfile.py --web-port 8089
```

##
容器化后的Spring Boot对接方案

### 概述

在企业级应用中,Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后,如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

### RestTemplate对接方案

最直接的对接方式是使用SpringRestTemplateWebClient调用大模型API

```java
package com.example.llm.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(10000);      //
连接超时10
        factory.setReadTimeout(120000);         // 读取超时120秒(大模型推理可能较长)
        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);
    }
}
```

### OpenAI
兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务,封装一个统一的客户端:

```java
package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.Data;
import lombok.Builder;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

import java.util.List;
import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;
    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {
        this.restTemplate = restTemplate;
        this.baseUrl = baseUrl;
    }

    /**
     *
聊天补全
     */
    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(
            url,
            HttpMethod.POST,
            entity,
            ChatCompletionResponse.class
        );

        return response.getBody();
    }

    /**
     *
流式聊天补全
     */
    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(
                url,
                HttpMethod.POST,
                entity,
                new ParameterizedTypeReference<ServerSentEvent<String>>() {}
            )
            .getBody()
            .stream()
            .map(ServerSentEvent::data);
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionRequest {
        private String model;
        private List<Message> messages;
        private Double temperature;
        @JsonProperty("max_tokens")
        private Integer maxTokens;
        @JsonProperty("top_p")
        private Double topP;
        private Boolean stream;
        private List<String> stop;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Message {
        private String role;
        private String content;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionResponse {
        private String id;
        private String object;
        private Long created;
        private String model;
        private List<Choice> choices;
        private Usage usage;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Choice {
        private Integer index;
        private Message message;
        @JsonProperty("finish_reason")
        private String finishReason;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Usage {
        @JsonProperty("prompt_tokens")
        private Integer promptTokens;
        @JsonProperty("completion_tokens")
        private Integer completionTokens;
        @JsonProperty("total_tokens")
        private Integer totalTokens;
    }
}
```

### Spring Boot
服务实现

```java
package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;
import com.example.llm.client.LLMClient.ChatCompletionResponse;
import com.example.llm.client.LLMClient.LLMClient;
import com.example.llm.client.LLMClient.Message;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@RequiredArgsConstructor
@Slf4j
public class LLMService {

    private final LLMClient llmClient;

    /**
     *
同步对话
     */
    public String chat(String userMessage) {
        return chat(userMessage, null);
    }

    /**
     *
同步对话(带系统提示)
     */
    public String chat(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {
            ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {
                return response.getChoices().get(0).getMessage().getContent();
            }
        } catch (Exception e) {
            log.error("LLM
调用失败", e);
            throw new RuntimeException("LLM
服务调用失败: " + e.getMessage(), e);
        }

        return null;
    }

    /**
     *
流式对话
     */
    public Flux<String> chatStream(String userMessage) {
        return chatStream(userMessage, null);
    }

    /**
     *
流式对话(带系统提示)
     */
    public Flux<String> chatStream(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)
            .filter(data -> !data.startsWith("[DONE]"))
            .map(data -> {
                //
解析SSE数据
                if (data.startsWith("data:")) {
                    data = data.substring(5).trim();
                }
                return data;
            });
    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {
        List<Message> messages = Arrays.asList(
            Message.builder()
                .role("system")
                .content(systemPrompt != null ? systemPrompt : "
你是一个有帮助的AI助手。")
                .build(),
            Message.builder()
                .role("user")
                .content(userMessage)
                .build()
        );

        return ChatCompletionRequest.builder()
            .model("Qwen/Qwen-7B-Chat")
            .messages(messages)
            .temperature(0.7)
            .maxTokens(2048)
            .stream(stream)
            .build();
    }

    /**
     *
异步对话
     */
    public CompletableFuture<String> chatAsync(String userMessage) {
        return CompletableFuture.supplyAsync(() -> chat(userMessage))
            .orTimeout(120, TimeUnit.SECONDS);
    }

    /**
     *
批量对话
     */
    public List<String> batchChat(List<String> messages) {
        return messages.parallelStream()
            .map(this::chat)
            .toList();
    }
}
```

###
控制器实现

```java
package com.example.llm.controller;

import com.example.llm.service.LLMService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
@RequiredArgsConstructor
@Slf4j
public class LLMController {

    private final LLMService llmService;

    /**
     *
同步对话接口
     */
    @PostMapping("/chat")
    public Map<String, Object> chat(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("
收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();
        result.put("code", 200);
        result.put("message", "success");
        result.put("data", response);

        return result;
    }

    /**
     *
流式对话接口
     */
    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("
收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)
            .map(data -> ServerSentEvent.<String>builder()
                .data(data)
                .build()
            )
            .onErrorResume(e -> {
                log.error("
流式对话异常", e);
                return Mono.just(ServerSentEvent.<String>builder()
                    .data("[ERROR]" + e.getMessage())
                    .build());
            })
            .concatWith(Mono.just(ServerSentEvent.<String>builder()
                .data("[DONE]")
                .build()));
    }

    /**
     *
异步对话接口
     */
    @PostMapping("/chat/async")
    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");

        log.info("
收到异步聊天请求: {}", userMessage);

        //
立即返回,后续异步处理
        llmService.chatAsync(userMessage)
            .thenAccept(response -> {
                log.info("
异步任务完成: {}", response);
                //
这里可以添加回调逻辑,如发送WebSocket消息
            })
            .exceptionally(e -> {
                log.error("
异步任务失败", e);
                return null;
            });

        Map<String, Object> result = new HashMap<>();
        result.put("code", 202);
        result.put("message", "
请求已接受,正在异步处理");

        return result;
    }
}
```

###
熔断与重试配置

```java
package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
@Slf4j
public class ResilienceConfig {

    @Bean
    public CircuitBreaker llmCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    //
失败率阈值50%
            .slowCallRateThreshold(80)                    //
慢调用率阈值80%
            .slowCallDurationThreshold(Duration.ofSeconds(30))  //
慢调用阈值30
            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟
            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                        //
滑动窗口大小
            .minimumNumberOfCalls(5)                      // 最小调用次数
            .build();

        return CircuitBreaker.of("llmService", config);
    }

    @Bean
    public Retry llmRetry() {
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)                               //
最大重试次数
            .waitDuration(Duration.ofSeconds(2))          // 重试间隔
            .retryExceptions(Exception.class)             // 重试的异常类型
            .ignoreExceptions(IllegalArgumentException.class)  // 忽略的异常
            .build();

        return Retry.of("llmService", config);
    }
}
```

###
应用配置

```yaml
# application.yml
spring:
  application:
    name: llm-spring-boot-client

server:
  port: 8080

llm:
  #
大模型服务地址
  base-url: http://localhost:80
  #
连接超时(毫秒)
  connect-timeout: 10000
  #
读取超时(毫秒)
  read-timeout: 120000

#
熔断器配置
resilience4j:
  circuitbreaker:
    configs:
      default:
        failureRateThreshold: 50
        waitDurationInOpenState: 60s
    instances:
      llmService:
        registerHealthIndicator: true
        slidingWindowSize: 10

#
日志配置
logging:
  level:
    com.example.llm: DEBUG
    org.springframework.web.client: DEBUG
```

### Docker
Spring Boot应用

如果需要将Spring Boot应用也容器化部署,可以编写Dockerfile

```dockerfile
# Multi-stage build
FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

#
复制Maven配置
COPY pom.xml .
COPY src ./src

#
下载依赖(利用缓存)
RUN ./mvnw dependency:go-offline -B

#
构建
RUN ./mvnw package -DskipTests -B

#
运行阶段
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

#
安全配置
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
RUN chown -R appuser:appgroup /app

#
复制jar
COPY --from=builder /app/target/*.jar app.jar

#
复制模型缓存(可选,共享HuggingFace缓存)
VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]
```


相应的docker-compose配置:

```yaml
version: '3.8'

services:
  spring-boot-app:
    build: ./spring-boot-app
    container_name: spring-boot-llm-client
    ports:
      - "8080:8080"
    environment:
      - LLM_BASE_URL=http://qwen7b-gateway:80
      - SPRING_PROFILES_ACTIVE=production
    depends_on:
      - api-gateway
    networks:
      - qwen-net

  api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - qwen7b-inference
    networks:
      - qwen-net

  qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'
    networks:
      - qwen-net

networks:
  qwen-net:
    driver: bridge

volumes:
  model_cache:
```

##
总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍,我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾:**

1. **
环境配置**:正确安装nvidia-container-toolkitGPU容器化的前提,CUDA版本兼容性需要仔细规划。

2. **镜像制作**:选择合适的Base Image,合理利用多阶段构建减小镜像体积,通过层缓存优化构建速度。

3. **服务编排**Docker Compose适合单主机多容器编排,Kubernetes适合大规模生产部署。

4. **性能优化**:合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**:以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**:通过封装统一的API客户端,Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展,大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

2

集群模式部署配置

对于大规模推理服务,需要支持多实例部署和负载均衡。以下是一个支持水平扩展的集群部署配置:


### 环境准备

在开始部署前,需要确保环境满足以下要求:

**硬件要求:** NVIDIA GPU(至少16GB显存,推荐RTX 4080或更高),8核以上CPU32GB系统内存,100GB可用磁盘空间。

**软件要求:** Ubuntu 20.04+CentOS 8+NVIDIA Driver 535+Docker 24+NVIDIA Container Toolkit

**环境验证:**

```bash
#
检查NVIDIA驱动
nvidia-smi
#
预期输出应显示GPU型号和驱动版本

# 检查Docker
docker --version
#
预期输出:Docker version 24.x.x

#
检查nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
#
预期输出应显示GPU信息
```

###
编写docker-compose配置

创建项目目录并编写docker-compose.yml

```bash
mkdir -p ~/llm-deploy/qwen-7b
cd ~/llm-deploy/qwen-7b
```

```yaml
# docker-compose.yml
version: '3.8'

services:
  # Qwen-7B
推理服务
  qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    hostname: qwen7b-inference
    ports:
      - "8000:8000"
    environment:
      #
模型配置
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - TOKENIZER=Qwen/Qwen-7B-Chat
      #
性能优化
      - GPU_MEMORY_UTILIZATION=0.9
      - MAX_MODEL_LEN=8192
      - MAX_NUM_BATCHED_TOKENS=8192
      - MAX_NUM_SEQUENCES=256
      #
推理优化
      - ENABLE_CHUNKED_PREFILL=true
      - ENABLE_PREFIX_CACHING=true
      # GPU
配置
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=all
      #
日志配置
      - LOG_LEVEL=INFO
    volumes:
      # HuggingFace
模型缓存(避免重复下载)
      - model_cache:/root/.cache/huggingface
      # vLLM
服务日志
      - ./logs:/var/log/vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'  #
共享内存(对vLLM重要)
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - qwen-net

  # API
网关(Nginx
  api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      qwen7b-inference:
        condition: service_healthy
    restart: unless-stopped
    networks:
      - qwen-net

  # Prometheus
监控
  prometheus:
    image: prom/prometheus:latest
    container_name: qwen7b-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped
    networks:
      - qwen-net

  # Grafana
可视化
  grafana:
    image: grafana/grafana:latest
    container_name: qwen7b-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - qwen-net

networks:
  qwen-net:
    driver: bridge

volumes:
  model_cache:
```

### Nginx
配置

```bash
mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs
```

```nginx
# nginx/nginx.conf
events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    #
日志格式
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log warn;

    #
性能优化
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Gzip
压缩
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    #
限流配置
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {
        server qwen7b-inference:8000;
        keepalive 64;
    }

    server {
        listen 80;
        server_name _;

        #
健康检查
        location /health {
            access_log off;
            return 200 "OK";
            add_header Content-Type text/plain;
        }

        # OpenAI
兼容API - Chat Completions
        location /v1/chat/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            #
代理头设置
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            #
流式响应(关键配置)
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;

            # SSE
必需配置
            proxy_set_header Connection '';
            chunked_transfer_encoding on;

            #
记录响应时间
            add_header X-Response-Time $upstream_response_time;
        }

        # OpenAI
兼容API - Completions
        location /v1/completions {
            limit_req zone=api_limit burst=50 nodelay;
            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 300s;
            proxy_set_header Connection '';
            chunked_transfer_encoding on;
        }

        #
模型列表
        location /v1/models {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # vLLM
直接接口
        location /v1 {
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }

        # Prometheus
指标(仅内网访问)
        location /metrics {
            proxy_pass http://prometheus:9090;
            proxy_set_header Host $host;
            allow 127.0.0.1;
            allow 10.0.0.0/8;
            allow 172.16.0.0/12;
            deny all;
        }
    }
}
```

### Prometheus
配置

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['qwen7b-inference:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s
```

###
启动服务

```bash
#
进入项目目录
cd ~/llm-deploy/qwen-7b

#
拉取镜像(提前完成,节省首次启动时间)
docker-compose pull

#
启动服务(后台运行)
docker-compose up -d

#
查看服务状态
docker-compose ps

#
查看日志
docker-compose logs -f qwen7b-inference

#
等待服务就绪后测试
# 首次启动需要下载模型,可能需要几分钟
sleep 120

#
健康检查
curl http://localhost:8000/health

#
模型列表
curl http://localhost:8000/v1/models

#
测试推理
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-7B-Chat",
    "messages": [
      {"role": "system", "content": "
你是一个有帮助的AI助手。"},
      {"role": "user", "content": "
请介绍一下Docker容器化技术的优势。"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

###
性能测试

```bash
#
安装测试工具
pip install locust

#
创建负载测试脚本 locustfile.py
cat > locustfile.py << 'EOF'
from locust import HttpUser, task, between
import json

class LLMUser(HttpUser):
    wait_time = between(1, 3)
    host = "http://localhost"

    @task(3)
    def chat_completion(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "system", "content": "
你是一个有帮助的AI助手。"},
                {"role": "user", "content": "
请用100字介绍一下人工智能。"}
            ],
            "max_tokens": 200,
            "temperature": 0.7,
            "stream": False
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")

    @task(1)
    def chat_completion_stream(self):
        payload = {
            "model": "Qwen/Qwen-7B-Chat",
            "messages": [
                {"role": "user", "content": "
解释一下什么是机器学习。"}
            ],
            "max_tokens": 256,
            "temperature": 0.7,
            "stream": True
        }
        with self.client.post(
            "/v1/chat/completions",
            json=payload,
            catch_response=True
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with status {response.status_code}")
EOF

#
运行负载测试(10个并发用户,持续60秒)
locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

# Web
界面测试(可选)
locust -f locustfile.py --web-port 8089
```

##
容器化后的Spring Boot对接方案

### 概述

在企业级应用中,Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后,如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

### RestTemplate对接方案

最直接的对接方式是使用SpringRestTemplateWebClient调用大模型API

```java
package com.example.llm.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(10000);      //
连接超时10
        factory.setReadTimeout(120000);         // 读取超时120秒(大模型推理可能较长)
        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);
    }
}
```

### OpenAI
兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务,封装一个统一的客户端:

```java
package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;
import lombok.Data;
import lombok.Builder;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;

import java.util.List;
import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;
    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {
        this.restTemplate = restTemplate;
        this.baseUrl = baseUrl;
    }

    /**
     *
聊天补全
     */
    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(
            url,
            HttpMethod.POST,
            entity,
            ChatCompletionResponse.class
        );

        return response.getBody();
    }

    /**
     *
流式聊天补全
     */
    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {
        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(
                url,
                HttpMethod.POST,
                entity,
                new ParameterizedTypeReference<ServerSentEvent<String>>() {}
            )
            .getBody()
            .stream()
            .map(ServerSentEvent::data);
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionRequest {
        private String model;
        private List<Message> messages;
        private Double temperature;
        @JsonProperty("max_tokens")
        private Integer maxTokens;
        @JsonProperty("top_p")
        private Double topP;
        private Boolean stream;
        private List<String> stop;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Message {
        private String role;
        private String content;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class ChatCompletionResponse {
        private String id;
        private String object;
        private Long created;
        private String model;
        private List<Choice> choices;
        private Usage usage;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Choice {
        private Integer index;
        private Message message;
        @JsonProperty("finish_reason")
        private String finishReason;
    }

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Usage {
        @JsonProperty("prompt_tokens")
        private Integer promptTokens;
        @JsonProperty("completion_tokens")
        private Integer completionTokens;
        @JsonProperty("total_tokens")
        private Integer totalTokens;
    }
}
```

### Spring Boot
服务实现

```java
package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;
import com.example.llm.client.LLMClient.ChatCompletionResponse;
import com.example.llm.client.LLMClient.LLMClient;
import com.example.llm.client.LLMClient.Message;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@RequiredArgsConstructor
@Slf4j
public class LLMService {

    private final LLMClient llmClient;

    /**
     *
同步对话
     */
    public String chat(String userMessage) {
        return chat(userMessage, null);
    }

    /**
     *
同步对话(带系统提示)
     */
    public String chat(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {
            ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {
                return response.getChoices().get(0).getMessage().getContent();
            }
        } catch (Exception e) {
            log.error("LLM
调用失败", e);
            throw new RuntimeException("LLM
服务调用失败: " + e.getMessage(), e);
        }

        return null;
    }

    /**
     *
流式对话
     */
    public Flux<String> chatStream(String userMessage) {
        return chatStream(userMessage, null);
    }

    /**
     *
流式对话(带系统提示)
     */
    public Flux<String> chatStream(String userMessage, String systemPrompt) {
        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)
            .filter(data -> !data.startsWith("[DONE]"))
            .map(data -> {
                //
解析SSE数据
                if (data.startsWith("data:")) {
                    data = data.substring(5).trim();
                }
                return data;
            });
    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {
        List<Message> messages = Arrays.asList(
            Message.builder()
                .role("system")
                .content(systemPrompt != null ? systemPrompt : "
你是一个有帮助的AI助手。")
                .build(),
            Message.builder()
                .role("user")
                .content(userMessage)
                .build()
        );

        return ChatCompletionRequest.builder()
            .model("Qwen/Qwen-7B-Chat")
            .messages(messages)
            .temperature(0.7)
            .maxTokens(2048)
            .stream(stream)
            .build();
    }

    /**
     *
异步对话
     */
    public CompletableFuture<String> chatAsync(String userMessage) {
        return CompletableFuture.supplyAsync(() -> chat(userMessage))
            .orTimeout(120, TimeUnit.SECONDS);
    }

    /**
     *
批量对话
     */
    public List<String> batchChat(List<String> messages) {
        return messages.parallelStream()
            .map(this::chat)
            .toList();
    }
}
```

###
控制器实现

```java
package com.example.llm.controller;

import com.example.llm.service.LLMService;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
@RequiredArgsConstructor
@Slf4j
public class LLMController {

    private final LLMService llmService;

    /**
     *
同步对话接口
     */
    @PostMapping("/chat")
    public Map<String, Object> chat(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("
收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();
        result.put("code", 200);
        result.put("message", "success");
        result.put("data", response);

        return result;
    }

    /**
     *
流式对话接口
     */
    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");
        String systemPrompt = request.get("systemPrompt");

        log.info("
收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)
            .map(data -> ServerSentEvent.<String>builder()
                .data(data)
                .build()
            )
            .onErrorResume(e -> {
                log.error("
流式对话异常", e);
                return Mono.just(ServerSentEvent.<String>builder()
                    .data("[ERROR]" + e.getMessage())
                    .build());
            })
            .concatWith(Mono.just(ServerSentEvent.<String>builder()
                .data("[DONE]")
                .build()));
    }

    /**
     *
异步对话接口
     */
    @PostMapping("/chat/async")
    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {
        String userMessage = request.get("message");

        log.info("
收到异步聊天请求: {}", userMessage);

        //
立即返回,后续异步处理
        llmService.chatAsync(userMessage)
            .thenAccept(response -> {
                log.info("
异步任务完成: {}", response);
                //
这里可以添加回调逻辑,如发送WebSocket消息
            })
            .exceptionally(e -> {
                log.error("
异步任务失败", e);
                return null;
            });

        Map<String, Object> result = new HashMap<>();
        result.put("code", 202);
        result.put("message", "
请求已接受,正在异步处理");

        return result;
    }
}
```

###
熔断与重试配置

```java
package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
@Slf4j
public class ResilienceConfig {

    @Bean
    public CircuitBreaker llmCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    //
失败率阈值50%
            .slowCallRateThreshold(80)                    //
慢调用率阈值80%
            .slowCallDurationThreshold(Duration.ofSeconds(30))  //
慢调用阈值30
            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟
            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                        //
滑动窗口大小
            .minimumNumberOfCalls(5)                      // 最小调用次数
            .build();

        return CircuitBreaker.of("llmService", config);
    }

    @Bean
    public Retry llmRetry() {
        RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)                               //
最大重试次数
            .waitDuration(Duration.ofSeconds(2))          // 重试间隔
            .retryExceptions(Exception.class)             // 重试的异常类型
            .ignoreExceptions(IllegalArgumentException.class)  // 忽略的异常
            .build();

        return Retry.of("llmService", config);
    }
}
```

###
应用配置

```yaml
# application.yml
spring:
  application:
    name: llm-spring-boot-client

server:
  port: 8080

llm:
  #
大模型服务地址
  base-url: http://localhost:80
  #
连接超时(毫秒)
  connect-timeout: 10000
  #
读取超时(毫秒)
  read-timeout: 120000

#
熔断器配置
resilience4j:
  circuitbreaker:
    configs:
      default:
        failureRateThreshold: 50
        waitDurationInOpenState: 60s
    instances:
      llmService:
        registerHealthIndicator: true
        slidingWindowSize: 10

#
日志配置
logging:
  level:
    com.example.llm: DEBUG
    org.springframework.web.client: DEBUG
```

### Docker
Spring Boot应用

如果需要将Spring Boot应用也容器化部署,可以编写Dockerfile

```dockerfile
# Multi-stage build
FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

#
复制Maven配置
COPY pom.xml .
COPY src ./src

#
下载依赖(利用缓存)
RUN ./mvnw dependency:go-offline -B

#
构建
RUN ./mvnw package -DskipTests -B

#
运行阶段
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

#
安全配置
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
RUN chown -R appuser:appgroup /app

#
复制jar
COPY --from=builder /app/target/*.jar app.jar

#
复制模型缓存(可选,共享HuggingFace缓存)
VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]
```


相应的docker-compose配置:

```yaml
version: '3.8'

services:
  spring-boot-app:
    build: ./spring-boot-app
    container_name: spring-boot-llm-client
    ports:
      - "8080:8080"
    environment:
      - LLM_BASE_URL=http://qwen7b-gateway:80
      - SPRING_PROFILES_ACTIVE=production
    depends_on:
      - api-gateway
    networks:
      - qwen-net

  api-gateway:
    image: nginx:alpine
    container_name: qwen7b-gateway
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - qwen7b-inference
    networks:
      - qwen-net

  qwen7b-inference:
    image: vllm/vllm-openai:latest
    container_name: qwen7b-inference
    environment:
      - MODEL_NAME=Qwen/Qwen-7B-Chat
      - GPU_MEMORY_UTILIZATION=0.9
    volumes:
      - model_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '16gb'
    networks:
      - qwen-net

networks:
  qwen-net:
    driver: bridge

volumes:
  model_cache:
```

##
总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍,我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾:**

1. **
环境配置**:正确安装nvidia-container-toolkitGPU容器化的前提,CUDA版本兼容性需要仔细规划。

2. **镜像制作**:选择合适的Base Image,合理利用多阶段构建减小镜像体积,通过层缓存优化构建速度。

3. **服务编排**Docker Compose适合单主机多容器编排,Kubernetes适合大规模生产部署。

4. **性能优化**:合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**:以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**:通过封装统一的API客户端,Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展,大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

4

环境准备

在开始部署前,需要确保环境满足以下要求:

**硬件要求:** NVIDIA GPU(至少16GB显存,推荐RTX 4080或更高),8核以上CPU,32GB系统内存,100GB可用磁盘空间。

**软件要求:** Ubuntu 20.04+或CentOS 8+,NVIDIA Driver 535+,Docker 24+,NVIDIA Container Toolkit。

**环境验证:**

```bash

检查NVIDIA驱动

nvidia-smi

预期输出应显示GPU型号和驱动版本

检查Docker

docker --version

预期输出:Docker version 24.x.x

检查nvidia-container-toolkit

docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

预期输出应显示GPU信息

```

编写docker-compose配置

创建项目目录并编写docker-compose.yml:

```bash

mkdir -p ~/llm-deploy/qwen-7b

cd ~/llm-deploy/qwen-7b

```

```yaml

docker-compose.yml

version: '3.8'

services:

  # Qwen-7B推理服务

  qwen7b-inference:

    image: vllm/vllm-openai:latest

    container_name: qwen7b-inference

    hostname: qwen7b-inference

    ports:

      - "8000:8000"

    environment:

      # 模型配置

      - MODEL_NAME=Qwen/Qwen-7B-Chat

      - TOKENIZER=Qwen/Qwen-7B-Chat

      # 性能优化

      - GPU_MEMORY_UTILIZATION=0.9

      - MAX_MODEL_LEN=8192

      - MAX_NUM_BATCHED_TOKENS=8192

      - MAX_NUM_SEQUENCES=256

      # 推理优化

      - ENABLE_CHUNKED_PREFILL=true

      - ENABLE_PREFIX_CACHING=true

      # GPU配置

      - NVIDIA_VISIBLE_DEVICES=all

      - CUDA_VISIBLE_DEVICES=all

      # 日志配置

      - LOG_LEVEL=INFO

    volumes:

      # HuggingFace模型缓存(避免重复下载)

      - model_cache:/root/.cache/huggingface

      # vLLM服务日志

      - ./logs:/var/log/vllm

    deploy:

      resources:

        reservations:

          devices:

            - driver: nvidia

              count: 1

              capabilities: [gpu]

    shm_size: '16gb'  # 共享内存(对vLLM重要)

    restart: unless-stopped

    healthcheck:

      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]

      interval: 30s

      timeout: 10s

      retries: 3

      start_period: 300s

    networks:

      - qwen-net

  # API网关(Nginx)

  api-gateway:

    image: nginx:alpine

    container_name: qwen7b-gateway

    ports:

      - "80:80"

      - "443:443"

    volumes:

      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro

      - ./nginx/ssl:/etc/nginx/ssl:ro

    depends_on:

      qwen7b-inference:

        condition: service_healthy

    restart: unless-stopped

    networks:

      - qwen-net

  # Prometheus监控

  prometheus:

    image: prom/prometheus:latest

    container_name: qwen7b-prometheus

    ports:

      - "9090:9090"

    volumes:

      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro

      - ./prometheus/prometheus_data:/prometheus

    command:

      - '--config.file=/etc/prometheus/prometheus.yml'

      - '--storage.tsdb.path=/prometheus'

      - '--web.console.libraries=/etc/prometheus/console_libraries'

      - '--web.console.templates=/etc/prometheus/consoles'

    restart: unless-stopped

    networks:

      - qwen-net

  # Grafana可视化

  grafana:

    image: grafana/grafana:latest

    container_name: qwen7b-grafana

    ports:

      - "3000:3000"

    environment:

      - GF_SECURITY_ADMIN_USER=admin

      - GF_SECURITY_ADMIN_PASSWORD=admin123

      - GF_USERS_ALLOW_SIGN_UP=false

    volumes:

      - ./grafana/provisioning:/etc/grafana/provisioning:ro

      - ./grafana/dashboards:/var/lib/grafana/dashboards

      - ./grafana/data:/var/lib/grafana

    depends_on:

      - prometheus

    restart: unless-stopped

    networks:

      - qwen-net

networks:

  qwen-net:

    driver: bridge

volumes:

  model_cache:

```

Nginx配置

```bash

mkdir -p nginx ssl prometheus grafana/provisioning/datasources grafana/provisioning/dashboards grafana/dashboards logs

```

```nginx

nginx/nginx.conf

events {

    worker_connections 1024;

}

http {

    include       /etc/nginx/mime.types;

    default_type  application/octet-stream;

    # 日志格式

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '

                    '$status $body_bytes_sent "$http_referer" '

                    '"$http_user_agent" "$http_x_forwarded_for" '

                    'rt=$request_time uct="$upstream_connect_time" '

                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    access_log /var/log/nginx/access.log main;

    error_log /var/log/nginx/error.log warn;

    # 性能优化

    sendfile on;

    tcp_nopush on;

    tcp_nodelay on;

    keepalive_timeout 65;

    types_hash_max_size 2048;

    # Gzip压缩

    gzip on;

    gzip_vary on;

    gzip_proxied any;

    gzip_comp_level 6;

    gzip_types text/plain text/css text/xml application/json application/javascript application/rss+xml application/atom+xml image/svg+xml;

    # 限流配置

    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/s;

    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

    upstream vllm_backend {

        server qwen7b-inference:8000;

        keepalive 64;

    }

    server {

        listen 80;

        server_name _;

        # 健康检查

        location /health {

            access_log off;

            return 200 "OK";

            add_header Content-Type text/plain;

        }

        # OpenAI兼容API - Chat Completions

        location /v1/chat/completions {

            limit_req zone=api_limit burst=50 nodelay;

            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;

            proxy_http_version 1.1;

            # 代理头设置

            proxy_set_header Host $host;

            proxy_set_header X-Real-IP $remote_addr;

            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

            proxy_set_header X-Forwarded-Proto $scheme;

            # 流式响应(关键配置)

            proxy_buffering off;

            proxy_cache off;

            proxy_read_timeout 300s;

            proxy_connect_timeout 60s;

            proxy_send_timeout 300s;

            # SSE必需配置

            proxy_set_header Connection '';

            chunked_transfer_encoding on;

            # 记录响应时间

            add_header X-Response-Time $upstream_response_time;

        }

        # OpenAI兼容API - Completions

        location /v1/completions {

            limit_req zone=api_limit burst=50 nodelay;

            limit_conn conn_limit 10;

            proxy_pass http://vllm_backend;

            proxy_http_version 1.1;

            proxy_set_header Host $host;

            proxy_set_header X-Real-IP $remote_addr;

            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;

            proxy_cache off;

            proxy_read_timeout 300s;

            proxy_set_header Connection '';

            chunked_transfer_encoding on;

        }

        # 模型列表

        location /v1/models {

            proxy_pass http://vllm_backend;

            proxy_http_version 1.1;

            proxy_set_header Host $host;

            proxy_set_header X-Real-IP $remote_addr;

        }

        # vLLM直接接口

        location /v1 {

            proxy_pass http://vllm_backend;

            proxy_http_version 1.1;

            proxy_set_header Host $host;

            proxy_set_header X-Real-IP $remote_addr;

            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

            proxy_set_header X-Forwarded-Proto $scheme;

        }

        # Prometheus指标(仅内网访问)

        location /metrics {

            proxy_pass http://prometheus:9090;

            proxy_set_header Host $host;

            allow 127.0.0.1;

            allow 10.0.0.0/8;

            allow 172.16.0.0/12;

            deny all;

        }

    }

}

```

Prometheus配置

```yaml

prometheus/prometheus.yml

global:

  scrape_interval: 15s

  evaluation_interval: 15s

scrape_configs:

  - job_name: 'vllm'

    static_configs:

      - targets: ['qwen7b-inference:8000']

    metrics_path: '/metrics'

    scrape_interval: 10s

```

启动服务

```bash

进入项目目录

cd ~/llm-deploy/qwen-7b

拉取镜像(提前完成,节省首次启动时间)

docker-compose pull

启动服务(后台运行)

docker-compose up -d

查看服务状态

docker-compose ps

查看日志

docker-compose logs -f qwen7b-inference

等待服务就绪后测试

首次启动需要下载模型,可能需要几分钟

sleep 120

健康检查

curl http://localhost:8000/health

模型列表

curl http://localhost:8000/v1/models

测试推理

curl -X POST http://localhost:8000/v1/chat/completions \

  -H "Content-Type: application/json" \

  -d '{

    "model": "Qwen/Qwen-7B-Chat",

    "messages": [

      {"role": "system", "content": "你是一个有帮助的AI助手。"},

      {"role": "user", "content": "请介绍一下Docker容器化技术的优势。"}

    ],

    "max_tokens": 512,

    "temperature": 0.7

  }'

```

性能测试

```bash

安装测试工具

pip install locust

创建负载测试脚本 locustfile.py

cat > locustfile.py << 'EOF'

from locust import HttpUser, task, between

import json

class LLMUser(HttpUser):

    wait_time = between(1, 3)

    host = "http://localhost"

    @task(3)

    def chat_completion(self):

        payload = {

            "model": "Qwen/Qwen-7B-Chat",

            "messages": [

                {"role": "system", "content": "你是一个有帮助的AI助手。"},

                {"role": "user", "content": "请用100字介绍一下人工智能。"}

            ],

            "max_tokens": 200,

            "temperature": 0.7,

            "stream": False

        }

        with self.client.post(

            "/v1/chat/completions",

            json=payload,

            catch_response=True

        ) as response:

            if response.status_code == 200:

                response.success()

            else:

                response.failure(f"Failed with status {response.status_code}")

    @task(1)

    def chat_completion_stream(self):

        payload = {

            "model": "Qwen/Qwen-7B-Chat",

            "messages": [

                {"role": "user", "content": "解释一下什么是机器学习。"}

            ],

            "max_tokens": 256,

            "temperature": 0.7,

            "stream": True

        }

        with self.client.post(

            "/v1/chat/completions",

            json=payload,

            catch_response=True

        ) as response:

            if response.status_code == 200:

                response.success()

            else:

                response.failure(f"Failed with status {response.status_code}")

EOF

运行负载测试(10个并发用户,持续60秒)

locust -f locustfile.py --headless -u 10 -r 2 -t 60s --csv results

Web界面测试(可选)

locust -f locustfile.py --web-port 8089

```

容器化后的Spring Boot对接方案

概述

在企业级应用中,Java/Spring Boot仍然是主流的后端开发框架。将大模型推理服务容器化部署后,如何在Spring Boot应用中对接使用是开发者普遍关心的问题。

RestTemplate对接方案

最直接的对接方式是使用Spring的RestTemplate或WebClient调用大模型API:

```java

package com.example.llm.config;

import org.springframework.context.annotation.Bean;

import org.springframework.context.annotation.Configuration;

import org.springframework.http.client.SimpleClientHttpRequestFactory;

import org.springframework.web.client.RestTemplate;

@Configuration

public class RestTemplateConfig {

    @Bean

    public RestTemplate restTemplate() {

        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();

        factory.setConnectTimeout(10000);      // 连接超时10秒

        factory.setReadTimeout(120000);         // 读取超时120秒(大模型推理可能较长)

        factory.setBufferRequestBody(true);

        return new RestTemplate(factory);

    }

}

```

OpenAI兼容客户端封装

为了方便对接支持OpenAI协议的大模型服务,封装一个统一的客户端:

```java

package com.example.llm.client;

import com.fasterxml.jackson.annotation.JsonProperty;

import lombok.Data;

import lombok.Builder;

import lombok.AllArgsConstructor;

import lombok.NoArgsConstructor;

import java.util.List;

import java.util.Map;

public class LLMClient {

    private final RestTemplate restTemplate;

    private final String baseUrl;

    public LLMClient(RestTemplate restTemplate, String baseUrl) {

        this.restTemplate = restTemplate;

        this.baseUrl = baseUrl;

    }

    /**

     * 聊天补全

     */

    public ChatCompletionResponse chatCompletion(ChatCompletionRequest request) {

        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();

        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ResponseEntity<ChatCompletionResponse> response = restTemplate.exchange(

            url,

            HttpMethod.POST,

            entity,

            ChatCompletionResponse.class

        );

        return response.getBody();

    }

    /**

     * 流式聊天补全

     */

    public Flux<String> chatCompletionStream(ChatCompletionRequest request) {

        String url = baseUrl + "/v1/chat/completions";

        HttpHeaders headers = new HttpHeaders();

        headers.setContentType(MediaType.APPLICATION_JSON);

        headers.setAccept(MediaType.TEXT_EVENT_STREAM);

        HttpEntity<ChatCompletionRequest> entity = new HttpEntity<>(request, headers);

        ParameterizedTypeReference<ServerSentEvent<String>> type = new ParameterizedTypeReference<>() {};

        return restTemplate.exchange(

                url,

                HttpMethod.POST,

                entity,

                new ParameterizedTypeReference<ServerSentEvent<String>>() {}

            )

            .getBody()

            .stream()

            .map(ServerSentEvent::data);

    }

    @Data

    @Builder

    @NoArgsConstructor

    @AllArgsConstructor

    public static class ChatCompletionRequest {

        private String model;

        private List<Message> messages;

        private Double temperature;

        @JsonProperty("max_tokens")

        private Integer maxTokens;

        @JsonProperty("top_p")

        private Double topP;

        private Boolean stream;

        private List<String> stop;

    }

    @Data

    @Builder

    @NoArgsConstructor

    @AllArgsConstructor

    public static class Message {

        private String role;

        private String content;

    }

    @Data

    @Builder

    @NoArgsConstructor

    @AllArgsConstructor

    public static class ChatCompletionResponse {

        private String id;

        private String object;

        private Long created;

        private String model;

        private List<Choice> choices;

        private Usage usage;

    }

    @Data

    @Builder

    @NoArgsConstructor

    @AllArgsConstructor

    public static class Choice {

        private Integer index;

        private Message message;

        @JsonProperty("finish_reason")

        private String finishReason;

    }

    @Data

    @Builder

    @NoArgsConstructor

    @AllArgsConstructor

    public static class Usage {

        @JsonProperty("prompt_tokens")

        private Integer promptTokens;

        @JsonProperty("completion_tokens")

        private Integer completionTokens;

        @JsonProperty("total_tokens")

        private Integer totalTokens;

    }

}

```

Spring Boot服务实现

```java

package com.example.llm.service;

import com.example.llm.client.LLMClient.ChatCompletionRequest;

import com.example.llm.client.LLMClient.ChatCompletionResponse;

import com.example.llm.client.LLMClient.LLMClient;

import com.example.llm.client.LLMClient.Message;

import lombok.RequiredArgsConstructor;

import lombok.extern.slf4j.Slf4j;

import org.springframework.stereotype.Service;

import java.util.Arrays;

import java.util.List;

import java.util.concurrent.CompletableFuture;

import java.util.concurrent.TimeUnit;

@Service

@RequiredArgsConstructor

@Slf4j

public class LLMService {

    private final LLMClient llmClient;

    /**

     * 同步对话

     */

    public String chat(String userMessage) {

        return chat(userMessage, null);

    }

    /**

     * 同步对话(带系统提示)

     */

    public String chat(String userMessage, String systemPrompt) {

        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, false);

        try {

            ChatCompletionResponse response = llmClient.chatCompletion(request);

            if (response != null && response.getChoices() != null && !response.getChoices().isEmpty()) {

                return response.getChoices().get(0).getMessage().getContent();

            }

        } catch (Exception e) {

            log.error("LLM调用失败", e);

            throw new RuntimeException("LLM服务调用失败: " + e.getMessage(), e);

        }

        return null;

    }

    /**

     * 流式对话

     */

    public Flux<String> chatStream(String userMessage) {

        return chatStream(userMessage, null);

    }

    /**

     * 流式对话(带系统提示)

     */

    public Flux<String> chatStream(String userMessage, String systemPrompt) {

        ChatCompletionRequest request = buildRequest(userMessage, systemPrompt, true);

        return llmClient.chatCompletionStream(request)

            .filter(data -> !data.startsWith("[DONE]"))

            .map(data -> {

                // 解析SSE数据

                if (data.startsWith("data:")) {

                    data = data.substring(5).trim();

                }

                return data;

            });

    }

    private ChatCompletionRequest buildRequest(String userMessage, String systemPrompt, boolean stream) {

        List<Message> messages = Arrays.asList(

            Message.builder()

                .role("system")

                .content(systemPrompt != null ? systemPrompt : "你是一个有帮助的AI助手。")

                .build(),

            Message.builder()

                .role("user")

                .content(userMessage)

                .build()

        );

        return ChatCompletionRequest.builder()

            .model("Qwen/Qwen-7B-Chat")

            .messages(messages)

            .temperature(0.7)

            .maxTokens(2048)

            .stream(stream)

            .build();

    }

    /**

     * 异步对话

     */

    public CompletableFuture<String> chatAsync(String userMessage) {

        return CompletableFuture.supplyAsync(() -> chat(userMessage))

            .orTimeout(120, TimeUnit.SECONDS);

    }

    /**

     * 批量对话

     */

    public List<String> batchChat(List<String> messages) {

        return messages.parallelStream()

            .map(this::chat)

            .toList();

    }

}

```

控制器实现

```java

package com.example.llm.controller;

import com.example.llm.service.LLMService;

import lombok.RequiredArgsConstructor;

import lombok.extern.slf4j.Slf4j;

import org.springframework.http.MediaType;

import org.springframework.http.codec.ServerSentEvent;

import org.springframework.web.bind.annotation.*;

import reactor.core.publisher.Flux;

import reactor.core.publisher.Mono;

import java.util.HashMap;

import java.util.Map;

@RestController

@RequestMapping("/api/llm")

@RequiredArgsConstructor

@Slf4j

public class LLMController {

    private final LLMService llmService;

    /**

     * 同步对话接口

     */

    @PostMapping("/chat")

    public Map<String, Object> chat(@RequestBody Map<String, String> request) {

        String userMessage = request.get("message");

        String systemPrompt = request.get("systemPrompt");

        log.info("收到聊天请求: {}", userMessage);

        String response = llmService.chat(userMessage, systemPrompt);

        Map<String, Object> result = new HashMap<>();

        result.put("code", 200);

        result.put("message", "success");

        result.put("data", response);

        return result;

    }

    /**

     * 流式对话接口

     */

    @PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)

    public Flux<ServerSentEvent<String>> chatStream(@RequestBody Map<String, String> request) {

        String userMessage = request.get("message");

        String systemPrompt = request.get("systemPrompt");

        log.info("收到流式聊天请求: {}", userMessage);

        return llmService.chatStream(userMessage, systemPrompt)

            .map(data -> ServerSentEvent.<String>builder()

                .data(data)

                .build()

            )

            .onErrorResume(e -> {

                log.error("流式对话异常", e);

                return Mono.just(ServerSentEvent.<String>builder()

                    .data("[ERROR]" + e.getMessage())

                    .build());

            })

            .concatWith(Mono.just(ServerSentEvent.<String>builder()

                .data("[DONE]")

                .build()));

    }

    /**

     * 异步对话接口

     */

    @PostMapping("/chat/async")

    public Map<String, Object> chatAsync(@RequestBody Map<String, String> request) {

        String userMessage = request.get("message");

        log.info("收到异步聊天请求: {}", userMessage);

        // 立即返回,后续异步处理

        llmService.chatAsync(userMessage)

            .thenAccept(response -> {

                log.info("异步任务完成: {}", response);

                // 这里可以添加回调逻辑,如发送WebSocket消息

            })

            .exceptionally(e -> {

                log.error("异步任务失败", e);

                return null;

            });

        Map<String, Object> result = new HashMap<>();

        result.put("code", 202);

        result.put("message", "请求已接受,正在异步处理");

        return result;

    }

}

```

熔断与重试配置

```java

package com.example.llm.config;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import io.github.resilience4j.retry.Retry;

import io.github.resilience4j.retry.RetryConfig;

import lombok.extern.slf4j.Slf4j;

import org.springframework.context.annotation.Bean;

import org.springframework.context.annotation.Configuration;

import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration

@Slf4j

public class ResilienceConfig {

    @Bean

    public CircuitBreaker llmCircuitBreaker() {

        CircuitBreakerConfig config = CircuitBreakerConfig.custom()

            .failureRateThreshold(50)                    // 失败率阈值50%

            .slowCallRateThreshold(80)                    // 慢调用率阈值80%

            .slowCallDurationThreshold(Duration.ofSeconds(30))  // 慢调用阈值30秒

            .waitDurationInOpenState(Duration.ofMinutes(1))     // 熔断器开放时间1分钟

            .permittedNumberOfCallsInHalfOpenState(3)    // 半开状态允许调用次数

            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)

            .slidingWindowSize(10)                        // 滑动窗口大小

            .minimumNumberOfCalls(5)                      // 最小调用次数

            .build();

        return CircuitBreaker.of("llmService", config);

    }

    @Bean

    public Retry llmRetry() {

        RetryConfig config = RetryConfig.custom()

            .maxAttempts(3)                               // 最大重试次数

            .waitDuration(Duration.ofSeconds(2))          // 重试间隔

            .retryExceptions(Exception.class)             // 重试的异常类型

            .ignoreExceptions(IllegalArgumentException.class)  // 忽略的异常

            .build();

        return Retry.of("llmService", config);

    }

}

```

应用配置

```yaml

application.yml

spring:

  application:

    name: llm-spring-boot-client

server:

  port: 8080

llm:

  # 大模型服务地址

  base-url: http://localhost:80

  # 连接超时(毫秒)

  connect-timeout: 10000

  # 读取超时(毫秒)

  read-timeout: 120000

熔断器配置

resilience4j:

  circuitbreaker:

    configs:

      default:

        failureRateThreshold: 50

        waitDurationInOpenState: 60s

    instances:

      llmService:

        registerHealthIndicator: true

        slidingWindowSize: 10

日志配置

logging:

  level:

    com.example.llm: DEBUG

    org.springframework.web.client: DEBUG

```

DockerSpring Boot应用

如果需要将Spring Boot应用也容器化部署,可以编写Dockerfile:

```dockerfile

Multi-stage build

FROM eclipse-temurin:17-jdk-alpine AS builder

WORKDIR /app

复制Maven配置

COPY pom.xml .

COPY src ./src

下载依赖(利用缓存)

RUN ./mvnw dependency:go-offline -B

构建

RUN ./mvnw package -DskipTests -B

运行阶段

FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

安全配置

RUN addgroup -S appgroup && adduser -S appuser -G appgroup

RUN chown -R appuser:appgroup /app

复制jar包

COPY --from=builder /app/target/*.jar app.jar

复制模型缓存(可选,共享HuggingFace缓存)

VOLUME ["/root/.cache/huggingface"]

USER appuser

EXPOSE 8080

ENTRYPOINT ["java", "-jar", "-Xms2g", "-Xmx4g", "-XX:+UseG1GC", "app.jar"]

```

相应的docker-compose配置:

```yaml

version: '3.8'

services:

  spring-boot-app:

    build: ./spring-boot-app

    container_name: spring-boot-llm-client

    ports:

      - "8080:8080"

    environment:

      - LLM_BASE_URL=http://qwen7b-gateway:80

      - SPRING_PROFILES_ACTIVE=production

    depends_on:

      - api-gateway

    networks:

      - qwen-net

  api-gateway:

    image: nginx:alpine

    container_name: qwen7b-gateway

    ports:

      - "80:80"

    volumes:

      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro

    depends_on:

      - qwen7b-inference

    networks:

      - qwen-net

  qwen7b-inference:

    image: vllm/vllm-openai:latest

    container_name: qwen7b-inference

    environment:

      - MODEL_NAME=Qwen/Qwen-7B-Chat

      - GPU_MEMORY_UTILIZATION=0.9

    volumes:

      - model_cache:/root/.cache/huggingface

    deploy:

      resources:

        reservations:

          devices:

            - driver: nvidia

              count: 1

              capabilities: [gpu]

    shm_size: '16gb'

    networks:

      - qwen-net

networks:

  qwen-net:

    driver: bridge

volumes:

  model_cache:

```

总结

Docker容器化技术为大模型的部署提供了标准化、可复制、易扩展的解决方案。通过本文的详细介绍,我们已经覆盖了从环境配置、镜像制作、服务编排到生产部署的完整流程。

**核心要点回顾:**

1. **环境配置**:正确安装nvidia-container-toolkit是GPU容器化的前提,CUDA版本兼容性需要仔细规划。

2. **镜像制作**:选择合适的Base Image,合理利用多阶段构建减小镜像体积,通过层缓存优化构建速度。

3. **服务编排**:Docker Compose适合单主机多容器编排,Kubernetes适合大规模生产部署。

4. **性能优化**:合理的内存配置、批量推理、流式输出等优化手段可显著提升推理吞吐量。

5. **安全加固**:以非root用户运行、限制容器能力、启用TLS加密、进行安全扫描是基本的安全实践。

6. **Spring Boot对接**:通过封装统一的API客户端,Spring Boot应用可以方便地调用容器化部署的大模型服务。

随着容器化和云原生技术的持续发展,大模型的部署将变得更加简单高效。希望本文能够为你的大模型生产部署提供有价值的参考。

Logo

AtomGit 是由开放原子开源基金会联合 CSDN 等生态伙伴共同推出的新一代开源与人工智能协作平台。平台坚持“开放、中立、公益”的理念,把代码托管、模型共享、数据集托管、智能体开发体验和算力服务整合在一起,为开发者提供从开发、训练到部署的一站式体验。

更多推荐