Qwen Model Deployment Workflow

zbdx不知名菜鸡

Published 2026-03-26 00:30:44

1. Environment setup

conda create -n qwen-server python=3.10 -y
conda activate qwen-server

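# Install vLLM (supports Qwen2.5 and AWQ)
# Quote the version specifiers; unquoted, the shell treats ">" as an output redirect
pip install "vllm>=0.6.0"
pip install "transformers>=4.40.0" accelerate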
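
After installing, it can be worth confirming that the versions pip actually resolved meet the minimums above. The following is a minimal sketch using only the standard library; the helper names (parse_version, meets_minimum, check_environment) are made up for this example, and the version parser is deliberately simple (it only looks at the leading numeric parts).

```python
from importlib import metadata

# Minimum versions taken from the pip commands above.
MINIMUMS = {"vllm": "0.6.0", "transformers": "4.40.0"}


def parse_version(text):
    """Turn '0.6.3.post1' into (0, 6, 3); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare numerically, so that 4.9.0 is correctly older than 4.40.0."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(minimums=MINIMUMS):
    """Return {package: status} for every package in minimums."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

Run it inside the activated qwen-server environment; any line that does not say "ok" means the matching pip command should be rerun.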

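2. Download the model (automatic or manual)

# Option 1: automatic download (the model is pulled automatically at startup)
# Option 2: manual download (recommended; uses ModelScope for faster downloads)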
pip install modelscope
python -c "
from modelscope import snapshot_download
snapshot_download('qwen/Qwen2.5-72B-Instruct-AWQ', cache_dir='./models')
"

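The launch step needs the real on-disk path of the downloaded model. Assuming the directory layout under cache_dir can differ between modelscope versions (an assumption, not something stated in this post), a small sketch like the one below locates the model directory by looking for config.json instead of hard-coding the path. find_model_dirs is a hypothetical helper written for this example.

```python
from pathlib import Path


def find_model_dirs(root="./models"):
    """Return every directory under root that directly contains a config.json."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # rglob walks the tree recursively; the parent of each hit is a model directory.
    return sorted(path.parent for path in root_path.rglob("config.json"))


if __name__ == "__main__":
    for model_dir in find_model_dirs():
        print(model_dir)
```

Whatever path it prints is the value to use for MODEL_PATH in the next step.
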
3. Server launch command (the core step)

export CUDA_VISIBLE_DEVICES=0,1,2
export MODEL_PATH="./models/qwen/Qwen2.5-72B-Instruct-AWQ"  # or a HuggingFace model ID

# --max-num-seqs 16: lower the concurrency cap (256 is too many)
# --enforce-eager:   important, disables CUDA Graphs
# These notes sit above the command because a comment after a trailing "\" breaks the line continuation.
python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --quantization awq \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.93 \
    --max-model-len 16384 \
    --max-num-seqs 16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --served-model-name qwen-72b
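
Loading a 72B model takes a while, so clients that connect too early just get connection errors. As a rough sketch, the script below polls the OpenAI-compatible /v1/models endpoint until it answers and returns the model names the server reports, which is also the name clients must pass as "model". The function names and the timeout values are assumptions chosen for this example.

```python
import json
import time
import urllib.request


def list_served_models(base_url, timeout=5.0):
    """Return the model IDs reported by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_models(base_url, deadline_s=600.0, poll_s=5.0):
    """Poll until the server answers; return None if deadline_s passes first."""
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        try:
            return list_served_models(base_url)
        except (OSError, ValueError):
            # OSError covers refused connections, timeouts and HTTP errors;
            # ValueError covers a response body that is not valid JSON yet.
            time.sleep(poll_s)
    return None


# Usage, once the launch command above is running:
#   print(wait_for_models("http://localhost:8000"))
```

With the launch command above, the returned list should contain qwen-72b, the value given to --served-model-name.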

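Key parameters explained:

--pipeline-parallel-size 3: pipeline parallelism across the 3 GPUs (avoids the requirement that the attention head count be divisible by the tensor-parallel size)
--quantization awq: 4-bit quantization; the 72B model's weights need only about 40GB of VRAM
--host 0.0.0.0: allows other machines on the LAN to connect
--gpu-memory-utilization 0.93: lets vLLM use most of the 72GB of total VRAM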
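
The numbers behind these parameters can be checked with some quick arithmetic. The sketch below assumes Qwen2.5-72B has 64 attention heads and roughly 72e9 parameters (figures not given in this post), and uses 0.93 as the memory fraction from the launch command. It only estimates raw 4-bit weight storage; the gap up to the roughly 40GB quoted above presumably comes from quantization metadata and layers that stay unquantized, which is also an assumption.

```python
# Assumed model figures (not from the post): 64 attention heads, ~72e9 parameters.
NUM_ATTENTION_HEADS = 64
NUM_PARAMS = 72e9

# Hardware and launch settings taken from the post.
GPU_COUNT = 3
VRAM_PER_GPU_GB = 24
GPU_MEMORY_UTILIZATION = 0.93


def tensor_parallel_ok(num_heads, tp_size):
    """Tensor parallelism splits heads evenly, so the count must divide cleanly."""
    return num_heads % tp_size == 0


def weight_gb(num_params, bits_per_weight):
    """Raw weight storage in GB (1e9 bytes), ignoring scales and other overhead."""
    return num_params * bits_per_weight / 8 / 1e9


def vllm_budget_gb(gpu_count, vram_per_gpu_gb, utilization):
    """Total VRAM that vLLM is allowed to claim across all GPUs."""
    return gpu_count * vram_per_gpu_gb * utilization


print(tensor_parallel_ok(NUM_ATTENTION_HEADS, 3))  # False: 64 is not divisible by 3
print(tensor_parallel_ok(NUM_ATTENTION_HEADS, 2))  # True
print(weight_gb(NUM_PARAMS, 4))                    # 36.0
print(round(vllm_budget_gb(GPU_COUNT, VRAM_PER_GPU_GB, GPU_MEMORY_UTILIZATION), 2))  # 66.96
```

Under these assumptions three-way tensor parallelism is ruled out, which is why the command splits the model by layers (pipeline parallelism) instead.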

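Recommended settings for your 72GB of total VRAM (3×24GB):

Profile	max_model_len	Max concurrency	Use case
Conservative, stable	8192 (8K)	16-32	General chat, RAG
Balanced (recommended)	16384 (16K)	8-16	Long-document analysis
Aggressive	20480 (20K)	4-8	Code generation (high risk)
Not feasible	32768 (32K)	-	Runs out of VRAM and fails with an error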
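
To see why longer contexts force lower concurrency, here is a back-of-the-envelope KV cache estimate for one full-length sequence at each max_model_len in the table. It assumes Qwen2.5-72B uses 80 layers, 8 key/value heads, a head dimension of 128 and a 16-bit KV cache; none of these figures come from this post, and the estimate ignores activations and framework overhead, so treat it as a lower bound only.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache per token; the leading 2 counts the key and the value."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_gib_for_sequence(max_model_len):
    """KV cache in GiB for a single sequence that fills the whole context."""
    return kv_bytes_per_token() * max_model_len / 2**30


print(kv_bytes_per_token())  # 327680 bytes, i.e. 320 KiB per token
for max_model_len in (8192, 16384, 20480, 32768):
    print(max_model_len, kv_gib_for_sequence(max_model_len))
# 8192 2.5
# 16384 5.0
# 20480 6.25
# 32768 10.0
```

The cost grows linearly with context length, so each doubling of max_model_len roughly halves how many full-length requests fit in the memory left after the weights.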

4. Firewall / network configuration

# Open port 8000 (use whichever port you actually serve on)
sudo firewall-cmd --permanent --add-port=8000/tcp
sudo firewall-cmd --reload

# Or use iptables
sudo iptables -I INPUT -p tcp --dport 8000 -j ACCEPT
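
To confirm the firewall change took effect, a client machine can simply try to open a TCP connection to the port. This is a minimal standard-library sketch; port_is_open is a name invented for this example, and 192.168.1.100 in the usage comment is a placeholder for the server's LAN IP.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: all mean the port is not usable.
        return False


# Usage from another machine on the LAN:
#   print(port_is_open("192.168.1.100", 8000))
```

A False result with the server running usually points at the firewall rule or at --host not being 0.0.0.0.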

Calling the service from a client

Other machines reach the server over HTTP through the OpenAI-compatible API:

Python client

from openai import OpenAI

client = OpenAI(
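    base_url="http://192.168.1.100:8000/v1",  # replace 192.168.1.100 with the server's LAN IP
    api_key="dummy"  # vLLM does not need a real key unless it was started with --api-key
)

response = client.chat.completions.create(
    model="qwen-72b",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the advantages and caveats of deploying a large model across three GPUs"}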
    ],
    temperature=0.7,
    max_tokens=2048
)

print(response.choices[0].message.content)
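
With max_tokens=2048 a full answer can take a while to arrive, so streaming is often nicer for interactive use. Below is a sketch of a streaming variant; stream_chat is a hypothetical helper that takes an already constructed OpenAI client, and it relies on the stream=True option of the openai package's chat completions call.

```python
def stream_chat(client, model, prompt, max_tokens=2048):
    """Yield text fragments as the server produces them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


# Usage (needs the openai package and a running server):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://192.168.1.100:8000/v1", api_key="dummy")
#   for piece in stream_chat(client, "qwen-72b", "Hello"):
#       print(piece, end="", flush=True)
```

The model argument is the name passed to --served-model-name when the server was started.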

Testing with curl

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-72b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7
  }'