Running Qwen3.6-27B on 8 × RTX 5090: Real Numbers from Installing vLLM to Load Testing and Tuning (Full Scripts Included)

李燚 · Published 2026-05-20 07:00:00

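This post covers everything it took to get an 8 × RTX 5090 machine from a fresh vLLM install to stably serving 65 QPS: the process, the numbers, and the scripts. The numbers are measured unless a row is explicitly marked as an estimate, and the scripts are the versions running in production.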

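1. What the machine looks like

Component	Configuration
GPU	8 × RTX 5090 (32 GB GDDR7 each, 256 GB VRAM total)
CPU	2 × Intel Xeon Gold 6530 (128 threads total)
Memory	512 GB DDR5-5600
Storage	7 TB NVMe (data disk) + 894 GB SATA SSD (system disk)
OS	Ubuntu 26.04 LTS + Kernel 6.17
Driver / CUDA	NVIDIA 580.142 + CUDA 13.1

The NUMA topology is a typical 4+4 split: GPUs 0-3 sit on NUMA node 0 and GPUs 4-7 on NUMA node 1. For now only the first four cards serve the LLM; the other four are reserved for image generation.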

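The hardware's weak points, up front:

Consumer 5090s have no NVLink, so GPU-to-GPU traffic has to go over PCIe, which is 30-50% slower than on professional cards.
The P2P matrix reported by nvidia-smi reads "Chipset not supported" in every cell: the driver disables P2P on GeForce cards.
As a result, the AllReduce in tensor parallelism (TP) is staged through host memory, which is the main source of the throughput ceiling.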
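
To check the P2P status programmatically rather than by eye, the matrix can be parsed into pairs. The sketch below is a minimal example under assumptions: the `SAMPLE` text is hand-written for illustration, modeled on the kind of matrix `nvidia-smi topo -p2p r` prints, with `CNS` standing for "Chipset not supported"; the real layout may differ by driver version, and `parse_p2p_matrix` and `p2p_available` are hypothetical helper names.

```python
def parse_p2p_matrix(text: str) -> dict[tuple[str, str], str]:
    """Parse a GPU-by-GPU status matrix into {(row_gpu, col_gpu): status}."""
    header = None
    matrix = {}
    for line in text.splitlines():
        cells = line.split()
        if not cells:
            continue
        if header is None:
            # The header row consists only of GPU names (GPU0, GPU1, ...).
            if all(c.startswith("GPU") for c in cells):
                header = cells
            continue
        # Data rows: a GPU name followed by one status code per column.
        if cells[0].startswith("GPU") and len(cells) == len(header) + 1:
            for col, status in zip(header, cells[1:]):
                matrix[(cells[0], col)] = status
    return matrix


def p2p_available(matrix: dict[tuple[str, str], str]) -> bool:
    """True only if every off-diagonal GPU pair reports OK."""
    pairs = [status for (a, b), status in matrix.items() if a != b]
    return bool(pairs) and all(status == "OK" for status in pairs)


# Illustrative sample only, not captured from the machine in this post.
SAMPLE = """
        GPU0    GPU1    GPU2    GPU3
GPU0    X       CNS     CNS     CNS
GPU1    CNS     X       CNS     CNS
GPU2    CNS     CNS     X       CNS
GPU3    CNS     CNS     CNS     X
"""

if __name__ == "__main__":
    print("P2P usable:", p2p_available(parse_p2p_matrix(SAMPLE)))
```

If this returns False for the cards you plan to use for TP, expect AllReduce to be staged through host memory as described above.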

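2. Software stack: why vLLM and not SGLang

I compared vLLM, SGLang, and TensorRT-LLM. The reasons vLLM won:

Framework	sm_120 (Blackwell) support	Model loading difficulty	Chosen
vLLM 0.20.2	Supported on mainline; CUDA 13 + PyTorch 2.11 are the defaults	Easy	✅
SGLang	FP8 blockwise on sm_120 trails vLLM by about half a year	Medium	❌
TensorRT-LLM	NVIDIA's own stack, best performance	Very high (every parameter change needs a rebuild)	❌

The model is Qwen3.6-27B (open-sourced in April 2026; dense 27B, native 262K context, 77.2% on SWE-bench).

One key detail: Qwen3.6-27B has a known issue where it produces garbled output on CUDA 13.2, so it has to run on CUDA 13.1 or 12.x. This machine's 13.1 is in the safe zone.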
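
A guard like the one below can catch the wrong CUDA version before the service starts. It is a minimal sketch: `cuda_version_ok` is a hypothetical helper, and it encodes only the rule stated above (12.x and 13.1 are accepted, everything else is treated as unverified).

```python
def cuda_version_ok(version: str) -> bool:
    """Accept CUDA 12.x and 13.1; reject 13.2 and any other version."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if major == 12:
        return True
    return (major, minor) == (13, 1)


if __name__ == "__main__":
    for v in ("12.8", "13.1", "13.2"):
        print(v, "ok" if cuda_version_ok(v) else "not verified for this model")
```

In a PyTorch environment the version string could come from `torch.version.cuda`, which reports the CUDA version PyTorch was built against.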

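3. Round one: how far do the default parameters go?

Once the service was up, I load tested it with a mixed workload: 80% direct answers (short outputs) + 20% deep thinking (long outputs), with a maximum output of 1024 tokens.

Round-one results (max-num-seqs=64, max-model-len=65536):

Concurrent requests	QPS	TTFT P99	End-to-end latency P99
10	1.8	6.9 s	24 s
30	5.3	223 ms	19 s
64	8.7	334 ms	27 s

QPS topped out at 8.7, so something was clearly holding it back.

vLLM's own log showed what:

Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 25%

Concurrency was pinned at max-num-seqs=64 while the KV cache was only 25% full, leaving three times the used amount still free.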
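
The reasoning from that log line can be written down as a small diagnostic. This is a sketch under assumptions: the regular expressions match the two status-line shapes quoted in this post, not necessarily every vLLM version's log format; the 90% "cache is full" threshold is my own choice; and the headroom estimate assumes KV usage scales linearly with the number of running requests.

```python
import re


def parse_status(line: str):
    """Extract running/waiting counts and KV cache usage from a status line."""
    running = re.search(r"Running:\s*(\d+)\s*reqs", line)
    waiting = re.search(r"Waiting:\s*(\d+)\s*reqs", line)
    kv = re.search(r"KV cache(?: usage:)?\s*(\d+(?:\.\d+)?)%", line)
    if not running or not kv:
        return None
    return {
        "running": int(running.group(1)),
        "waiting": int(waiting.group(1)) if waiting else None,
        "kv_pct": float(kv.group(1)),
    }


def diagnose(status: dict, max_num_seqs: int, kv_full_pct: float = 90.0) -> str:
    """Name the limit the scheduler is most likely hitting."""
    if status["kv_pct"] >= kv_full_pct:  # assumed threshold
        return "kv-cache-bound"
    if status["running"] >= max_num_seqs:
        return "max-num-seqs-bound"
    return "not-saturated"


def estimated_max_seqs(status: dict) -> int:
    """Linear extrapolation: how many sequences would fill the KV cache."""
    return round(status["running"] * 100 / status["kv_pct"])


if __name__ == "__main__":
    s = parse_status("Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 25%")
    print(diagnose(s, max_num_seqs=64), estimated_max_seqs(s))
```

For the round-one line this reports a max-num-seqs limit and an estimate of 256 sequences, which is the "4× the concurrency" headroom used in the next section.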

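4. Round two: real capacity after tuning

I changed five parameters:

Parameter	Old	New	Why
`max-num-seqs`	64	256	KV cache was only 25% used, enough for 4× the concurrency
`max-model-len`	65536	16384	Shrink per-request context to free room for more concurrent requests
`enable-chunked-prefill`	off	on	Long prompts are processed in chunks and no longer block decode
`max-num-batched-tokens`	default	16384	Per-step token cap for chunked prefill
`gpu-memory-utilization`	0.90	0.92	Squeeze out a little more VRAM

After a restart I ran two scenario-based load tests.

Scenario 1: general chat (max_tokens=150-256)

Concurrent requests	QPS	TTFT P99	End-to-end latency P99
10	4.7	172 ms	2.9 s
30	15.1	209 ms	3.7 s
64	22.9	255 ms	5.0 s
128	29.1	458 ms	8.3 s
200	32.2	694 ms	13.5 s

Scenario 2: short Q&A (max_tokens=50, simulating customer service / translation / simple lookups)

Concurrent requests	QPS	TTFT P99	End-to-end latency P99
10	13.4	168 ms	1.5 s
30	28.1	193 ms	1.7 s
64	45.2	292 ms	2.3 s
128	60.2	419 ms	3.4 s
200	64.9 ⭐	727 ms	5.0 s
256	64.7	843 ms	6.3 s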

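The two adjacent tiers have practically the same QPS (64.9 vs 64.7), a textbook sign that the service has hit its ceiling. Adding concurrency beyond this point only makes requests queue.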
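
The "flat between adjacent tiers" check is easy to automate. The sketch below uses the measured (concurrency, QPS) pairs from the two tables above; the 5% relative-gain threshold and the name `saturation_point` are my own assumptions, not part of any tool.

```python
# (concurrency, QPS) pairs copied from the two scenario tables.
GENERAL_CHAT = [(10, 4.7), (30, 15.1), (64, 22.9), (128, 29.1), (200, 32.2)]
SHORT_QA = [(10, 13.4), (30, 28.1), (64, 45.2), (128, 60.2), (200, 64.9), (256, 64.7)]


def saturation_point(points, min_gain=0.05):
    """Return the first concurrency whose next tier adds less than min_gain
    relative QPS, or None if QPS is still growing at every step."""
    for (c0, q0), (_c1, q1) in zip(points, points[1:]):
        if (q1 - q0) / q0 < min_gain:
            return c0
    return None


if __name__ == "__main__":
    print("short Q&A saturates at concurrency:", saturation_point(SHORT_QA))
    print("general chat saturates at concurrency:", saturation_point(GENERAL_CHAT))
```

With these numbers the short Q&A curve flattens at 200 concurrent requests, while the general chat curve is still gaining about 10% at its last measured step, so by this threshold it has not fully flattened.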

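Actual generation speed

Metric	Value
Total token throughput per instance	about 2500 tokens/s
Per-request generation speed (streamed to a single user)	about 80 tokens/s
Per-token latency	about 18 ms (feels very smooth to the user)
Per-GPU throughput averaged over 4 cards	about 625 tokens/s per GPU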

5. Where exactly is the QPS ceiling?

When the short-output scenario reached 65 QPS, the vLLM log showed:

Running: 241 reqs, KV cache 96%, Avg gen throughput: 2000 tok/s

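This time the limit was the KV cache at 96%. The 256-concurrency tier also came in slightly below the 200 tier (64.7 vs 64.9), which suggests the KV cache was full enough to start preempting requests.

At the general-chat peak of 32 QPS, the KV cache was at 81% with 195 requests running.

Taken together, the two data sets say that the physical ceiling of 4 × 5090 without P2P is 2000-2500 tokens/s. On the same hardware, the output length of the workload determines the QPS ceiling.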

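6. Realistic QPS ceilings by workload

Workload	Avg. output tokens	Time per request	QPS ceiling per instance	Source
Short Q&A / customer service	30	0.4 s	65	✅ measured
Translation / short rewrites	50	0.6 s	about 50	estimated
Code completion	80	1.0 s	about 40	estimated
General chat	150	2 s	32	✅ measured
Long-document summarization	500	6 s	about 10	estimated

Only the short Q&A and general chat rows are measured; the other rows are extrapolated from a total throughput of 2500 tok/s.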
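
The simplest way to read such an extrapolation is to divide total token throughput by the average output length. The sketch below shows that first-order model only; `qps_ceiling` is a hypothetical helper, and the rounded estimates in the table need not match this division exactly, since real outputs are often shorter than the cap and prefill cost is ignored here.

```python
def qps_ceiling(total_tokens_per_s: float, avg_output_tokens: float) -> float:
    """First-order QPS ceiling: generated tokens per second / tokens per request."""
    if avg_output_tokens <= 0:
        raise ValueError("avg_output_tokens must be positive")
    return total_tokens_per_s / avg_output_tokens


if __name__ == "__main__":
    # 2500 tok/s and 50 output tokens per request, as in the translation row.
    print(f"{qps_ceiling(2500, 50):.0f} QPS")
```

For the translation row (50 tokens at 2500 tok/s) this gives 50 QPS. Treat the result as a rough bound to sanity-check capacity plans, then measure with your own traffic.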

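7. Thinking mode vs direct-answer mode

Qwen3.6 is a new-generation "thinking" model with thinking enabled by default. A comparison:

Prompt	Direct-answer time	Thinking-mode time	Reasoning chain length
"Introduce Beijing in one sentence"	0.44 s	7.4 s (truncated by max_tokens)	2200+ tokens
"Three switches and a light bulb puzzle"	3.7 s	7.4 s (truncated)	2000+ tokens
"Python Fibonacci generator"	0.85 s	7.4 s (truncated)	1800+ tokens

Conclusion: a production API should turn thinking off by default and let clients enable it explicitly through a parameter. Simple conversation doesn't need it; complex reasoning tasks do.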
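
One way to apply that default on the client or gateway side is to build every request through a helper that sets the flag explicitly. This is a minimal sketch: `build_chat_payload` is a hypothetical function, while the `chat_template_kwargs` / `enable_thinking` fields are the same ones the load test scripts in the appendix send.

```python
def build_chat_payload(model: str, user_text: str,
                       max_tokens: int = 256, thinking: bool = False) -> dict:
    """Build a chat completion request body with thinking off unless requested."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        # Passed through to the chat template; False gives direct answers.
        "chat_template_kwargs": {"enable_thinking": bool(thinking)},
    }


if __name__ == "__main__":
    quick = build_chat_payload("qwen3.6-27b", "Introduce Beijing in one sentence")
    deep = build_chat_payload("qwen3.6-27b", "Solve the three-switch puzzle",
                              max_tokens=4096, thinking=True)
    print(quick["chat_template_kwargs"], deep["chat_template_kwargs"])
```

Callers that need reasoning pass `thinking=True` together with a larger `max_tokens`, since the reasoning chains measured above ran to around 2000 tokens before being cut off.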

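8. Where is there still room to optimize?

The current 65 QPS / 32 QPS figures are at BF16 precision. Options still on the table:

Optimization	Expected QPS gain	Effort
Shrink max-model-len to 8192	Short-output scenario could reach 100+ QPS	5 minutes
INT8 quantization (W8A8)	+80-100%	half a day
AWQ-W4 quantization	+50-70%	half a day
N-gram speculative decoding	+30-50%	1 day
Split into two instances (DP=2 + TP=2)	+30-50%	several days

65 QPS for short Q&A and 32 QPS for general chat are enough for now, so the plan is to run stably first and leave quantization until the system has settled.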

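9. A few "soft" details

First: running a production API on consumer cards is technically fine but legally gray. The NVIDIA GeForce driver EULA prohibits using GeForce cards in data center environments. Enforcement in China is lax, but you should know about it.

Second: Ubuntu 26.04 + Kernel 6.17 were released in April 2026, and nearly every ML framework's official test target is still 24.04. The price of the new system was several days lost to DKMS build problems. If you are still choosing, I strongly recommend Ubuntu 24.04.

Third: with eight consumer cards packed into one chassis, peak system power reaches 5.5 kW. Power delivery, cooling, and noise all need planning.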

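10. Summary

Dimension	Data
Short Q&A QPS	65 (measured)
General chat QPS	32 (measured)
Peak token throughput per instance	2500 tokens/s
Per-token latency (as felt by the user)	18 ms, smooth
TTFT P99 (short output, 200 concurrent)	727 ms
Good fit	Mid-scale API services, internal enterprise AI tools, latency-sensitive real-time applications
Poor fit	Very high throughput (consumer apps at 1000+ QPS, which need a cluster)

Using four of the eight RTX 5090s to run Qwen3.6-27B in BF16, a single machine stably serves 65 QPS for short Q&A and sustains 32 QPS for general chat. With a smaller model (such as Qwen3.6-7B), the same hardware could roughly double that QPS.

The remaining four cards are running an image generation service, which the next post will cover.

That's all the data. Whether it's worth it depends on each team's traffic, so run your own numbers.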

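Appendix: complete reproducible scripts

Below are the scripts this service actually runs. They can be copied as-is (adjust paths as needed).

Startup script `/data/services/start-vllm.sh`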

#!/bin/bash
# pipefail: report a vllm failure instead of the exit status of the trailing tee
set -eo pipefail

MODEL_PATH="/data/models/llm/Qwen3.6-27B-source"
LOG_FILE="/data/logs/vllm.log"
PORT=8000

source /data/envs/llm/bin/activate

# Key environment variables for Blackwell + RTX 5090
export VLLM_ATTENTION_BACKEND=FLASHINFER
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=0
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=1
export NCCL_DMABUF_ENABLE=1
export NCCL_CUMEM_ENABLE=1

# Use only the 4 cards on NUMA node 0
export CUDA_VISIBLE_DEVICES=0,1,2,3

# vLLM settings
export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export PYTHONUNBUFFERED=1

# cuDNN path (lets PyTorch find the pip-installed cuDNN)
SITE_PACKAGES=$(/data/envs/llm/bin/python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH=${SITE_PACKAGES}/nvidia/cudnn/lib:${SITE_PACKAGES}/nvidia/cublas/lib:${LD_LIBRARY_PATH}

# Bind to NUMA node 0
exec numactl --cpunodebind=0 --membind=0 \
    vllm serve "$MODEL_PATH" \
        --served-model-name qwen3.6-27b \
        --host 0.0.0.0 \
        --port "$PORT" \
        --tensor-parallel-size 4 \
        --max-model-len 16384 \
        --gpu-memory-utilization 0.92 \
        --max-num-seqs 256 \
        --enable-chunked-prefill \
        --max-num-batched-tokens 16384 \
        --enable-prefix-caching \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder \
        --reasoning-parser qwen3 \
        --disable-custom-all-reduce \
        --trust-remote-code \
        2>&1 | tee -a "$LOG_FILE"

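A few key design points

numactl --cpunodebind=0 --membind=0: pins the whole vLLM process to NUMA node 0 to avoid cross-socket memory access. Both CPU and memory are bound.
VLLM_ATTENTION_BACKEND=FLASHINFER: Blackwell does not support FlashAttention-3, so the attention backend is switched to FlashInfer.
NCCL_P2P_DISABLE=1 + NCCL_DMABUF_ENABLE=1: the two workarounds for P2P being disabled on consumer cards.
cuDNN/cuBLAS path injection: cuDNN is pip-installed inside the Python environment, where PyTorch doesn't find it by default, so the path has to be set explicitly.
tool-call-parser qwen3_coder + reasoning-parser qwen3: lets vLLM parse Qwen3.6's tool calls and reasoning output correctly.

Stop script `/data/services/stop-vllm.sh`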

#!/bin/bash
pkill -SIGTERM -f "VLLM::" 2>/dev/null
pkill -SIGTERM -f "vllm serve" 2>/dev/null
sleep 3
pkill -9 -f "VLLM::" 2>/dev/null
pkill -9 -f "vllm serve" 2>/dev/null
sleep 2
echo "=== Leftover processes ==="
ps aux | grep -E "VLLM|vllm" | grep -v grep || echo "  none"
echo "=== GPU usage ==="
nvidia-smi --query-compute-apps=pid,process_name --format=csv
echo "=== Port 8000 ==="
sudo ss -lntp 2>/dev/null | grep ':8000 ' || echo "  port is free"

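Why it is written this way: after startup, vLLM spawns several VLLM::Worker_TPx child processes whose names do not contain the lowercase string "vllm". At first I used pkill -f "vllm", which left those workers alive, and every restart hit OOM. This script sends SIGTERM first for a graceful exit, follows with SIGKILL after 3 seconds, and finally reports what is left, so you can see at a glance whether the cleanup worked.

General-scenario load test script `/tmp/realistic_test.py`

The script behind the 32 QPS data set: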
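
The same SIGTERM-then-SIGKILL pattern is available from Python when the server is launched by a supervisor script rather than matched by name. The sketch below is an illustration for a process you started yourself with `subprocess.Popen`; `terminate_gracefully` is a hypothetical helper and does not replace the pkill patterns above, which also catch the worker processes.

```python
import subprocess
import sys


def terminate_gracefully(proc: subprocess.Popen, grace_s: float = 3.0) -> str:
    """Ask the process to exit, wait up to grace_s seconds, then force-kill it."""
    if proc.poll() is not None:
        return "already-exited"
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in child process; a real supervisor would hold the vLLM Popen handle.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(terminate_gracefully(child))
```

The return value tells the caller whether the grace period was enough, which is the same information the stop script prints in its leftover-process report.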

import asyncio, aiohttp, time, statistics, json, random

URL = "http://localhost:8000/v1/chat/completions"
DURATION = 30

PROMPTS = [
    "今天天气怎么样?",
    "推荐一首歌",
    "你好",
    "帮我翻译: hello world",
    "讲个笑话",
    "什么是 Python?",
    "1+1 等于几",
    "北京有什么景点",
    "怎么减肥",
    "明天会下雨吗",
]

async def one_request(session, results):
    payload = {
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": random.choice(PROMPTS)}],
        "max_tokens": 150,
        "temperature": 0.7,
        "stream": True,
        "chat_template_kwargs": {"enable_thinking": False},
    }
    t0 = time.perf_counter()
    first = None
    n = 0
    try:
        async with session.post(URL, json=payload, timeout=aiohttp.ClientTimeout(total=60)) as r:
            async for line in r.content:
                line = line.decode().strip()
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                try:
                    chunk = json.loads(line[6:])
                    delta = chunk["choices"][0].get("delta", {})
                    if delta.get("content") or delta.get("reasoning"):
                        if first is None:
                            first = time.perf_counter()
                        n += 1
                except Exception:
                    pass
        t1 = time.perf_counter()
        if first and n > 0:
            results.append({
                "ttft_ms": (first - t0) * 1000,
                "total_ms": (t1 - t0) * 1000,
                "tokens": n,
            })
    except Exception as e:
        results.append({"error": str(e)[:80]})

async def worker(session, results, stop_at):
    while time.perf_counter() < stop_at:
        await one_request(session, results)

async def run(concurrency, duration):
    print(f"\nconcurrency {concurrency} × {duration}s")
    results = []
    started = time.perf_counter()
    stop_at = started + duration
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=concurrency*2)) as session:
        tasks = [worker(session, results, stop_at) for _ in range(concurrency)]
        await asyncio.gather(*tasks)
    # Requests in flight at stop_at still finish and are counted, so divide by
    # the real wall-clock time rather than the nominal duration.
    elapsed = time.perf_counter() - started

    ok = [r for r in results if "error" not in r]
    err = [r for r in results if "error" in r]
    if not ok:
        print(f"  all requests failed: {err[:1]}")
        return

    ttfts = sorted([r["ttft_ms"] for r in ok])
    totals = sorted([r["total_ms"] for r in ok])
    total_tokens = sum(r["tokens"] for r in ok)
    actual_qps = len(ok) / elapsed

    def pct(arr, p):
        # Nearest-rank percentile on a sorted list
        i = max(0, min(len(arr)-1, int(len(arr)*p/100)))
        return arr[i]

    print(f"  ok {len(ok)} failed {len(err)} | QPS {actual_qps:.1f} | gen ~{total_tokens/elapsed:.0f} chunk/s")
    print(f"  TTFT    P50/P99: {statistics.median(ttfts):.0f} / {pct(ttfts,99):.0f} ms")
    print(f"  latency P50/P99: {statistics.median(totals):.0f} / {pct(totals,99):.0f} ms")

async def main():
    for c in [10, 30, 64, 128, 200]:
        await run(c, DURATION)

asyncio.run(main())
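
The script above counts streamed chunks, not tokens, which is why it reports chunk/s. If exact token counts matter, the OpenAI-compatible streaming API has a `stream_options: {"include_usage": true}` request field that adds a final chunk carrying a `usage` object; whether the deployed server version honors it is an assumption to verify. The sketch below only shows how such a stream could be read: `extract_usage` is a hypothetical helper and `SAMPLE_STREAM` is hand-written, not captured output.

```python
import json


def extract_usage(sse_lines):
    """Return the usage dict from the last SSE chunk that carries one, else None."""
    usage = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip keep-alives or partial lines
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Hand-written example of a stream with a trailing usage chunk.
SAMPLE_STREAM = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 2, "total_tokens": 11}}',
    "data: [DONE]",
]

if __name__ == "__main__":
    print(extract_usage(SAMPLE_STREAM))
```

Summing `completion_tokens` across requests and dividing by the elapsed time would give tokens/s directly instead of the chunk-based approximation.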

Short-output load test script `/tmp/short_output_test.py`

The script behind the 65 QPS data set:

import asyncio, aiohttp, time, statistics, json, random

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "qwen3.6-27b"
DURATION = 30

SHORT_PROMPTS = [
    "1+1=?", "中国首都是哪里", "今天周几", "你好",
    "Python 怎么读取文件", "翻译: cat", "什么是 GPU",
    "JavaScript 缩写", "圆周率前 5 位", "推荐一本书",
    "周末快乐怎么说", "Hello", "1024 是 2 的几次方",
    "HTTP 默认端口", "微信英文是",
]

async def one_request(session, results):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": random.choice(SHORT_PROMPTS)}],
        "max_tokens": 50,
        "temperature": 0.7,
        "stream": True,
        "chat_template_kwargs": {"enable_thinking": False},
    }
    t0 = time.perf_counter()
    first = None
    n = 0
    try:
        async with session.post(URL, json=payload, timeout=aiohttp.ClientTimeout(total=30)) as r:
            async for line in r.content:
                line = line.decode().strip()
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                try:
                    chunk = json.loads(line[6:])
                    delta = chunk["choices"][0].get("delta", {})
                    if delta.get("content") or delta.get("reasoning"):
                        if first is None:
                            first = time.perf_counter()
                        n += 1
                except Exception:
                    pass
        t1 = time.perf_counter()
        if first and n > 0:
            results.append({
                "ttft_ms": (first - t0) * 1000,
                "total_ms": (t1 - t0) * 1000,
                "tokens": n,
            })
    except Exception as e:
        results.append({"error": str(e)[:80]})

async def worker(session, results, stop_at):
    while time.perf_counter() < stop_at:
        await one_request(session, results)

async def run(concurrency, duration):
    print(f"\nconcurrency {concurrency} × {duration}s (short output: max_tokens=50)")
    results = []
    started = time.perf_counter()
    stop_at = started + duration
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=concurrency*2)) as session:
        tasks = [worker(session, results, stop_at) for _ in range(concurrency)]
        await asyncio.gather(*tasks)
    # Requests in flight at stop_at still finish and are counted, so divide by
    # the real wall-clock time rather than the nominal duration.
    elapsed = time.perf_counter() - started

    ok = [r for r in results if "error" not in r]
    err = [r for r in results if "error" in r]
    if not ok:
        print(f"  all requests failed: {err[:1]}")
        return

    ttfts = sorted([r["ttft_ms"] for r in ok])
    totals = sorted([r["total_ms"] for r in ok])
    token_counts = [r["tokens"] for r in ok]
    avg_tokens = sum(token_counts) / len(token_counts)
    actual_qps = len(ok) / elapsed

    def pct(arr, p):
        # Nearest-rank percentile on a sorted list
        i = max(0, min(len(arr)-1, int(len(arr)*p/100)))
        return arr[i]

    print(f"  ok {len(ok)} failed {len(err)} | QPS {actual_qps:.1f} | avg output {avg_tokens:.0f} chunks")
    print(f"  TTFT    P50/P99: {statistics.median(ttfts):.0f} / {pct(ttfts,99):.0f} ms")
    print(f"  latency P50/P99: {statistics.median(totals):.0f} / {pct(totals,99):.0f} ms")

async def main():
    print("=" * 70)
    print("Short-output load test: max_tokens=50, short prompts, thinking disabled")
    print("=" * 70)
    for c in [10, 30, 64, 128, 200, 256]:
        await run(c, DURATION)

asyncio.run(main())

Full workflow from startup to load test

# 1. Start vLLM (run in the foreground; once it is up, detach the tmux session with Ctrl+B, D, or run it under systemd)
bash /data/services/start-vllm.sh

# Wait until the log shows:
# Uvicorn running on http://0.0.0.0:8000

# 2. Verify the service is up
curl -sf http://localhost:8000/v1/models > /dev/null && echo "✅ service is up"

# 3. Warm up (removes cold-start latency)
seq 1 30 | xargs -P 30 -I {} curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3.6-27b","messages":[{"role":"user","content":"hi"}],
         "max_tokens":30,"chat_template_kwargs":{"enable_thinking":false}}' > /dev/null

# 4. Run the short-output load test (peak QPS)
python3 /tmp/short_output_test.py

# 5. Run the general-scenario load test (mid-range QPS)
python3 /tmp/realistic_test.py

# 6. Stop the service (frees VRAM after testing)
bash /data/services/stop-vllm.sh
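
Steps 1 and 2 involve watching the log by hand. When the workflow is automated, a polling loop against `/v1/models` can stand in for that wait. This is a minimal sketch under assumptions: `http_ok` and `wait_until_ready` are hypothetical helpers, and the 600-second timeout and 2-second interval are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 2.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll url until probe succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8000/v1/models", timeout_s=10.0)
    print("service is up" if ready else "service did not come up in time")
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running server; in normal use the defaults are enough.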
