Deploying vLLM + Qwen3-Coder-Next on a Local Server
First, my server configuration:
- GPU: NVIDIA RTX PRO 6000 Blackwell Edition (high-end professional workstation card)
- VRAM: 97887 MiB total (about 95.6 GB); 87788 MiB (about 85.7 GB) currently in use, roughly 9.9 GB free
- Driver / compute: NVIDIA driver 580.119.02, CUDA 13.0 (compatible with recent deep-learning frameworks)
nvidia-smi
Thu Mar 12 20:23:09 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02 Driver Version: 580.119.02 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:01:00.0 Off | Off |
| 30% 36C P8 19W / 600W | 87788MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 57161 C VLLM::EngineCore 87728MiB |
| 0 N/A N/A 140517 G /usr/lib/xorg/Xorg 40MiB |
+-----------------------------------------------------------------------------------------+
Step 1: Set up the virtual environment
- The system's original Python environment was Python 3.13.3
- Install Miniconda3 to prepare the runtime environment
- Configure the system mirror sources (Huawei's internal mirrors are used here, because this is an intranet environment with restrictions on which mirrors are reachable)
sudo sed -i "s@http://.*security.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources
sudo sed -i "s@http://.*archive.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources
pip config set global.trusted-host mirrors.tools.huawei.com
pip config set global.index-url http://mirrors.tools.huawei.com/pypi/simple/
Find the .condarc file under the Miniconda3 installation directory and configure the mirror sources there as well:
channels:
  - defaults
show_channel_urls: true
default_channels:
  - main
channel_alias: http://conda.rnd.huawei.com/repository/conda-proxy
channel_priority: strict
envs_dirs:
  - /opt/miniconda3/envs
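To confirm the mirror configuration took effect, a quick check like the following can be used (a minimal sketch):
conda config --show channels
conda config --show channel_alias
pip config list   # should show the mirrors.tools.huawei.com index configured earlier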
Once the environment is configured, create a virtual environment with an appropriate Python version:
conda create --name VLLM python=3.13 -y   # create the virtual environment
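Before installing anything into it, a quick check that the new environment is active (a minimal sketch; VLLM is the environment name created above):
conda activate VLLM
python --version   # should report Python 3.13.x
which python       # should resolve inside the VLLM environment, e.g. under /opt/miniconda3/envs/VLLM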
Step 2: Install the prerequisites for vLLM
- sudo apt update (can be skipped if already up to date)
- sudo apt install -y build-essential dkms (can be skipped if already installed)
- sudo update-initramfs -u
- Install the NVIDIA driver; if running nvidia-smi prints a table like the one above, the driver is working
- Download the CUDA toolkit matching the CUDA version reported by nvidia-smi from https://developer.nvidia.com/cuda-toolkit-archive, choosing the build for your system. In my case nvidia-smi reports CUDA 13.0, so the installer is cuda_13.0.0_580.65.06_linux.run, which I downloaded to /home/zhike/data/program
- sudo chmod a+x cuda_13.0.0_580.65.06_linux.run
- sudo bash cuda_13.0.0_580.65.06_linux.run
- Configure the environment variables (note: CUDA_HOME must be a single path; appending with a colon, e.g. $CUDA_HOME:/usr/local/cuda-13.0, is what produces the ":/usr/local/cuda-13.0/bin/nvcc: not found" error described later)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-13.0/lib64
export PATH=$PATH:/usr/local/cuda-13.0/bin
export CUDA_HOME=/usr/local/cuda-13.0
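After reloading the shell configuration (source ~/.bashrc), a quick sanity check that the toolkit is visible (a minimal sketch):
nvcc --version     # should report release 13.0
echo "$CUDA_HOME"  # should print exactly /usr/local/cuda-13.0, with no stray colon
which nvcc         # should resolve to /usr/local/cuda-13.0/bin/nvcc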
Step 3: Install PyTorch
- Download the PyTorch wheel matching your CUDA and Python versions from https://download.pytorch.org/whl/torch/
- source ~/.bashrc
- conda activate VLLM — once inside the virtual environment, install the torch wheel
- pip install torch-2.10.0+cu130-cp313-cp313-manylinux_2_28_x86_64.whl
- Once that succeeds, run pip install vllm
- After installation, run vllm --version; if it prints a version number, the installation succeeded (a quick end-to-end check is sketched below)
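A slightly fuller check that the CUDA-enabled torch wheel and vLLM line up (a minimal sketch, run inside the VLLM environment):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"   # should show the cu130 build and True
vllm --version   # should print the installed vLLM version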
Download the Qwen3-Coder-Next model
- Download Qwen3-Coder-Next into the directory where you want to keep it
- Download it with Hugging Face, ModelScope, or git (installing vLLM pulls in the Hugging Face hub client); a hedged example follows this list
- Make sure you end up with the actual weight files, not Git LFS pointer files
- Then cd into the model directory and run the launch command
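A minimal download sketch, assuming the repository id is Qwen/Qwen3-Coder-Next (the same name used later for --served-model-name) and the target path matches the one used in the launch commands below; adjust both to your environment:
# Option 1: Hugging Face hub CLI (pulled in alongside vLLM)
huggingface-cli download Qwen/Qwen3-Coder-Next --local-dir /home/zhike/data/ModelScope/Qwen3-Coder-Next
# Option 2: git clone the model repository, then run "git lfs pull" inside it so LFS pointer files
# are replaced by the real weight files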
Step 4: Launch
- This command launches an already-quantized Qwen3-Coder-Next, but the original (unquantized) weights will not start this way:
python -m vllm.entrypoints.api_server --model . --trust-remote-code --port 8000 --host 0.0.0.0 \
  --load-format safetensors --dtype bfloat16 --tokenizer-mode slow \
  --gpu-memory-utilization 0.9 --max-model-len 8192 > /tmp/vllm.log
- This command applies FP8 quantization so the original Qwen3-Coder-Next can be launched:
python -m vllm.entrypoints.api_server --model . --trust-remote-code --port 8000 --host 0.0.0.0 \
  --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 \
  --quantization fp8 --enforce-eager --swap-space 10
- This command launches the original Qwen3-Coder-Next with the OpenAI-compatible chat API enabled (a quick endpoint test follows below):
python -m vllm.entrypoints.openai.api_server --model . --trust-remote-code --port 8000 --host 100.102.39.213 \
  --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 \
  --quantization fp8 --enforce-eager --swap-space 10 --served-model-name qwen3
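Once the OpenAI-compatible server is up, it can be checked from another shell; a minimal sketch (host, port, and model name qwen3 follow the command above, adjust to your setup):
# List the models the server exposes
curl http://100.102.39.213:8000/v1/models
# Send a minimal chat completion request
curl http://100.102.39.213:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Write a Python hello world."}], "max_tokens": 64}'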
Reference configuration ① for serving with FP8 quantization:
nohup python -m vllm.entrypoints.openai.api_server \
--model /home/zhike/data/ModelScope/Qwen3-Coder-Next \
--trust-remote-code \
--port 8000 \
--host 100.102.39.213 \
--gpu-memory-utilization 0.92 \
--quantization fp8 \
--served-model-name Qwen/Qwen3-Coder-Next \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--max-model-len 65536 \
--enable-prefix-caching \
--max-num-seqs 8 \
--max-num-batched-tokens 65536 \
--block-size 32 \
--max-logprobs 20 \
--disable-log-stats > ~/logs/vllm/vllm_$(date +%Y%m%d_%H%M%S).log 2>&1 &
Parameter notes:
- --max-model-len 65536: drop the 128K context and use 64K instead (more realistic)
- --max-num-seqs 8: at least 4 concurrent sequences are needed; 8 is achievable in practice
- --block-size 32: raised to 32 (a Qwen-friendly value)
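After launching with nohup, a minimal sketch for confirming the server came up (log path and host follow the command above):
tail -f ~/logs/vllm/vllm_*.log             # watch until the server reports it is listening
curl http://100.102.39.213:8000/health     # OpenAI-compatible server health check
curl http://100.102.39.213:8000/v1/models  # should list Qwen/Qwen3-Coder-Next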
NVFP4 quantization issue
Conversion script:
import os
import ssl
# === Key point: force the official Hugging Face Hub endpoint ===
os.environ["HF_ENDPOINT"] = "https://huggingface.co"
# If SSL problems persist (rare), additionally disable verification (intranet environments only):
# os.environ["CURL_CA_BUNDLE"] = ""
# os.environ["REQUESTS_CA_BUNDLE"] = ""
# ssl._create_default_https_context = ssl._create_unverified_context
# === Regular imports ===
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
# 1. Configure paths
MODEL_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next"
OUTPUT_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4"
# 2. Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# 3. Prepare the calibration dataset -- this no longer errors out here
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.select(range(512))
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
dataset = dataset.map(preprocess)
def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048, truncation=True, add_special_tokens=False)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
# 4. Configure the NVFP4 quantization scheme
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
# 5. Run the one-shot quantization
oneshot(model=model, dataset=dataset, recipe=recipe, max_seq_length=2048)
# 6. Save the quantized model
model.save_pretrained(OUTPUT_PATH, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_PATH)
print(f"✅ NVFP4 model saved to: {OUTPUT_PATH}")
Error log:
(EngineCore pid=2958568) /bin/sh: 1: :/usr/local/cuda-13.0/bin/nvcc: not found
(EngineCore pid=2958568) ninja: build stopped: subcommand failed.
(EngineCore pid=2958568)
[rank0]:[W423 11:03:59.936941636 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2958399) Traceback (most recent call last):
(APIServer pid=2958399) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=2958399) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 710, in <module>
(APIServer pid=2958399) uvloop.run(run_server(args))
(APIServer pid=2958399) ~~~~~~~~~~^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2958399) return __asyncio.run(
(APIServer pid=2958399) ~~~~~~~~~~~~~^
(APIServer pid=2958399) wrapper(),
(APIServer pid=2958399) ^^^^^^^^^^
(APIServer pid=2958399) ...<2 lines>...
(APIServer pid=2958399) **run_kwargs
(APIServer pid=2958399) ^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 195, in run
(APIServer pid=2958399) return runner.run(main)
(APIServer pid=2958399) ~~~~~~~~~~^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 118, in run
(APIServer pid=2958399) return self._loop.run_until_complete(task)
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(APIServer pid=2958399) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2958399) return await main
(APIServer pid=2958399) ^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=2958399) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=2958399) async with build_async_engine_client(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) args,
(APIServer pid=2958399) ^^^^^
(APIServer pid=2958399) client_config=client_config,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ) as engine_client:
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399) return await anext(self.gen)
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2958399) async with build_async_engine_client_from_engine_args(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) engine_args,
(APIServer pid=2958399) ^^^^^^^^^^^^
(APIServer pid=2958399) usage_context=usage_context,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) client_config=client_config,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ) as engine:
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399) return await anext(self.gen)
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=2958399) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2958399) vllm_config=vllm_config,
(APIServer pid=2958399) ...<6 lines>...
(APIServer pid=2958399) client_index=client_index,
(APIServer pid=2958399) )
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=2958399) return cls(
(APIServer pid=2958399) vllm_config=vllm_config,
(APIServer pid=2958399) ...<9 lines>...
(APIServer pid=2958399) client_index=client_index,
(APIServer pid=2958399) )
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=2958399) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) vllm_config=vllm_config,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ...<4 lines>...
(APIServer pid=2958399) client_index=client_index,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399) return func(*args, **kwargs)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=2958399) return AsyncMPClient(*client_args)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399) return func(*args, **kwargs)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=2958399) super().__init__(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) asyncio_mode=True,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ...<3 lines>...
(APIServer pid=2958399) client_addresses=client_addresses,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=2958399) with launch_core_engines(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) vllm_config, executor_class, log_stats, addresses
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ) as (engine_manager, coordinator, addresses, tensor_queue):
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 148, in __exit__
(APIServer pid=2958399) next(self.gen)
(APIServer pid=2958399) ~~~~^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=2958399) wait_for_engine_startup(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) handshake_socket,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ...<6 lines>...
(APIServer pid=2958399) coordinator.proc if coordinator else None,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=2958399) raise RuntimeError(
(APIServer pid=2958399) ...<3 lines>...
(APIServer pid=2958399) )
(APIServer pid=2958399) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Command that was run:
python -m vllm.entrypoints.openai.api_server --model /home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4 \
  --trust-remote-code --port 8000 --host 100.102.39.213 --gpu-memory-utilization 0.90 \
  --served-model-name Qwen/Qwen3-Coder-Next --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --max-model-len 32768 --enable-prefix-caching --max-num-seqs 16 --max-num-batched-tokens 131072 \
  --block-size 64 --max-logprobs 20 --disable-log-stats --enforce-eager
Local checks:
(base) zhike@zhike:~/data$ which nvcc
/usr/local/cuda-13.0/bin/nvcc
(base) zhike@zhike:~/data$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0
(base) zhike@zhike:~/data$ nvidia-smi
Thu Apr 23 11:47:29 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02 Driver Version: 580.119.02 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:01:00.0 Off | Off |
| 30% 35C P8 19W / 600W | 91455MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3926 G /usr/lib/xorg/Xorg 75MiB |
| 0 N/A N/A 3014542 C VLLM::EngineCore 91352MiB |
+-----------------------------------------------------------------------------------------+
(base) zhike@zhike:~/data$ echo $PATH
/usr/local/cuda-13.0/bin:/home/zhike/.local/bin:/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-13.0/bin
(base) zhike@zhike:~/data$ echo 'export CUDA_HOME=/usr/local/cuda-13.0'
export CUDA_HOME=/usr/local/cuda-13.0
Solution:
The nvcc-not-found failure above (":/usr/local/cuda-13.0/bin/nvcc: not found") has a leading colon in the path, which means CUDA_HOME was assembled by appending to an empty variable, so the nvcc path built from it is invalid even though nvcc itself is on PATH. Reset the CUDA variables cleanly:
# 1. Clear any stale CUDA-related environment variables
unset CUDA_HOME NVCC CUDA_PATH
# 2. Explicitly set the correct paths (make sure there are no leading/trailing colons)
export CUDA_HOME=/usr/local/cuda-13.0
export NVCC=$CUDA_HOME/bin/nvcc
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# 3. Verify that nvcc can be invoked
$NVCC --version
rm -rf ~/.cache/flashinfer
# If vLLM's default cache directory is in use, it can also be cleared:
rm -rf ~/.cache/vllm
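With the variables and caches reset, re-check in the same shell before relaunching the server so the corrected environment is inherited (a minimal sketch):
echo "$CUDA_HOME"                  # must print /usr/local/cuda-13.0 with no stray colon
command -v nvcc && nvcc --version  # nvcc should resolve and report release 13.0
# then re-run the serve command above and watch the log for a successful engine start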