Deploying vLLM + Qwen3-Coder-Next on a Local Server
First, my server configuration:
- GPU: NVIDIA RTX PRO 6000 Blackwell Edition (high-end professional workstation card)
- VRAM: 97887 MiB total (about 95.6 GB); 87788 MiB (about 85.7 GB) currently in use, roughly 9.9 GB free
- Driver / compute: NVIDIA driver 580.119.02, CUDA 13.0 (compatible with recent deep-learning frameworks)
nvidia-smi
Thu Mar 12 20:23:09 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02 Driver Version: 580.119.02 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:01:00.0 Off | Off |
| 30% 36C P8 19W / 600W | 87788MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 57161 C VLLM::EngineCore 87728MiB |
| 0 N/A N/A 140517 G /usr/lib/xorg/Xorg 40MiB |
+-----------------------------------------------------------------------------------------+
Step 1: Set up the virtual environment
- The system's original Python environment was Python 3.13.3
- Install Miniconda3 to prepare the runtime environment
- Configure the system mirror sources (Huawei's internal mirrors are used here, because this is an intranet environment with restrictions on which mirrors are reachable)
sudo sed -i "s@http://.*security.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources
sudo sed -i "s@http://.*archive.ubuntu.com@http://mirrors.tools.huawei.com@g" ubuntu.sources
pip config set global.trusted-host mirrors.tools.huawei.com
pip config set global.index-url http://mirrors.tools.huawei.com/pypi/simple/
Find the .condarc file under the Miniconda3 installation directory and configure the mirror sources there as well:
channels:
  - defaults
show_channel_urls: true
default_channels:
  - main
channel_alias: http://conda.rnd.huawei.com/repository/conda-proxy
channel_priority: strict
envs_dirs:
  - /opt/miniconda3/envs
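To confirm the mirror configuration took effect, a quick check like the following can be used (a minimal sketch):
conda config --show channels
conda config --show channel_alias
pip config list   # should show the mirrors.tools.huawei.com index configured earlier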
Once the environment is configured, create a virtual environment with an appropriate Python version:
conda create --name VLLM python=3.13 -y   # create the virtual environment
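Before installing anything into it, a quick check that the new environment is active (a minimal sketch; VLLM is the environment name created above):
conda activate VLLM
python --version   # should report Python 3.13.x
which python       # should resolve inside the VLLM environment, e.g. under /opt/miniconda3/envs/VLLM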
Step 2: Install the prerequisites for vLLM
- sudo apt update (can be skipped if already up to date)
- sudo apt install -y build-essential dkms (can be skipped if already installed)
- sudo update-initramfs -u
- Install the NVIDIA driver; if running nvidia-smi prints a table like the one above, the driver is working
- Download the CUDA toolkit matching the CUDA version reported by nvidia-smi from https://developer.nvidia.com/cuda-toolkit-archive, choosing the build for your system. In my case nvidia-smi reports CUDA 13.0, so the installer is cuda_13.0.0_580.65.06_linux.run, which I downloaded to /home/zhike/data/program
- sudo chmod a+x cuda_13.0.0_580.65.06_linux.run
- sudo bash cuda_13.0.0_580.65.06_linux.run
- Configure the environment variables (note: CUDA_HOME must be a single path; appending with a colon, e.g. $CUDA_HOME:/usr/local/cuda-13.0, is what produces the ":/usr/local/cuda-13.0/bin/nvcc: not found" error described later)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-13.0/lib64
export PATH=$PATH:/usr/local/cuda-13.0/bin
export CUDA_HOME=/usr/local/cuda-13.0
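After reloading the shell configuration (source ~/.bashrc), a quick sanity check that the toolkit is visible (a minimal sketch):
nvcc --version     # should report release 13.0
echo "$CUDA_HOME"  # should print exactly /usr/local/cuda-13.0, with no stray colon
which nvcc         # should resolve to /usr/local/cuda-13.0/bin/nvcc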
Step 3: Install PyTorch
- Download the PyTorch wheel matching your CUDA and Python versions from https://download.pytorch.org/whl/torch/
- source ~/.bashrc
- conda activate VLLM — once inside the virtual environment, install the torch wheel
- pip install torch-2.10.0+cu130-cp313-cp313-manylinux_2_28_x86_64.whl
- Once that succeeds, run pip install vllm
- After installation, run vllm --version; if it prints a version number, the installation succeeded (a quick end-to-end check is sketched below)
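A slightly fuller check that the CUDA-enabled torch wheel and vLLM line up (a minimal sketch, run inside the VLLM environment):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"   # should show the cu130 build and True
vllm --version   # should print the installed vLLM version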
Download the Qwen3-Coder-Next model
- Download Qwen3-Coder-Next into the directory where you want to keep it
- Download it with Hugging Face, ModelScope, or git (installing vLLM pulls in the Hugging Face hub client); a hedged example follows this list
- Make sure you end up with the actual weight files, not Git LFS pointer files
- Then cd into the model directory and run the launch command
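A minimal download sketch, assuming the repository id is Qwen/Qwen3-Coder-Next (the same name used later for --served-model-name) and the target path matches the one used in the launch commands below; adjust both to your environment:
# Option 1: Hugging Face hub CLI (pulled in alongside vLLM)
huggingface-cli download Qwen/Qwen3-Coder-Next --local-dir /home/zhike/data/ModelScope/Qwen3-Coder-Next
# Option 2: git clone the model repository, then run "git lfs pull" inside it so LFS pointer files
# are replaced by the real weight files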
Step 4: Launch
- This command launches an already-quantized Qwen3-Coder-Next, but the original (unquantized) weights will not start this way:
python -m vllm.entrypoints.api_server --model . --trust-remote-code --port 8000 --host 0.0.0.0 \
  --load-format safetensors --dtype bfloat16 --tokenizer-mode slow \
  --gpu-memory-utilization 0.9 --max-model-len 8192 > /tmp/vllm.log
- This command applies FP8 quantization so the original Qwen3-Coder-Next can be launched:
python -m vllm.entrypoints.api_server --model . --trust-remote-code --port 8000 --host 0.0.0.0 \
  --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 \
  --quantization fp8 --enforce-eager --swap-space 10
- This command launches the original Qwen3-Coder-Next with the OpenAI-compatible chat API enabled (a quick endpoint test follows below):
python -m vllm.entrypoints.openai.api_server --model . --trust-remote-code --port 8000 --host 100.102.39.213 \
  --dtype float16 --gpu-memory-utilization 0.85 --max-model-len 2048 \
  --quantization fp8 --enforce-eager --swap-space 10 --served-model-name qwen3
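Once the OpenAI-compatible server is up, it can be checked from another shell; a minimal sketch (host, port, and model name qwen3 follow the command above, adjust to your setup):
# List the models the server exposes
curl http://100.102.39.213:8000/v1/models
# Send a minimal chat completion request
curl http://100.102.39.213:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Write a Python hello world."}], "max_tokens": 64}'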
Reference configuration ① for serving with FP8 quantization:
nohup python -m vllm.entrypoints.openai.api_server \
--model /home/zhike/data/ModelScope/Qwen3-Coder-Next \
--trust-remote-code \
--port 8000 \
--host 100.102.39.213 \
--gpu-memory-utilization 0.92 \
--quantization fp8 \
--served-model-name Qwen/Qwen3-Coder-Next \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--max-model-len 65536 \
--enable-prefix-caching \
--max-num-seqs 8 \
--max-num-batched-tokens 65536 \
--block-size 32 \
--max-logprobs 20 \
--disable-log-stats > ~/logs/vllm/vllm_$(date +%Y%m%d_%H%M%S).log 2>&1 &
Parameter notes:
- --max-model-len 65536: drop the 128K context and use 64K instead (more realistic)
- --max-num-seqs 8: at least 4 concurrent sequences are needed; 8 is achievable in practice
- --block-size 32: raised to 32 (a Qwen-friendly value)
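After launching with nohup, a minimal sketch for confirming the server came up (log path and host follow the command above):
tail -f ~/logs/vllm/vllm_*.log             # watch until the server reports it is listening
curl http://100.102.39.213:8000/health     # OpenAI-compatible server health check
curl http://100.102.39.213:8000/v1/models  # should list Qwen/Qwen3-Coder-Next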
NVFP4 quantization issue
Conversion script:
import os
import ssl
# === Key point: force the official Hugging Face Hub endpoint ===
os.environ["HF_ENDPOINT"] = "https://huggingface.co"
# If SSL problems persist (rare), additionally disable verification (intranet environments only):
# os.environ["CURL_CA_BUNDLE"] = ""
# os.environ["REQUESTS_CA_BUNDLE"] = ""
# ssl._create_default_https_context = ssl._create_unverified_context
# === Regular imports ===
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
# 1. Configure paths
MODEL_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next"
OUTPUT_PATH = "/home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4"
# 2. Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
# 3. Prepare the calibration dataset -- this no longer errors out here
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.select(range(512))
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
dataset = dataset.map(preprocess)
def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048, truncation=True, add_special_tokens=False)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
# 4. Configure the NVFP4 quantization scheme
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
# 5. Run the one-shot quantization
oneshot(model=model, dataset=dataset, recipe=recipe, max_seq_length=2048)
# 6. Save the quantized model
model.save_pretrained(OUTPUT_PATH, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_PATH)
print(f"✅ NVFP4 model saved to: {OUTPUT_PATH}")
Error log:
(EngineCore pid=2958568) /bin/sh: 1: :/usr/local/cuda-13.0/bin/nvcc: not found
(EngineCore pid=2958568) ninja: build stopped: subcommand failed.
(EngineCore pid=2958568)
[rank0]:[W423 11:03:59.936941636 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2958399) Traceback (most recent call last):
(APIServer pid=2958399) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=2958399) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 710, in <module>
(APIServer pid=2958399) uvloop.run(run_server(args))
(APIServer pid=2958399) ~~~~~~~~~~^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2958399) return __asyncio.run(
(APIServer pid=2958399) ~~~~~~~~~~~~~^
(APIServer pid=2958399) wrapper(),
(APIServer pid=2958399) ^^^^^^^^^^
(APIServer pid=2958399) ...<2 lines>...
(APIServer pid=2958399) **run_kwargs
(APIServer pid=2958399) ^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 195, in run
(APIServer pid=2958399) return runner.run(main)
(APIServer pid=2958399) ~~~~~~~~~~^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/asyncio/runners.py", line 118, in run
(APIServer pid=2958399) return self._loop.run_until_complete(task)
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(APIServer pid=2958399) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2958399) return await main
(APIServer pid=2958399) ^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=2958399) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=2958399) async with build_async_engine_client(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) args,
(APIServer pid=2958399) ^^^^^
(APIServer pid=2958399) client_config=client_config,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ) as engine_client:
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399) return await anext(self.gen)
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2958399) async with build_async_engine_client_from_engine_args(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) engine_args,
(APIServer pid=2958399) ^^^^^^^^^^^^
(APIServer pid=2958399) usage_context=usage_context,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) client_config=client_config,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ) as engine:
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 214, in __aenter__
(APIServer pid=2958399) return await anext(self.gen)
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=2958399) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2958399) vllm_config=vllm_config,
(APIServer pid=2958399) ...<6 lines>...
(APIServer pid=2958399) client_index=client_index,
(APIServer pid=2958399) )
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=2958399) return cls(
(APIServer pid=2958399) vllm_config=vllm_config,
(APIServer pid=2958399) ...<9 lines>...
(APIServer pid=2958399) client_index=client_index,
(APIServer pid=2958399) )
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=2958399) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) vllm_config=vllm_config,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ...<4 lines>...
(APIServer pid=2958399) client_index=client_index,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399) return func(*args, **kwargs)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=2958399) return AsyncMPClient(*client_args)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2958399) return func(*args, **kwargs)
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=2958399) super().__init__(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) asyncio_mode=True,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ...<3 lines>...
(APIServer pid=2958399) client_addresses=client_addresses,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=2958399) with launch_core_engines(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) vllm_config, executor_class, log_stats, addresses
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ) as (engine_manager, coordinator, addresses, tensor_queue):
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/contextlib.py", line 148, in __exit__
(APIServer pid=2958399) next(self.gen)
(APIServer pid=2958399) ~~~~^^^^^^^^^^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=2958399) wait_for_engine_startup(
(APIServer pid=2958399) ~~~~~~~~~~~~~~~~~~~~~~~^
(APIServer pid=2958399) handshake_socket,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) ...<6 lines>...
(APIServer pid=2958399) coordinator.proc if coordinator else None,
(APIServer pid=2958399) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2958399) )
(APIServer pid=2958399) ^
(APIServer pid=2958399) File "/home/zhike/.conda/envs/AI_env/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=2958399) raise RuntimeError(
(APIServer pid=2958399) ...<3 lines>...
(APIServer pid=2958399) )
(APIServer pid=2958399) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Command that was run:
python -m vllm.entrypoints.openai.api_server --model /home/zhike/data/ModelScope/Qwen3-Coder-Next-NVFP4 \
  --trust-remote-code --port 8000 --host 100.102.39.213 --gpu-memory-utilization 0.90 \
  --served-model-name Qwen/Qwen3-Coder-Next --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --max-model-len 32768 --enable-prefix-caching --max-num-seqs 16 --max-num-batched-tokens 131072 \
  --block-size 64 --max-logprobs 20 --disable-log-stats --enforce-eager
Local checks:
(base) zhike@zhike:~/data$ which nvcc
/usr/local/cuda-13.0/bin/nvcc
(base) zhike@zhike:~/data$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0
(base) zhike@zhike:~/data$ nvidia-smi
Thu Apr 23 11:47:29 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.119.02 Driver Version: 580.119.02 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:01:00.0 Off | Off |
| 30% 35C P8 19W / 600W | 91455MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3926 G /usr/lib/xorg/Xorg 75MiB |
| 0 N/A N/A 3014542 C VLLM::EngineCore 91352MiB |
+-----------------------------------------------------------------------------------------+
(base) zhike@zhike:~/data$ echo $PATH
/usr/local/cuda-13.0/bin:/home/zhike/.local/bin:/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-13.0/bin
(base) zhike@zhike:~/data$ echo 'export CUDA_HOME=/usr/local/cuda-13.0'
export CUDA_HOME=/usr/local/cuda-13.0
Solution:
The nvcc-not-found failure above (":/usr/local/cuda-13.0/bin/nvcc: not found") has a leading colon in the path, which means CUDA_HOME was assembled by appending to an empty variable, so the nvcc path built from it is invalid even though nvcc itself is on PATH. Reset the CUDA variables cleanly:
# 1. Clear any stale CUDA-related environment variables
unset CUDA_HOME NVCC CUDA_PATH
# 2. Explicitly set the correct paths (make sure there are no leading/trailing colons)
export CUDA_HOME=/usr/local/cuda-13.0
export NVCC=$CUDA_HOME/bin/nvcc
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# 3. Verify that nvcc can be invoked
$NVCC --version
rm -rf ~/.cache/flashinfer
# If vLLM's default cache directory is in use, it can also be cleared:
rm -rf ~/.cache/vllm
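With the variables and caches reset, re-check in the same shell before relaunching the server so the corrected environment is inherited (a minimal sketch):
echo "$CUDA_HOME"                  # must print /usr/local/cuda-13.0 with no stray colon
command -v nvcc && nvcc --version  # nvcc should resolve and report release 13.0
# then re-run the serve command above and watch the log for a successful engine start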