# [vllm] vLLM Model Startup

## The Complete Flow of vLLM Startup and Model Inspection

The following is the complete flow from vLLM startup to the point where the CUDA error occurs:
#### 1. Engine initialization

When vLLM starts, the `EngineCore` class is initialized [1]. This is the core component of the vLLM engine, responsible for coordinating the entire inference process.
#### 2. Model architecture inspection

vLLM needs to determine which model architecture to load. For the `Qwen3NextVLForConditionalGeneration` model, vLLM calls the `_try_inspect_model_cls()` function to inspect the model class [2].
#### 3. Subprocess safety check

To load the model class safely (avoiding import-time side effects), vLLM performs the model inspection in a subprocess [3]. This is implemented by the `_run_in_subprocess()` function.
#### 4. Model import chain

When the subprocess tries to import the model, the following import chain is triggered:

- `qwen3_next_vl.py` is imported, which imports
- `qwen3_next.py` [4], which in turn imports the Mamba-related components,
- until finally `expert_parallel.py` is imported
#### 5. CUDA initialization trigger

At line 84 of `expert_parallel.py`, module-level code queries the CUDA device properties:

```python
sm_major = torch.cuda.get_device_properties(dev).major
```

This immediately triggers CUDA initialization, even when the system has no CUDA GPU.
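The safe alternative is to defer the device query to first use. The sketch below shows the idea without depending on `torch`: `query_sm_major` is a hypothetical probe standing in for `torch.cuda.get_device_properties(dev).major`, and here it always fails, simulating a CPU-only host.

```python
# Sketch of deferring a device query from import time to first use.
# Importing this module is always safe; only callers that actually need
# the value hit the failure on a machine without a GPU.
import functools


def query_sm_major() -> int:
    # Hypothetical probe; simulates torch.cuda on a CPU-only host.
    raise RuntimeError("No CUDA GPUs are available")


@functools.lru_cache(maxsize=1)
def get_sm_major() -> int:
    # The probe runs lazily and the result is cached on success.
    return query_sm_major()


# Module import succeeds; only the call raises.
try:
    get_sm_major()
except RuntimeError as e:
    print(f"deferred failure: {e}")
```

Wrapping the probe in a cached getter preserves the "compute once" behavior of a module-level constant while keeping the import itself side-effect free.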
#### 6. The error occurs

Because the system has no usable CUDA GPU, `torch.cuda.get_device_properties()` raises `RuntimeError: No CUDA GPUs are available`.
#### 7. Error propagation

The error in the subprocess is caught, wrapped, and re-raised as a `RuntimeError` that includes the subprocess's full error output [3].
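The wrapping step can be reproduced in isolation. This sketch runs a throwaway child process that fails (standing in for the registry subprocess) and wraps its stderr with the same `raise ... from e` pattern; it is an illustration, not vLLM's code path.

```python
# Run a child process that fails, then wrap its stderr into a RuntimeError
# chained to the original CalledProcessError via `raise ... from e`.
import subprocess
import sys

returned = subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('No CUDA GPUs are available')"],
    capture_output=True,
)

err = None
try:
    try:
        returned.check_returncode()  # raises CalledProcessError on failure
    except Exception as e:
        raise RuntimeError(
            f"Error raised in subprocess:\n{returned.stderr.decode()}"
        ) from e
except RuntimeError as wrapped:
    err = wrapped

# The parent now sees the child's full traceback text in the message,
# with the original exception attached as __cause__.
print("No CUDA GPUs are available" in str(err))  # True
```

This is why the error message users see in the main process contains the child's complete Python traceback, including the original `RuntimeError` text.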
### Flow diagram
### Key issue

This flow exposes a design problem: some models (especially those containing MoE components) access CUDA at module import time rather than at first use. As a result, even in a CPU-only environment, merely importing the model definition fails.
### Notes

This failure path is specific to certain model architectures that need GPU information; not every vLLM model triggers CUDA initialization at import time. The error occurs during the model-inspection stage, a mechanism vLLM deliberately uses to determine model capabilities safely.
### Citations
File: vllm/v1/engine/core.py (L85-160)
```python
class EngineCore:
    """Inner loop of vLLM's Engine."""

    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        executor_fail_callback: Callable | None = None,
        include_finished_set: bool = False,
    ):
        # plugins need to be loaded at the engine/scheduler level too
        from vllm.plugins import load_general_plugins

        load_general_plugins()

        self.vllm_config = vllm_config
        if not vllm_config.parallel_config.data_parallel_rank_local:
            logger.info(
                "Initializing a V1 LLM engine (v%s) with config: %s",
                VLLM_VERSION,
                vllm_config,
            )

        self.log_stats = log_stats

        # Setup Model.
        self.model_executor = executor_class(vllm_config)
        if executor_fail_callback is not None:
            self.model_executor.register_failure_callback(executor_fail_callback)

        self.available_gpu_memory_for_kv_cache = -1

        if envs.VLLM_ELASTIC_EP_SCALE_UP_LAUNCH:
            self._eep_scale_up_before_kv_init()

        # Setup KV Caches and update CacheConfig after profiling.
        num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
            vllm_config
        )

        if kv_cache_config.kv_cache_groups:
            vllm_config.cache_config.block_size = min(
                g.kv_cache_spec.block_size for g in kv_cache_config.kv_cache_groups
            )
            vllm_config.validate_block_size()

        vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks
        vllm_config.cache_config.num_cpu_blocks = num_cpu_blocks
        self.collective_rpc("initialize_cache", args=(num_gpu_blocks, num_cpu_blocks))

        self.structured_output_manager = StructuredOutputManager(vllm_config)

        # Setup scheduler.
        Scheduler = vllm_config.scheduler_config.get_scheduler_cls()
        if len(kv_cache_config.kv_cache_groups) == 0:  # noqa: SIM102
            # Encoder models without KV cache don't support
            # chunked prefill. But do SSM models?
            if vllm_config.scheduler_config.enable_chunked_prefill:
                logger.warning("Disabling chunked prefill for model without KVCache")
                vllm_config.scheduler_config.enable_chunked_prefill = False

        scheduler_block_size = (
            vllm_config.cache_config.block_size
            * vllm_config.parallel_config.decode_context_parallel_size
            * vllm_config.parallel_config.prefill_context_parallel_size
        )

        self.scheduler: SchedulerInterface = Scheduler(
            vllm_config=vllm_config,
            kv_cache_config=kv_cache_config,
            structured_output_manager=self.structured_output_manager,
            include_finished_set=include_finished_set,
            log_stats=self.log_stats,
            block_size=scheduler_block_size,
        )

        self.use_spec_decode = vllm_config.speculative_config is not None
```
## Detailed Flow of Importing a Model in a Subprocess

### Why a subprocess is used

vLLM imports models in a subprocess mainly to isolate import-time side effects. When imported, some models:

- initialize CUDA
- load large libraries
- execute module-level code

Importing in a subprocess keeps these side effects away from the main process and, in particular, prevents CUDA re-initialization errors [1].
### Subprocess creation and execution flow

#### 1. Model inspection is triggered

When vLLM needs to inspect the `Qwen3NextVLForConditionalGeneration` model, it calls `_try_inspect_model_cls()` [2].
#### 2. Subprocess creation

The `_run_in_subprocess()` function is responsible for creating the subprocess:
```python
def _run_in_subprocess(fn: Callable[[], _T]) -> _T:
    # Create a temporary directory to store the result
    with tempfile.TemporaryDirectory() as tempdir:
        output_filepath = os.path.join(tempdir, "registry_output.tmp")

        # Serialize the function and arguments with cloudpickle
        import cloudpickle

        input_bytes = cloudpickle.dumps((fn, output_filepath))

        # Run the subprocess: python -m vllm.model_executor.models.registry
        returned = subprocess.run(
            _SUBPROCESS_COMMAND, input=input_bytes, capture_output=True
        )
```
where `_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]` [3].
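The whole round trip can be sketched with only the standard library. This is a simplified stand-in: vLLM uses `cloudpickle` so it can ship lambdas to the child, whereas plain `pickle` here limits us to simple data, so this hypothetical child recomputes its result from a module name instead of executing a pickled function.

```python
# Stdlib-only sketch of the _run_in_subprocess() round trip: pickle the
# work description to the child's stdin, let the child write its pickled
# result to a temp file, then read it back in the parent.
import os
import pickle
import subprocess
import sys
import tempfile

CHILD = r"""
import importlib, pickle, sys
module_name, output_file = pickle.loads(sys.stdin.buffer.read())
mod = importlib.import_module(module_name)  # the import happens in the child
result = sorted(n for n in dir(mod) if not n.startswith("_"))[:3]
with open(output_file, "wb") as f:
    f.write(pickle.dumps(result))
"""


def run_in_subprocess(module_name: str):
    with tempfile.TemporaryDirectory() as tempdir:
        output_filepath = os.path.join(tempdir, "registry_output.tmp")
        input_bytes = pickle.dumps((module_name, output_filepath))
        returned = subprocess.run(
            [sys.executable, "-c", CHILD], input=input_bytes, capture_output=True
        )
        returned.check_returncode()
        with open(output_filepath, "rb") as f:
            return pickle.load(f)


print(run_in_subprocess("json"))  # ['JSONDecodeError', 'JSONDecoder', 'JSONEncoder']
```

Any CUDA initialization (or crash) triggered by the import stays confined to the child; the parent only ever unpickles a small result file.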
#### 3. Model import inside the subprocess

The subprocess executes the `_run()` function [4]:
```python
def _run() -> None:
    # Load plugins
    from vllm.plugins import load_general_plugins

    load_general_plugins()

    # Deserialize the function to execute
    fn, output_file = pickle.loads(sys.stdin.buffer.read())

    # Execute the function (here: the model inspection)
    result = fn()

    # Save the result
    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))
```
#### 4. The concrete import chain

The function executed in the subprocess is:

```python
lambda: _ModelInfo.from_model_cls(self.load_model_cls())
```
This triggers the following import chain:

- `load_model_cls()` calls `importlib.import_module(self.module_name)` [5], importing `vllm.model_executor.models.qwen3_next_vl`
- `qwen3_next_vl.py` (line 48) imports the `qwen3_next` module
- `qwen3_next.py` (line 65) imports the Mamba components [6]
- finally `expert_parallel.py` is imported, and its line 84 triggers CUDA initialization
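The first step of that chain, resolving a dotted module path plus class name into a class object, is plain `importlib`. The sketch below mirrors the shape of `load_model_cls()`; the stdlib pair `("collections", "OrderedDict")` is just a stand-in for a vLLM model entry.

```python
# Sketch of resolving (module_name, class_name) strings into a class.
# importlib.import_module runs the module's top-level code, which is
# exactly where import-time side effects (like CUDA init) would fire.
import importlib


def load_model_cls(module_name: str, class_name: str) -> type:
    mod = importlib.import_module(module_name)  # side effects happen here
    return getattr(mod, class_name)


cls = load_model_cls("collections", "OrderedDict")
print(cls.__name__)  # OrderedDict
```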
### Error handling and propagation

If the subprocess fails, the error is caught and wrapped:

```python
try:
    returned.check_returncode()
except Exception as e:
    raise RuntimeError(
        f"Error raised in subprocess:\n{returned.stderr.decode()}"
    ) from e
```
This way the main process receives the full error information [7].
### Notes

The subprocess design adds complexity, but it is necessary: it lets vLLM probe model capabilities safely, without one model's import problems bringing down the whole system. This isolation mechanism is one of the key designs that allows vLLM to support hundreds of model architectures.
### Citations
File: vllm/model_executor/models/registry.py (L620-620)
```python
_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]
```
File: vllm/model_executor/models/registry.py (L830-832)
```python
def load_model_cls(self) -> type[nn.Module]:
    mod = importlib.import_module(self.module_name)
    return getattr(mod, self.class_name)
```
File: vllm/model_executor/models/registry.py (L851-859)
```python
def _try_inspect_model_cls(
    model_arch: str,
    model: _BaseRegisteredModel,
) -> _ModelInfo | None:
    try:
        return model.inspect_model_cls()
    except Exception:
        logger.exception("Error in inspecting model architecture '%s'", model_arch)
        return None
```
File: vllm/model_executor/models/registry.py (L1263-1290)
```python
def _run_in_subprocess(fn: Callable[[], _T]) -> _T:
    # NOTE: We use a temporary directory instead of a temporary file to avoid
    # issues like https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file
    with tempfile.TemporaryDirectory() as tempdir:
        output_filepath = os.path.join(tempdir, "registry_output.tmp")

        # `cloudpickle` allows pickling lambda functions directly
        import cloudpickle

        input_bytes = cloudpickle.dumps((fn, output_filepath))

        # cannot use `sys.executable __file__` here because the script
        # contains relative imports
        returned = subprocess.run(
            _SUBPROCESS_COMMAND, input=input_bytes, capture_output=True
        )

        # check if the subprocess is successful
        try:
            returned.check_returncode()
        except Exception as e:
            # wrap raised exception to provide more information
            raise RuntimeError(
                f"Error raised in subprocess:\n{returned.stderr.decode()}"
            ) from e

        with open(output_filepath, "rb") as f:
            return pickle.load(f)
```
File: vllm/model_executor/models/registry.py (L1293-1308)
```python
def _run() -> None:
    # Setup plugins
    from vllm.plugins import load_general_plugins

    load_general_plugins()

    fn, output_file = pickle.loads(sys.stdin.buffer.read())
    result = fn()
    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))


if __name__ == "__main__":
    _run()
```
What is the complete process of importing a model? When is the subprocess created, and how does the call reach `inspect_model_cls`? Provide a sequence diagram.
## Complete vLLM Model Import Flow

### Sequence diagram
### Detailed walkthrough

#### 1. Model registration

When vLLM starts, `ModelRegistry` creates a `_LazyRegisteredModel` instance for every supported model architecture [1]. These instances hold the module name and class name but do not import anything yet.

#### 2. Model inspection is triggered

When vLLM needs to know a model's capabilities (e.g. at startup or during configuration validation), it calls `_try_inspect_model_cls()` [2].

#### 3. Subprocess creation

`_LazyRegisteredModel.inspect_model_cls()` first checks the cache [3]. On a cache miss, it calls `_run_in_subprocess()` [4].
The subprocess is created with the following command [5]:

```python
_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]
```

#### 4. Subprocess execution

The subprocess executes the `_run()` function [6]:

1. Load plugins
2. Deserialize the function to execute
3. Execute the function (the model inspection)
4. Save the result to a temporary file

#### 5. Model import chain

The function executed in the subprocess is:

```python
lambda: _ModelInfo.from_model_cls(self.load_model_cls())
```
The `load_model_cls()` method uses `importlib.import_module()` to import the module [7], triggering the import chain:

- `vllm.model_executor.models.qwen3_next_vl`
- `qwen3_next.py`, which imports the Mamba components [8]
- `expert_parallel.py`, which calls CUDA functions at module level
#### 6. Error handling

If the subprocess fails, the error is wrapped and propagated to the main process [9].
### Key design decisions

- **Lazy loading**: a model is only imported when it is actually needed.
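The lazy-registration idea behind `_LazyRegisteredModel` can be sketched with the standard library. This is an illustrative stand-in, not vLLM's class: stdlib names replace the vLLM model table, and `load()` plays the role of `load_model_cls()`.

```python
# Sketch of lazy registration: the registry stores only string pairs
# (module_name, class_name); nothing is imported while the registry is
# built. The import happens on the first call to load().
import importlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyRegisteredModel:
    module_name: str
    class_name: str

    def load(self) -> type:
        mod = importlib.import_module(self.module_name)
        return getattr(mod, self.class_name)


# Building the registry is cheap and side-effect free.
REGISTRY = {
    "Decimal": LazyRegisteredModel("decimal", "Decimal"),
    "Fraction": LazyRegisteredModel("fractions", "Fraction"),
}

cls = REGISTRY["Fraction"].load()  # the import happens only here
print(cls(1, 2))  # 1/2
```

Keeping the registry as plain strings is what makes it possible to list hundreds of architectures without paying any of their import costs (or import failures) up front.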
### Citations
File: vllm/model_executor/models/registry.py (L620-620)
```python
_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]
```
File: vllm/model_executor/models/registry.py (L793-828)
```python
def inspect_model_cls(self) -> _ModelInfo:
    model_path = Path(__file__).parent / f"{self.module_name.split('.')[-1]}.py"
    module_hash = None
    if model_path.exists():
        with open(model_path, "rb") as f:
            module_hash = safe_hash(f.read(), usedforsecurity=False).hexdigest()
        mi = self._load_modelinfo_from_cache(module_hash)
        if mi is not None:
            logger.debug(
                "Loaded model info for class %s.%s from cache",
                self.module_name,
                self.class_name,
            )
            return mi
        else:
            logger.debug(
                "Cache model info for class %s.%s miss. Loading model instead.",
                self.module_name,
                self.class_name,
            )

    # Performed in another process to avoid initializing CUDA
    mi = _run_in_subprocess(
        lambda: _ModelInfo.from_model_cls(self.load_model_cls())
    )
    logger.debug(
        "Loaded model info for class %s.%s", self.module_name, self.class_name
    )

    # save cache file
    if module_hash is not None:
        self._save_modelinfo_to_cache(mi, module_hash)

    return mi
```
File: vllm/model_executor/models/registry.py (L830-832)
```python
def load_model_cls(self) -> type[nn.Module]:
    mod = importlib.import_module(self.module_name)
    return getattr(mod, self.class_name)
```
File: vllm/model_executor/models/registry.py (L851-859)
```python
def _try_inspect_model_cls(
    model_arch: str,
    model: _BaseRegisteredModel,
) -> _ModelInfo | None:
    try:
        return model.inspect_model_cls()
    except Exception:
        logger.exception("Error in inspecting model architecture '%s'", model_arch)
        return None
```
File: vllm/model_executor/models/registry.py (L1250-1258)
```python
ModelRegistry = _ModelRegistry(
    {
        model_arch: _LazyRegisteredModel(
            module_name=f"vllm.model_executor.models.{mod_relname}",
            class_name=cls_name,
        )
        for model_arch, (mod_relname, cls_name) in _VLLM_MODELS.items()
    }
)
```
File: vllm/model_executor/models/registry.py (L1281-1287)
```python
try:
    returned.check_returncode()
except Exception as e:
    # wrap raised exception to provide more information
    raise RuntimeError(
        f"Error raised in subprocess:\n{returned.stderr.decode()}"
    ) from e
```
File: vllm/model_executor/models/registry.py (L1293-1308)
```python
def _run() -> None:
    # Setup plugins
    from vllm.plugins import load_general_plugins

    load_general_plugins()

    fn, output_file = pickle.loads(sys.stdin.buffer.read())
    result = fn()
    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))


if __name__ == "__main__":
    _run()
```