The Complete Flow of vLLM Startup and Model Inspection

Below is the complete flow from vLLM startup to the point where the CUDA error occurs:

1. Engine Initialization

When vLLM starts, the EngineCore class is initialized [1]. It is the core component of the vLLM engine and coordinates the entire inference process.

2. Model Architecture Inspection

vLLM must determine which model architecture to load. For the Qwen3NextVLForConditionalGeneration model, it calls the _try_inspect_model_cls() function to inspect the model class [2].

3. Safe Inspection in a Subprocess

To load the model class safely (avoiding import-time side effects), vLLM performs the inspection in a subprocess [3]. This is implemented by the _run_in_subprocess() function.

4. Model Import Chain

When the subprocess tries to import the model, the following import chain is triggered:

  1. qwen3_next_vl.py is imported
  2. which imports qwen3_next.py [4]
  3. which imports the mamba components
  4. which finally imports expert_parallel.py

5. CUDA Initialization Is Triggered

At line 84 of expert_parallel.py, module-level code attempts to query the CUDA device properties:

sm_major = torch.cuda.get_device_properties(dev).major

This immediately triggers CUDA initialization, even on a system without any CUDA GPU.
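A common mitigation, sketched below with a stand-in probe (torch and a GPU are deliberately not assumed, so the example runs anywhere), is to defer the query to first use and cache the result instead of running it at import time:

```python
# Minimal sketch of deferred device probing; `probe` is a stand-in for
# torch.cuda.get_device_properties(dev).major so no GPU is required.
_SM_MAJOR = None

def sm_major(probe=lambda: 9):
    """Return the SM major version, probing lazily on first call."""
    global _SM_MAJOR
    if _SM_MAJOR is None:
        _SM_MAJOR = probe()  # executed on first use, not at import time
    return _SM_MAJOR
```

Because the probe runs inside a function, importing the module never touches CUDA; a CPU-only environment fails (or can fall back) only if the value is actually requested.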

6. The Error Occurs

Because no CUDA GPU is available, torch.cuda.get_device_properties() raises RuntimeError: No CUDA GPUs are available.

7. Error Propagation

The error in the subprocess is caught, wrapped, and re-raised as a RuntimeError carrying the subprocess's full error output [3].

Flow Chart

vLLM startup
  → EngineCore initialization
  → model architecture inspection
  → subprocess creation
  → import Qwen3NextVLForConditionalGeneration
  → import qwen3_next.py
  → import expert_parallel.py
  → call torch.cuda.get_device_properties
  → CUDA initialization fails
  → RuntimeError: No CUDA GPUs are available
  → error propagates to the main process

The Key Problem

This flow exposes a design problem: some models (in particular those with MoE components) access CUDA at module import time rather than at first use. As a result, merely importing the model definition fails in a CPU-only environment.
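The failure mode is easy to reproduce without vLLM or torch: any module that raises at top level makes a bare import fail. The sketch below fabricates such a module (fake_expert_parallel.py is a made-up name) and imports it in a subprocess:

```python
import os
import subprocess
import sys
import tempfile

# Write a module whose top-level code fails, mimicking an import-time
# hardware probe on a machine without that hardware.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "fake_expert_parallel.py"), "w") as f:
        f.write('raise RuntimeError("No CUDA GPUs are available")\n')
    r = subprocess.run(
        [sys.executable, "-c", "import fake_expert_parallel"],
        cwd=d, capture_output=True, text=True,
    )
# The bare import alone fails; r.stderr carries the module's traceback.
```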

Notes

This failure mode is specific to architectures that need GPU information at import time; not every vLLM model triggers CUDA initialization on import. The error occurs during the model inspection stage, a mechanism vLLM deliberately uses to determine model capabilities safely.

Citations

File: vllm/v1/engine/core.py (L85-160)

class EngineCore:
    """Inner loop of vLLM's Engine."""

    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        executor_fail_callback: Callable | None = None,
        include_finished_set: bool = False,
    ):
        # plugins need to be loaded at the engine/scheduler level too
        from vllm.plugins import load_general_plugins

        load_general_plugins()

        self.vllm_config = vllm_config
        if not vllm_config.parallel_config.data_parallel_rank_local:
            logger.info(
                "Initializing a V1 LLM engine (v%s) with config: %s",
                VLLM_VERSION,
                vllm_config,
            )

        self.log_stats = log_stats

        # Setup Model.
        self.model_executor = executor_class(vllm_config)
        if executor_fail_callback is not None:
            self.model_executor.register_failure_callback(executor_fail_callback)

        self.available_gpu_memory_for_kv_cache = -1

        if envs.VLLM_ELASTIC_EP_SCALE_UP_LAUNCH:
            self._eep_scale_up_before_kv_init()

        # Setup KV Caches and update CacheConfig after profiling.
        num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
            vllm_config
        )
        if kv_cache_config.kv_cache_groups:
            vllm_config.cache_config.block_size = min(
                g.kv_cache_spec.block_size for g in kv_cache_config.kv_cache_groups
            )
        vllm_config.validate_block_size()
        vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks
        vllm_config.cache_config.num_cpu_blocks = num_cpu_blocks
        self.collective_rpc("initialize_cache", args=(num_gpu_blocks, num_cpu_blocks))

        self.structured_output_manager = StructuredOutputManager(vllm_config)

        # Setup scheduler.
        Scheduler = vllm_config.scheduler_config.get_scheduler_cls()

        if len(kv_cache_config.kv_cache_groups) == 0:  # noqa: SIM102
            # Encoder models without KV cache don't support
            # chunked prefill. But do SSM models?
            if vllm_config.scheduler_config.enable_chunked_prefill:
                logger.warning("Disabling chunked prefill for model without KVCache")
                vllm_config.scheduler_config.enable_chunked_prefill = False

        scheduler_block_size = (
            vllm_config.cache_config.block_size
            * vllm_config.parallel_config.decode_context_parallel_size
            * vllm_config.parallel_config.prefill_context_parallel_size
        )

        self.scheduler: SchedulerInterface = Scheduler(
            vllm_config=vllm_config,
            kv_cache_config=kv_cache_config,
            structured_output_manager=self.structured_output_manager,
            include_finished_set=include_finished_set,
            log_stats=self.log_stats,
            block_size=scheduler_block_size,
        )
        self.use_spec_decode = vllm_config.speculative_config is not None

Detailed Flow of Importing the Model in a Subprocess

Why a Subprocess?

vLLM imports the model in a subprocess mainly to isolate import-time side effects. On import, some models:

  • initialize CUDA
  • load heavyweight libraries
  • execute global code

Importing in a subprocess keeps these side effects away from the main process, in particular preventing CUDA re-initialization errors [1].

Subprocess Creation and Execution Flow

1. Inspection Is Triggered

When vLLM needs to inspect the Qwen3NextVLForConditionalGeneration model, it calls _try_inspect_model_cls() [2].

2. Subprocess Creation

The _run_in_subprocess() function spawns the subprocess:

def _run_in_subprocess(fn: Callable[[], _T]) -> _T:
    # Create a temporary directory to hold the result
    with tempfile.TemporaryDirectory() as tempdir:
        output_filepath = os.path.join(tempdir, "registry_output.tmp")

        # Serialize the function and arguments with cloudpickle
        import cloudpickle
        input_bytes = cloudpickle.dumps((fn, output_filepath))

        # Run the subprocess: python -m vllm.model_executor.models.registry
        returned = subprocess.run(
            _SUBPROCESS_COMMAND, input=input_bytes, capture_output=True
        )

where _SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"] [3].
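The round trip can be sketched with the standard library alone. The real code uses cloudpickle so that lambdas can cross the process boundary; here a plain data payload keeps the sketch dependency-free, and an inline child script stands in for `-m vllm.model_executor.models.registry`:

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Child: read the pickled payload from stdin, compute, and write the
# pickled result to the file path it was given.
CHILD = (
    "import pickle, sys\n"
    "payload, out_path = pickle.loads(sys.stdin.buffer.read())\n"
    "with open(out_path, 'wb') as f:\n"
    "    f.write(pickle.dumps(sum(payload)))\n"
)

with tempfile.TemporaryDirectory() as tempdir:
    out_path = os.path.join(tempdir, "registry_output.tmp")
    input_bytes = pickle.dumps(([1, 2, 3], out_path))
    subprocess.run([sys.executable, "-c", CHILD], input=input_bytes, check=True)
    with open(out_path, "rb") as f:
        result = pickle.load(f)
```

The temporary file carries the result back because stdout/stderr are reserved for diagnostics and the wrapped traceback.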

3. Importing the Model in the Subprocess

The subprocess executes the _run() function [4]:

def _run() -> None:
    # Load the plugins
    from vllm.plugins import load_general_plugins
    load_general_plugins()

    # Deserialize the function to execute
    fn, output_file = pickle.loads(sys.stdin.buffer.read())

    # Execute the function (here: the model inspection)
    result = fn()

    # Save the result
    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))

4. The Concrete Import Chain

The function executed in the subprocess is:

lambda: _ModelInfo.from_model_cls(self.load_model_cls())

This triggers the following import chain:

  1. load_model_cls() calls importlib.import_module(self.module_name) [5]
  2. importing vllm.model_executor.models.qwen3_next_vl
  3. qwen3_next_vl.py imports the qwen3_next module at line 48
  4. qwen3_next.py imports the mamba components at line 65 [6]
  5. which finally imports expert_parallel.py, triggering CUDA initialization at line 84
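Step 1 of the chain is plain dynamic importing, which can be sketched with a standard-library stand-in (json.JSONDecoder plays the role of the model class):

```python
import importlib

def load_cls(module_name: str, class_name: str):
    """Resolve a class from its dotted module path, like load_model_cls()."""
    mod = importlib.import_module(module_name)  # this line runs the module's
    return getattr(mod, class_name)             # top-level code, CUDA and all

decoder_cls = load_cls("json", "JSONDecoder")
```

Everything downstream of importlib.import_module() is ordinary module execution, which is exactly why module-level CUDA calls fire here.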

Error Handling and Propagation

If the subprocess fails, the error is caught and wrapped:

try:
    returned.check_returncode()
except Exception as e:
    raise RuntimeError(
        f"Error raised in subprocess:\n{returned.stderr.decode()}"
    ) from e

This way, the main process receives the complete error information [7].
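The wrapping can be reproduced in isolation (the child command below is a stand-in that simply raises):

```python
import subprocess
import sys

# Run a child that fails, then fold its stderr (full traceback included)
# into a RuntimeError, as _run_in_subprocess() does.
returned = subprocess.run(
    [sys.executable, "-c", 'raise RuntimeError("No CUDA GPUs are available")'],
    capture_output=True,
)
try:
    returned.check_returncode()
except Exception as e:
    # The real code re-raises with `raise ... from e`; the wrapped error is
    # kept here so the sketch can be inspected instead of crashing.
    wrapped = RuntimeError(f"Error raised in subprocess:\n{returned.stderr.decode()}")
```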

Notes

The subprocess design adds complexity, but it is necessary: it lets vLLM probe model capabilities safely, without one model's import problems bringing down the whole system. This isolation is one of the key designs that allows vLLM to support hundreds of model architectures.


Citations

File: vllm/model_executor/models/registry.py (L620-620)

_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]

File: vllm/model_executor/models/registry.py (L830-832)

    def load_model_cls(self) -> type[nn.Module]:
        mod = importlib.import_module(self.module_name)
        return getattr(mod, self.class_name)

File: vllm/model_executor/models/registry.py (L851-859)

def _try_inspect_model_cls(
    model_arch: str,
    model: _BaseRegisteredModel,
) -> _ModelInfo | None:
    try:
        return model.inspect_model_cls()
    except Exception:
        logger.exception("Error in inspecting model architecture '%s'", model_arch)
        return None

File: vllm/model_executor/models/registry.py (L1263-1290)

def _run_in_subprocess(fn: Callable[[], _T]) -> _T:
    # NOTE: We use a temporary directory instead of a temporary file to avoid
    # issues like https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file
    with tempfile.TemporaryDirectory() as tempdir:
        output_filepath = os.path.join(tempdir, "registry_output.tmp")

        # `cloudpickle` allows pickling lambda functions directly
        import cloudpickle

        input_bytes = cloudpickle.dumps((fn, output_filepath))

        # cannot use `sys.executable __file__` here because the script
        # contains relative imports
        returned = subprocess.run(
            _SUBPROCESS_COMMAND, input=input_bytes, capture_output=True
        )

        # check if the subprocess is successful
        try:
            returned.check_returncode()
        except Exception as e:
            # wrap raised exception to provide more information
            raise RuntimeError(
                f"Error raised in subprocess:\n{returned.stderr.decode()}"
            ) from e

        with open(output_filepath, "rb") as f:
            return pickle.load(f)

File: vllm/model_executor/models/registry.py (L1293-1308)

def _run() -> None:
    # Setup plugins
    from vllm.plugins import load_general_plugins

    load_general_plugins()

    fn, output_file = pickle.loads(sys.stdin.buffer.read())

    result = fn()

    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))


if __name__ == "__main__":
    _run()

What is the complete process of importing a model? When is the subprocess created, and how does inspect_model_cls end up being called here? Please give a sequence diagram.

The Complete vLLM Model Import Flow

Sequence Diagram

Participants: main process, ModelRegistry, _LazyRegisteredModel, subprocess, model files

  1. The main process initializes ModelRegistry, which creates the _LazyRegisteredModel instances.
  2. When a model architecture needs inspection, the main process calls _try_inspect_model_cls(), which calls inspect_model_cls() on the _LazyRegisteredModel.
  3. Cache hit: the cached _ModelInfo is returned immediately.
  4. Cache miss: _run_in_subprocess() launches `python -m vllm.model_executor.models.registry` via subprocess.run(); the subprocess executes lambda: _ModelInfo.from_model_cls(self.load_model_cls()).
  5. The subprocess's _run() loads the plugins, deserializes the function, and load_model_cls() calls importlib.import_module(), importing qwen3_next_vl.py → qwen3_next.py → expert_parallel.py, which calls torch.cuda.get_device_properties().
  6. CUDA available: the model class is returned, _ModelInfo.from_model_cls() builds the _ModelInfo, the result is saved to the cache, and the _ModelInfo is returned to the main process.
  7. CUDA unavailable: RuntimeError: No CUDA GPUs are available is raised and propagates back to the main process, which returns the _ModelInfo or re-raises the exception.

Detailed Flow

1. Model Registration

When vLLM starts, ModelRegistry creates a _LazyRegisteredModel instance for every supported model architecture [1]. Each instance records the module name and class name but does not import anything yet.
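The registration structure can be sketched as a plain mapping from architecture name to a record of names (a simplified stand-in for _LazyRegisteredModel; the field names follow the cited code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LazyModel:
    """Holds only names; the module is imported on demand, never here."""
    module_name: str
    class_name: str

REGISTRY = {
    "Qwen3NextVLForConditionalGeneration": LazyModel(
        module_name="vllm.model_executor.models.qwen3_next_vl",
        class_name="Qwen3NextVLForConditionalGeneration",
    ),
}
```

Building this dict costs nothing at startup because no entry triggers an import until its class is actually requested.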

2. Inspection Is Triggered

When vLLM needs to know a model's capabilities (e.g. at startup or during configuration validation), it calls _try_inspect_model_cls() [2].

3. Subprocess Creation

_LazyRegisteredModel.inspect_model_cls() first checks the cache [3]. On a cache miss, it calls _run_in_subprocess() [4].
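The cache key in the cited code is a hash of the model file's bytes, so editing the file invalidates the cached _ModelInfo. A sketch (sha256 stands in for vLLM's safe_hash helper):

```python
import hashlib

def module_hash(source_bytes: bytes) -> str:
    """Content hash used as the model-info cache key."""
    return hashlib.sha256(source_bytes).hexdigest()

# Any change to the file's bytes yields a different key, forcing a re-inspect.
key_before = module_hash(b"class Model: ...")
key_after = module_hash(b"class Model: ...  # edited")
```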

The subprocess is launched with the following command [5]:

_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]

4. Subprocess Execution

The subprocess executes the _run() function [6]:

  1. Load the plugins
  2. Deserialize the function to execute
  3. Execute the function (the model inspection)
  4. Save the result to a temporary file

5. Model Import Chain

The function executed in the subprocess is:

lambda: _ModelInfo.from_model_cls(self.load_model_cls())

The load_model_cls() method imports the module with importlib.import_module() [7], triggering the import chain:

  • vllm.model_executor.models.qwen3_next_vl
  • qwen3_next.py imports the mamba components [8]
  • expert_parallel.py calls a CUDA function at module level

6. Error Handling

If the subprocess fails, the error is wrapped and propagated to the main process [9].

Key Design Decisions

  1. Lazy loading: models are only imported when they are actually needed


Citations

File: vllm/model_executor/models/registry.py (L620-620)

_SUBPROCESS_COMMAND = [sys.executable, "-m", "vllm.model_executor.models.registry"]

File: vllm/model_executor/models/registry.py (L793-828)

    def inspect_model_cls(self) -> _ModelInfo:
        model_path = Path(__file__).parent / f"{self.module_name.split('.')[-1]}.py"
        module_hash = None

        if model_path.exists():
            with open(model_path, "rb") as f:
                module_hash = safe_hash(f.read(), usedforsecurity=False).hexdigest()

            mi = self._load_modelinfo_from_cache(module_hash)
            if mi is not None:
                logger.debug(
                    "Loaded model info for class %s.%s from cache",
                    self.module_name,
                    self.class_name,
                )
                return mi
            else:
                logger.debug(
                    "Cache model info for class %s.%s miss. Loading model instead.",
                    self.module_name,
                    self.class_name,
                )

        # Performed in another process to avoid initializing CUDA
        mi = _run_in_subprocess(
            lambda: _ModelInfo.from_model_cls(self.load_model_cls())
        )
        logger.debug(
            "Loaded model info for class %s.%s", self.module_name, self.class_name
        )

        # save cache file
        if module_hash is not None:
            self._save_modelinfo_to_cache(mi, module_hash)

        return mi

File: vllm/model_executor/models/registry.py (L830-832)

    def load_model_cls(self) -> type[nn.Module]:
        mod = importlib.import_module(self.module_name)
        return getattr(mod, self.class_name)

File: vllm/model_executor/models/registry.py (L851-859)

def _try_inspect_model_cls(
    model_arch: str,
    model: _BaseRegisteredModel,
) -> _ModelInfo | None:
    try:
        return model.inspect_model_cls()
    except Exception:
        logger.exception("Error in inspecting model architecture '%s'", model_arch)
        return None

File: vllm/model_executor/models/registry.py (L1250-1258)

ModelRegistry = _ModelRegistry(
    {
        model_arch: _LazyRegisteredModel(
            module_name=f"vllm.model_executor.models.{mod_relname}",
            class_name=cls_name,
        )
        for model_arch, (mod_relname, cls_name) in _VLLM_MODELS.items()
    }
)

File: vllm/model_executor/models/registry.py (L1281-1287)

        try:
            returned.check_returncode()
        except Exception as e:
            # wrap raised exception to provide more information
            raise RuntimeError(
                f"Error raised in subprocess:\n{returned.stderr.decode()}"
            ) from e

File: vllm/model_executor/models/registry.py (L1293-1308)

def _run() -> None:
    # Setup plugins
    from vllm.plugins import load_general_plugins

    load_general_plugins()

    fn, output_file = pickle.loads(sys.stdin.buffer.read())

    result = fn()

    with open(output_file, "wb") as f:
        f.write(pickle.dumps(result))


if __name__ == "__main__":
    _run()