Step3-VL Multimodal Model Backbone Code: Nine Chapters Debugging and Rewrite

hai315247543 · Published 2026-06-19 00:29:09

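A worked example of full-dimension debugging and code rewriting with the Nine Chapters programming method. The original code is open source: https://ai.gitcode.com/StepFun/step3/blob/main/modeling_step3.py. It has 1074 lines; the rewrite has about 385. Because the functional and environmental boundaries of the original are unclear, the rewrite is for demonstration only. In principle it is aligned with the original on every function and piece of math that could be identified.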

Step3-VL Multimodal Model Backbone Code: Nine Chapters Debugging Report

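The review found 20 core defects: 6 critical (crash-level), 8 severe, and 6 minor. All of them are structural defects in the original code and are independent of hardware and framework version.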

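No.	Category	Severity	Location	Description	Suggested fix
1	Function	🔴 Critical	`Step3vAttention.forward`	References the member variable `self.attention_dropout`, which `__init__` never defines; in training mode this raises `AttributeError` immediately	Initialize `self.attention_dropout = config.attention_dropout` in `__init__` and validate its range (0 to 1)
2	Parameter boundary	🔴 Critical	`Step3vModel.get_input_embeddings`	Unconditionally calls `input_ids.squeeze(0)`; when batch_size ≠ 1 the squeeze is a no-op, so the batch dimension remains and downstream logic that assumes it was removed produces wrong results or crashes. The input shape is never validated	Validate the batch dimension, remove the hard-coded `squeeze(0)`, and handle the batch dimension generically
3	Parameter	🔴 Critical	`Step3vAttention.forward`	Contains a hard `assert(attention_mask is None)`, so passing an attention_mask in production crashes; and when Python runs in optimized mode (`-O`) asserts are stripped, so the check disappears and the later logic silently goes wrong	Remove the hard assert, validate the argument and raise an explicit error, and support an attention_mask argument
4	Parameter boundary	🔴 Critical	`Step3vModel._process_image_input`	When `patch_image_features` is None but `num_patch > 0`, the code accesses the missing tensor and crashes; there is no precondition check	Validate that the patch count matches the patch features and raise an explicit error on mismatch
5	Parameter boundary	🔴 Critical	`Step3vModel._process_image_features`	`HW = int(sqrt(P))` assumes P is a perfect square; if it is not, the truncated value makes the later `view` fail with a shape mismatch	Check that P is a perfect square and raise an explicit error otherwise
6	Function	🔴 Critical	`Step3vForConditionalGeneration.forward`	Variable-name typo: `los = None`, while the later logic uses `loss`; any later extension that reads `loss` raises `NameError`. The function also contains dead code	Fix the variable name and delete the dead code
7	Parameter boundary	🟠 Severe	`MoELinear.forward`	Indexes `weight` directly with `expert_id` without checking that it lies in the valid range (0 to num_experts-1); an out-of-range index crashes	Validate the expert_id range and raise an explicit error when it is out of range
8	Parameter	🟠 Severe	`Step3vDecoderLayer.__init__`	When `moe_layers_enum` is an empty string, `split(',')` returns `['']` and the conversion to int raises; the format is never validated	Validate the format of the config string and fall back to the default logic when it is empty
9	Parameter boundary	🟠 Severe	`Step3Model.forward`	When `attention_mask` is a dict, the value is read with `causal_mask_mapping[decoder_layer.attention_type]` without checking that the key exists, so a missing key raises `KeyError`	Check that the key exists; raise an explicit error or fall back to a default mask when it does not
10	Parameter	🟠 Severe	`_parse_and_validate_image_input`	When `pixel_values.dim() < 3` the input is silently left unprocessed and the raw tensor is used, which risks shape errors; there is no explicit validation	Validate the number of dimensions and raise an explicit error when it is not acceptable
11	Command	🟠 Severe	`merge_multimodal_embeddings`	Modifies `inputs_embeds` in place and also returns it, so callers can easily mistake the result for a new tensor and hit unexpected side effects	Document the in-place behavior clearly, or return a new tensor and leave the input unmodified
12	Parameter boundary	🟠 Severe	`_flatten_embeddings`	The recursion has no depth limit; deeply nested input overflows the stack (a `RecursionError` in Python)	Add a maximum nesting depth and raise an error when it is exceeded
13	Parameter boundary	🟠 Severe	`Step3vAttention.forward`	Calls `past_key_value.update` without validating the ranges of `layer_idx` and `cache_position`; out-of-range values crash	Validate the index ranges and raise an explicit error when they are out of range
14	Overall structure	🟠 Severe	Entire codebase	The exception-fallback flow is missing entirely: no function has error handling or a degradation path, so any problem crashes the program. This matches the typical signature of a broken five-flow design	Add unified exception capture and degradation logic, with error fallbacks on the core paths
15	Overall structure	🟡 Minor	Entire codebase	Three pools sharing one space: configuration parameters, state data, and operation logic are mixed inside every class with no clear layering, a typical mixed-state architecture	Split the code into configuration, data, and operation layers, physically isolating the three pools
16	Local structure	🟡 Minor	`Step3vDecoderLayer.forward`	One function mixes several responsibilities: residual connections, normalization, attention, and MLP all live in a single function, which is hard to maintain and violates atomic single responsibility	Split each piece of logic into its own atomic function and let forward only orchestrate the flow
17	Parameter	🟡 Minor	`MoELinear.__init__`	`weight` is created with `torch.empty` and relies entirely on an external `post_init` for its values, so it can be used before it is initialized	Add default initialization logic, or state explicitly that external initialization is required
18	Parameter	🟡 Minor	`eager_attention_forward`	Does not check that the query and key dimensions match; a mismatch either broadcasts silently into a wrong result or crashes	Validate that the dimensions match and raise an explicit error otherwise
19	Parameter boundary	🟡 Minor	`Step3vRotaryEmbedding.forward`	Does not validate the value range of `position_ids`, so out-of-range positions are possible	Validate the position_ids range and raise an explicit error when it is out of range
20	Local structure	🟡 Minor	Entire codebase	Contains a large amount of commented-out dead code, unused variables (such as `los`), and redundant logic, all of which raise maintenance cost	Remove the dead code and redundant variables to keep the code lean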
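
Two of the suggested fixes in the table (items 5 and 8) can be sketched as small standalone validators. This is a minimal sketch, not code from the original Step3 file: the helper names `checked_grid_side` and `parse_moe_layers` are hypothetical, and treating an empty string as "use the default layers" is an assumption about the intended fallback.

```python
import math
from typing import List, Optional


def checked_grid_side(num_patches: int) -> int:
    """Return the exact square root of num_patches, or raise if it has none."""
    if num_patches <= 0:
        raise ValueError(f"num_patches must be positive, got {num_patches}")
    side = math.isqrt(num_patches)  # exact integer square root, no float rounding
    if side * side != num_patches:
        raise ValueError(f"num_patches={num_patches} is not a perfect square")
    return side


def parse_moe_layers(moe_layers_enum: Optional[str]) -> Optional[List[int]]:
    """Parse a comma-separated list of layer indices.

    Returns None for a missing or empty string (assumption: the caller then
    applies its default layer selection).
    """
    if moe_layers_enum is None or not moe_layers_enum.strip():
        return None
    layers: List[int] = []
    for part in moe_layers_enum.split(","):
        try:
            index = int(part.strip())
        except ValueError:
            raise ValueError(f"Invalid layer index {part!r} in moe_layers_enum") from None
        if index < 0:
            raise ValueError(f"Layer index must be non-negative, got {index}")
        layers.append(index)
    return layers
```

Both helpers fail with a `ValueError` that names the offending value, instead of failing later inside `view` or `int()` with a less specific message.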

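Core conclusions

This Step3-VL model code is typical industrial-grade large-model code, and its problem pattern is exactly the one found across mainstream large-model inference code:

The core problems cluster around missing parameter-boundary validation, uninitialized variables, null (None) references, and the absence of exception fallbacks
At root they all come from a mixed-state architecture: validation mixed with execution, operation mixed with protection, parameters mixed with state
The defects are low-level, yet the worst of them are fatal, crash-level ones, and they share the same root cause as the bugs in DeepSeek, Qwen, and other large-model code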
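
The "validation mixed with execution" pattern named above can be illustrated with a minimal sketch based on item 7 of the table. The names `validate_expert_id` and `select_expert` are hypothetical, and plain Python lists stand in for the expert weight tensors.

```python
from typing import List, Sequence


def validate_expert_id(expert_id: int, num_experts: int) -> None:
    """Validation step: raises explicitly, so `python -O` cannot strip it."""
    if not 0 <= expert_id < num_experts:
        raise IndexError(f"expert_id {expert_id} not in [0, {num_experts})")


def select_expert(weights: Sequence[List[float]], expert_id: int) -> List[float]:
    """Execution step: indexes the weights only after validation has passed."""
    validate_expert_id(expert_id, len(weights))
    return weights[expert_id]
```

Without the check, a negative `expert_id` would not crash at all: Python indexing would silently return an expert counted from the end, which is harder to notice than an exception.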

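Debugging was carried out uniformly under the Nine Chapters programming principles and re-verified along a line-logic reasoning chain.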

The rewrite follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple, List
from dataclasses import dataclass

# ==========================================================================
# Pool C: constraint configuration pool
# ==========================================================================
@dataclass(frozen=True)
class ModelConfig:
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 4
    head_dim: int = 128
    intermediate_size: int = 11008
    num_hidden_layers: int = 32
    vocab_size: int = 128256
    rms_norm_eps: float = 1e-5
    attention_dropout: float = 0.0
    max_position_embedding: int = 8192
    rope_theta: float = 10000.0
    moe_num_experts: int = 8
    moe_top_k: int = 2
    moe_intermediate_size: int = 2048
    share_expert_dim: int = 2048
    image_token_id: int = 151234
    vision_hidden_size: int = 1024
    vision_output_hidden_size: int = 512
    understand_projector_stride: int = 2
    projector_bias: bool = True
    # Added: global fixed maximum length of the KV cache
    kv_cache_max_length: int = 8192

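    def __post_init__(self):
        # Raise explicitly: plain asserts are stripped when Python runs with -O.
        # NOTE: with the defaults above, image_token_id (151234) >= vocab_size (128256),
        # so ModelConfig() fails the last check; set both from the real checkpoint config.
        checks = {
            "hidden_size > 0": self.hidden_size > 0,
            "num_attention_heads > 0": self.num_attention_heads > 0,
            "head_dim > 0": self.head_dim > 0,
            "0.0 <= attention_dropout <= 1.0": 0.0 <= self.attention_dropout <= 1.0,
            "moe_top_k <= moe_num_experts": self.moe_top_k <= self.moe_num_experts,
            "0 <= image_token_id < vocab_size": 0 <= self.image_token_id < self.vocab_size,
        }
        for rule, ok in checks.items():
            if not ok:
                raise ValueError(f"Invalid ModelConfig: expected {rule}")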

# ==========================================================================
# Pool B: metadata statistics pool
# ==========================================================================
class CachePool:
    def __init__(self, max_length: int):
        self.key_cache: List[Optional[torch.Tensor]] = []
        self.value_cache: List[Optional[torch.Tensor]] = []
        self.seq_length: int = 0
        self.max_length: int = max_length

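    def update(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor):
        """Pool B's dedicated machine: append a KV pair (with overflow check)."""
        if layer_idx < 0:
            raise ValueError(f"layer_idx must be non-negative, got {layer_idx}")
        while len(self.key_cache) <= layer_idx:
            self.key_cache.append(None)
            self.value_cache.append(None)

        # Measure this layer's own length. seq_length is shared by all layers and has
        # already advanced once the first layer is updated, so checking against it
        # would count the new tokens twice for every later layer.
        past_key = self.key_cache[layer_idx]
        cur_len = 0 if past_key is None else past_key.shape[-2]
        new_len = key.shape[-2]
        if cur_len + new_len > self.max_length:
            raise ValueError(
                f"KV cache overflow: current {cur_len}, "
                f"adding {new_len} exceeds max {self.max_length}"
            )

        if past_key is None:
            self.key_cache[layer_idx] = key
            self.value_cache[layer_idx] = value
        else:
            # The single merge point: past and current are concatenated here
            self.key_cache[layer_idx] = torch.cat(
                [past_key, key], dim=-2)
            self.value_cache[layer_idx] = torch.cat(
                [self.value_cache[layer_idx], value], dim=-2)
        self.seq_length = self.key_cache[layer_idx].shape[-2]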

    def get(self, layer_idx: int) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]:
        if layer_idx >= len(self.key_cache) or self.key_cache[layer_idx] is None:
            return None, None
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

    @staticmethod
    def merge_kv(past: Optional[torch.Tensor], current: torch.Tensor) -> torch.Tensor:
        """Pure function: merge past and current KV (concatenation just before attention)."""
        if past is None:
            return current
        return torch.cat([past, current], dim=-2)

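    def reclaim(self, max_retain: int):
        # Guard: a slice of [-0:] would keep everything instead of nothing.
        if max_retain <= 0:
            raise ValueError(f"max_retain must be positive, got {max_retain}")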
        for i in range(len(self.key_cache)):
            if self.key_cache[i] is not None:
                self.key_cache[i] = self.key_cache[i][..., -max_retain:, :]
                self.value_cache[i] = self.value_cache[i][..., -max_retain:, :]
        self.seq_length = min(self.seq_length, max_retain)

# ==========================================================================
# Management manifold: collection of validation machines
# ==========================================================================
class ValidationMachines:
    @staticmethod
    def check_input_dim(tensor: torch.Tensor, expected_dim: int, name: str):
        if tensor.dim() != expected_dim:
            raise ValueError(f"{name} must be {expected_dim}D, got {tensor.dim()}D")

    @staticmethod
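    def check_position_ids_range(position_ids: torch.Tensor, max_pos: int):
        # Check both ends: negative positions are as invalid as positions >= max_pos.
        lo, hi = position_ids.min().item(), position_ids.max().item()
        if lo < 0 or hi >= max_pos:
            raise ValueError(f"position_ids range [{lo}, {hi}] not within [0, {max_pos})")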

    @staticmethod
    def check_expert_id_range(expert_id: int, num_experts: int):
        if expert_id < 0 or expert_id >= num_experts:
            raise ValueError(f"expert_id {expert_id} not in [0, {num_experts})")

    @staticmethod
    def check_tensor_finite(tensor: torch.Tensor, name: str):
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            raise ValueError(f"{name} contains NaN or Inf")

    @staticmethod
    def check_mask_dim(attention_mask: torch.Tensor):
        if attention_mask.dim() != 4:
            raise ValueError(f"attention_mask must be 4D, got {attention_mask.dim()}D")

# ==========================================================================
# Pool A: pure atomic machines: normalization / Q/K/V / RoPE / Attention / FFN
# ==========================================================================
class RMSNormMachine(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        result = output * self.weight
        ValidationMachines.check_tensor_finite(result, "RMSNorm")
        return result

class QProjectionMachine(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, head_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
        self.num_heads = num_heads
        self.head_dim = head_dim
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, _ = x.shape
        q = self.proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        return q

class KProjectionMachine(nn.Module):
    def __init__(self, hidden_size: int, num_kv_heads: int, head_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, _ = x.shape
        k = self.proj(x).view(batch, seq, self.num_kv_heads, self.head_dim).transpose(1, 2)
        return k

class VProjectionMachine(nn.Module):
    def __init__(self, hidden_size: int, num_kv_heads: int, head_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, _ = x.shape
        v = self.proj(x).view(batch, seq, self.num_kv_heads, self.head_dim).transpose(1, 2)
        return v

class RoPEMachine(nn.Module):
    def __init__(self, head_dim: int, max_seq_len: int, rope_theta: float = 10000.0):
        super().__init__()
        self.max_seq_len = max_seq_len
        inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
    def forward(self, x: torch.Tensor, position_ids: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        ValidationMachines.check_position_ids_range(position_ids, self.max_seq_len)
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos, sin = emb.cos(), emb.sin()
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
    @staticmethod
    def apply_rotary(q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
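        half = q.shape[-1] // 2
        # cos/sin are (batch, seq, head_dim) built from two identical halves; keep one
        # half so they broadcast against q1/q2, whose last dimension is head_dim // 2.
        cos_unsq = cos.unsqueeze(1)[..., :half]
        sin_unsq = sin.unsqueeze(1)[..., :half]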
        q1, q2 = q[..., :half], q[..., half:]
        k1, k2 = k[..., :half], k[..., half:]
        q_embed = torch.cat([q1 * cos_unsq - q2 * sin_unsq, q1 * sin_unsq + q2 * cos_unsq], dim=-1)
        k_embed = torch.cat([k1 * cos_unsq - k2 * sin_unsq, k1 * sin_unsq + k2 * cos_unsq], dim=-1)
        return q_embed, k_embed

class AttentionComputeMachine(nn.Module):
    def __init__(self, num_heads: int, num_kv_heads: int, head_dim: int, hidden_size: int, attention_dropout: float = 0.0):
        super().__init__()
        self.num_kv_groups = num_heads // num_kv_heads
        self.scaling = head_dim ** -0.5
        self.attention_dropout = attention_dropout
        self.o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)
    @staticmethod
    def _repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        if n_rep == 1: return x
        b, n_kv, s, d = x.shape
        return x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d).reshape(b, n_kv * n_rep, s, d)
    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        b, _, seq_len, _ = q.shape
        k = self._repeat_kv(k, self.num_kv_groups)
        v = self._repeat_kv(v, self.num_kv_groups)
        attn_weights = torch.matmul(q, k.transpose(2, 3)) * self.scaling
        if attention_mask is not None:
            ValidationMachines.check_mask_dim(attention_mask)
            attn_weights = attn_weights + attention_mask[:, :, :, :k.shape[-2]]
        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_weights = F.dropout(attn_weights, p=self.attention_dropout, training=self.training)
        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.transpose(1, 2).contiguous().reshape(b, seq_len, -1)
        return self.o_proj(attn_output)

class FFNMachine(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# ==========================================================================
# Pool A: pure atomic machine: MoE expert network (dimension mismatch fixed)
# ==========================================================================
class MoEMachine(nn.Module):
    def __init__(self, hidden_size: int, moe_intermediate_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Fix: weight shapes are unified as (out_features, in_features), as F.linear expects
        self.up_proj = nn.Parameter(torch.empty(num_experts, moe_intermediate_size, hidden_size))
        self.gate_proj = nn.Parameter(torch.empty(num_experts, moe_intermediate_size, hidden_size))
        self.down_proj = nn.Parameter(torch.empty(num_experts, hidden_size, moe_intermediate_size))
        # Default initialization so the weights are never read uninitialized (report item 17).
        # std=0.02 is an assumed placeholder; loaded checkpoint weights overwrite it.
        for weight in (self.up_proj, self.gate_proj, self.down_proj):
            nn.init.normal_(weight, mean=0.0, std=0.02)

    def _get_expert_output(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        ValidationMachines.check_expert_id_range(expert_id, self.num_experts)
        gate_out = F.linear(x, self.gate_proj[expert_id])
        up_out = F.linear(x, self.up_proj[expert_id])
        # Fix: removed the erroneous .T transpose
        return F.linear(F.silu(gate_out) * up_out, self.down_proj[expert_id])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, s, d = hidden_states.shape
        flat = hidden_states.view(-1, d)
        router_logits = self.gate(flat)
        weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        weights, experts = torch.topk(weights, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        weights = weights.to(hidden_states.dtype)
        output = torch.zeros_like(flat)
        expert_mask = F.one_hot(experts, num_classes=self.num_experts).permute(2, 1, 0)
        for eid in range(self.num_experts):
            idx, top_x = torch.where(expert_mask[eid])
            if top_x.numel() == 0:
                continue
            expert_out = self._get_expert_output(flat[top_x], eid)
            output.index_add_(0, top_x, expert_out * weights[top_x, idx, None])
        return output.view(b, s, d)

# ==========================================================================
# Pool A: multimodal embedding merge machine
# ==========================================================================
class MultimodalEmbeddingMachine:
    @staticmethod
    def merge(input_ids: torch.Tensor, text_embeds: torch.Tensor, multimodal_embeddings: List[torch.Tensor], image_token_id: int) -> torch.Tensor:
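        is_image = (input_ids == image_token_id)
        num_placeholders = int(is_image.sum().item())
        if not multimodal_embeddings:
            raise ValueError("multimodal_embeddings must not be empty")
        # Flatten first so the token count is right for 2D and 3D embeddings alike
        flat_mm = torch.cat([emb.reshape(-1, emb.shape[-1]) for emb in multimodal_embeddings], dim=0)
        if flat_mm.shape[0] != num_placeholders:
            raise ValueError(f"Multimodal tokens {flat_mm.shape[0]} != placeholders {num_placeholders}")
        # Clone so the caller's text_embeds is left untouched (no in-place side effect)
        inputs_embeds = text_embeds.clone()
        inputs_embeds[is_image] = flat_mm.to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)
        return inputs_embeds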

# ==========================================================================
# L2 orchestration layer: decoder-layer scheduler (cache memory leak and duplicate concatenation fixed)
# ==========================================================================
class DecoderLayerScheduler(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        self.q_proj = QProjectionMachine(config.hidden_size, config.num_attention_heads, config.head_dim)
        self.k_proj = KProjectionMachine(config.hidden_size, config.num_key_value_heads, config.head_dim)
        self.v_proj = VProjectionMachine(config.hidden_size, config.num_key_value_heads, config.head_dim)
        self.attn = AttentionComputeMachine(config.num_attention_heads, config.num_key_value_heads, config.head_dim, config.hidden_size, config.attention_dropout)
        self.input_norm = RMSNormMachine(config.hidden_size, config.rms_norm_eps)
        self.post_norm = RMSNormMachine(config.hidden_size, config.rms_norm_eps)
        self.use_moe = config.moe_num_experts > 0
        if self.use_moe:
            self.moe = MoEMachine(config.hidden_size, config.moe_intermediate_size, config.moe_num_experts, config.moe_top_k)
            self.shared_expert = FFNMachine(config.hidden_size, config.share_expert_dim)
        else:
            self.mlp = FFNMachine(config.hidden_size, config.intermediate_size)

    def forward(self, h, cos, sin, mask=None, past_k=None, past_v=None):
        normed = self.input_norm(h)
        q = self.q_proj(normed)
        k = self.k_proj(normed)
        v = self.v_proj(normed)
        q, k = RoPEMachine.apply_rotary(q, k, cos, sin)
        
        # Fix: concatenate only what attention needs here; the original k, v stay unmodified
        k_attn = CachePool.merge_kv(past_k, k)
        v_attn = CachePool.merge_kv(past_v, v)

        attn_out = self.attn(q, k_attn, v_attn, mask)
        h = h + attn_out
        normed = self.post_norm(h)
        if self.use_moe:
            ffn_out = self.moe(normed) + self.shared_expert(normed)
        else:
            ffn_out = self.mlp(normed)
        h = h + ffn_out

        # Fix: return only this step's k, v; the outer CachePool handles all concatenation
        return h, k, v

# ==========================================================================
# L2 orchestration layer: multi-layer scheduler (CachePool initialization logic fixed)
# ==========================================================================
class ModelScheduler(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        self.layers = nn.ModuleList([DecoderLayerScheduler(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNormMachine(config.hidden_size, config.rms_norm_eps)
        # Read the fixed constant from the configuration pool
        self.max_cache_length = config.kv_cache_max_length

    def forward(self, h, cos, sin, mask=None, cache=None):
        # Fix: initialize with the fixed constant so that the overflow check takes effect
        new_cache = CachePool(max_length=self.max_cache_length)

        # If a historical cache is passed in, carry its state over into new_cache for this step.
        # Production implementations usually update the cache in place; here a new_cache is
        # built instead, following the immutability of the original design.
        if cache is not None:
            new_cache.key_cache = cache.key_cache.copy()
            new_cache.value_cache = cache.value_cache.copy()
            new_cache.seq_length = cache.seq_length

        for i, layer in enumerate(self.layers):
            pk, pv = new_cache.get(i)
            h, k, v = layer(h, cos, sin, mask, pk, pv)
            # Append this step's k, v to new_cache
            new_cache.update(i, k, v)
            
        h = self.norm(h)
        return h, new_cache

# ==========================================================================
# L1 entry layer: complete Step3-VL model
# ==========================================================================
class Step3VLModel(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        self.config = config
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.rotary = RoPEMachine(config.head_dim, config.max_position_embedding, config.rope_theta)
        self.scheduler = ModelScheduler(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids, image_embeddings=None, attention_mask=None, position_ids=None, cache=None):
        ValidationMachines.check_input_dim(input_ids, 2, "input_ids")
        h = self.embed(input_ids)
        if image_embeddings is not None:
            h = MultimodalEmbeddingMachine.merge(input_ids, h, image_embeddings, self.config.image_token_id)
            
        b, s = h.shape[:2]
        if position_ids is None:
            past_len = cache.seq_length if cache is not None else 0
            position_ids = torch.arange(past_len, past_len + s, device=h.device).unsqueeze(0)
        cos, sin = self.rotary(h, position_ids)
        
        h, new_cache = self.scheduler(h, cos, sin, attention_mask, cache)
        logits = self.lm_head(h)
        ValidationMachines.check_tensor_finite(logits, "logits")
        return logits, new_cache
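
`AttentionComputeMachine` above adds `attention_mask` to the scores exactly as given and applies no masking when it is `None`, so causal decoding depends on the caller supplying a 4D additive mask. The following is a minimal sketch of building one. The helper name `build_causal_mask` is hypothetical and not part of the original code or the rewrite, and using the dtype's most negative finite value instead of `-inf` is a design assumption.

```python
import torch


def build_causal_mask(q_len: int, past_len: int = 0,
                      dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Return a (1, 1, q_len, past_len + q_len) additive causal mask.

    Entry [i, j] is 0 where query i may attend to key j, and the most negative
    finite value of dtype otherwise. Query i sits at absolute position past_len + i.
    """
    kv_len = past_len + q_len
    q_pos = torch.arange(past_len, kv_len).unsqueeze(1)  # (q_len, 1)
    k_pos = torch.arange(kv_len).unsqueeze(0)            # (1, kv_len)
    allowed = k_pos <= q_pos                              # (q_len, kv_len)
    mask = torch.zeros(q_len, kv_len, dtype=dtype)
    mask = mask.masked_fill(~allowed, torch.finfo(dtype).min)
    # Leading singleton dims broadcast over batch and heads.
    return mask[None, None, :, :]
```

A caller could pass `build_causal_mask(s, past_len, dtype=h.dtype).to(h.device)` as `attention_mask`; the leading singleton dimensions broadcast against attention scores of shape (batch, heads, q_len, kv_len).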