Implementing a Transformer from Scratch: Part 4 - Two Implementations of the Residual Connection: Pre-LN and Post-LN

二分掌柜的 · Published 2026-05-13 21:11:00

flyfish

Pre-LN = Pre-Layer Normalization: layer normalization is applied first, to the input of the sublayer.
Post-LN = Post-Layer Normalization: layer normalization is applied last, after the residual addition.

Pre-LN vs Post-LN

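Notation
$x$: the original input to the current module
$\text{LN}(\cdot)$: layer normalization
$\text{Sublayer}(\cdot)$: the sublayer (self-attention or the feed-forward network, FFN)
$\text{Dropout}(\cdot)$: dropout
$+$: the residual connection, an element-wise addition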

Post-LN formula (original Transformer)

$\boldsymbol{y = \text{LN}\Big(\ x + \text{Dropout}\big(\text{Sublayer}(x)\big)\ \Big)}$

Corresponding code

return self.norm(x + self.dropout(sublayer(x)))

Pre-LN formula (modern large models such as GPT-2 and its successors)

$\boldsymbol{y = x + \text{Dropout}\Big(\ \text{Sublayer}\big(\text{LN}(x)\big)\ \Big)}$

Corresponding code

return x + self.dropout(sublayer(self.norm(x)))
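One practical consequence of the Pre-LN formula is that the residual path $x$ is never normalized, so the output of the last layer is not normalized either. Pre-LN stacks therefore commonly end with one extra LN. The sketch below illustrates this under stated assumptions: `PreLNSublayer` and `PreLNStack` are hypothetical names, the sublayer is a small feed-forward network, and PyTorch's built-in `nn.LayerNorm` stands in for a hand-written layer norm.

```python
import torch
import torch.nn as nn

class PreLNSublayer(nn.Module):
    """Hypothetical wrapper computing x + Dropout(Sublayer(LN(x)))."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.dropout(self.sublayer(self.norm(x)))

class PreLNStack(nn.Module):
    """Hypothetical stack of Pre-LN sublayers followed by one final LN."""
    def __init__(self, n_layers: int, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.layers = nn.ModuleList([
            PreLNSublayer(
                d_model,
                nn.Sequential(
                    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
                ),
                dropout,
            )
            for _ in range(n_layers)
        ])
        # The residual stream is unnormalized, so normalize once at the end.
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return self.final_norm(x)

torch.manual_seed(0)
stack = PreLNStack(n_layers=4, d_model=64, d_ff=256, dropout=0.1).eval()
x = torch.randn(2, 10, 64)
with torch.no_grad():
    y = stack(x)
print(y.shape)
```

With the default `gamma = 1` and `beta = 0`, each token vector in `y` has mean close to 0 and variance close to 1, which the intermediate Pre-LN outputs do not guarantee.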

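Side-by-side comparison

Type	Formula	LN position
Post-LN	$\boldsymbol{\text{LN}}\big(x + \text{Dropout}(\text{Sublayer}(x))\big)$	LN wraps the residual sum (outermost)
Pre-LN	$x + \text{Dropout}\big(\text{Sublayer}(\boldsymbol{\text{LN}}(x))\big)$	LN is applied to the sublayer input (innermost)

Post-LN: normalize last
Pre-LN: normalize first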
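The difference can be checked by hand on a small vector. The following is a minimal sketch in plain Python, assuming dropout is disabled ($p = 0$) and LN has no learnable scale or shift; `toy_sublayer` is a hypothetical stand-in for attention or the FFN, chosen only because it is easy to follow.

```python
import math

def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no gamma/beta)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def toy_sublayer(v):
    """Hypothetical sublayer: scale each element by 3 and shift by 1."""
    return [3.0 * a + 1.0 for a in v]

def post_ln(x, sublayer):
    # LN(x + Sublayer(x)); dropout omitted
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # x + Sublayer(LN(x)); dropout omitted
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def mean_var(v):
    """Return the mean and the (biased) variance of a vector."""
    m = sum(v) / len(v)
    return m, sum((a - m) ** 2 for a in v) / len(v)

x = [1.0, 2.0, 3.0, 4.0]
y_post = post_ln(x, toy_sublayer)
y_pre = pre_ln(x, toy_sublayer)

print("Post-LN mean/var:", mean_var(y_post))
print("Pre-LN  mean/var:", mean_var(y_pre))
```

The Post-LN output is normalized (mean about 0, variance about 1) because LN is the last operation. The Pre-LN output is not: the raw input $x$ passes through the addition untouched, so its scale carries over to the output.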

[Figure: AI-generated diagram of Post-LN and Pre-LN]
FFN (PositionwiseFeedForward, the feed-forward network)

import torch
import torch.nn as nn

# ===================== Shared modules: identical for both variants =====================
class LayerNormalization(nn.Module):
    """Layer normalization over the last dimension."""
    def __init__(self, features: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(features))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(features))   # learnable shift
    
    def forward(self, x: torch.Tensor):
        mean = x.mean(dim=-1, keepdim=True)
        # Biased variance (divide by N), with eps added inside the square root
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
        normalized = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * normalized + self.beta

class PositionwiseFeedForward(nn.Module):
    """Transformer feed-forward network."""
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.linear_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.ReLU()
    
    def forward(self, x):
        # Linear -> ReLU -> Dropout -> Linear
        return self.linear_2(self.dropout(self.activation(self.linear_1(x))))

# ===================== Only the forward() of the residual connection differs =====================
# Version 1: Post-LN (original Transformer)
class ResidualConnection_PostLN(nn.Module):
    def __init__(self, features: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features)

    def forward(self, x, sublayer):
        # Post-LN formula: LN(x + Dropout(Sublayer(x)))
        return self.norm(x + self.dropout(sublayer(x)))

# Version 2: Pre-LN (modern large models such as GPT-2 and its successors)
class ResidualConnection_PreLN(nn.Module):
    def __init__(self, features: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features)

    def forward(self, x, sublayer):
        # Pre-LN formula: x + Dropout(Sublayer(LN(x)))
        return x + self.dropout(sublayer(self.norm(x)))

# ===================== Test code: exercise both structures =====================
if __name__ == "__main__":
    # Fix the random seed so the results are reproducible
    torch.manual_seed(42)
    
    # Hyperparameters
    d_model = 512    # model dimension
    d_ff = 2048      # hidden dimension of the feed-forward network
    dropout = 0.1    # dropout probability
    
    # Build the input: [batch_size, seq_len, d_model]
    x = torch.randn(2, 10, d_model)
    print(f"Input tensor shape: {x.shape}")
    
    # Initialize the sublayer (feed-forward network).
    # Modules default to training mode, so dropout is active here.
    ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
    
    # 1. Test the Post-LN residual connection
    post_ln = ResidualConnection_PostLN(d_model, dropout)
    out_post = post_ln(x, ffn)
    print(f"\nPost-LN output shape: {out_post.shape}")
    
    # 2. Test the Pre-LN residual connection
    pre_ln = ResidualConnection_PreLN(d_model, dropout)
    out_pre = pre_ln(x, ffn)
    print(f"\nPre-LN output shape: {out_pre.shape}")

Output

Input tensor shape: torch.Size([2, 10, 512])

Post-LN output shape: torch.Size([2, 10, 512])

Pre-LN output shape: torch.Size([2, 10, 512])
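The shape check above only shows that both wrappers preserve dimensions. As an additional sanity check, the hand-written `LayerNormalization` can be compared with PyTorch's built-in `nn.LayerNorm`, which also uses the biased variance and adds `eps` inside the square root. The sketch below repeats the class so that it runs on its own; the tolerance is an assumption suited to float32.

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    """Same computation as the hand-written layer above."""
    def __init__(self, features: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

torch.manual_seed(0)
features = 512
x = torch.randn(2, 10, features)

custom = LayerNormalization(features, eps=1e-6)
builtin = nn.LayerNorm(features, eps=1e-6)  # match eps; PyTorch's default is 1e-5

# Both start with scale 1 and shift 0, so the outputs should agree
# up to floating-point rounding.
max_diff = (custom(x) - builtin(x)).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

If the two agree, `nn.LayerNorm(features, eps=1e-6)` could replace the custom class in either residual wrapper without changing the results.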
