GE Graph Optimization: From Computation Graph to Execution Plan

50084 · Published 2026-05-24 12:09:26

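Preface

GE (Graph Engine) is a core component of CANN. It turns a computation graph into a plan that the NPU can execute. Understanding how GE optimizes graphs makes it easier to write models that perform well.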

1. Representing the Computation Graph

Inside GE, a model that comes from PyTorch or ONNX is represented as a ComputeGraph structure:

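// GE's internal computation graph structure (simplified)
#include <map>
#include <string>
#include <vector>

class Node; class Edge; class Tensor; class OpDesc;  // forward declarations

class ComputeGraph {
    std::vector<Node*> nodes_;                 // operator nodes
    std::vector<Edge*> edges_;                 // data edges
    std::map<std::string, Tensor*> tensors_;   // intermediate tensors
};

class Node {
    std::string op_type_;            // operator type: Conv, MatMul, ReLU...
    std::vector<Tensor*> inputs_;    // input tensors
    std::vector<Tensor*> outputs_;   // output tensors
    OpDesc* op_desc_;                // operator attributes
};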
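To make the structure above concrete, here is a minimal Python sketch of the same idea: nodes that name their producers, edges derived from those names, and a topological order in which the nodes could be executed. The class and method names (`ComputeGraph.add`, `edges`, `topological_order`) and the op types `Data` and `Const` are hypothetical and chosen for illustration; they are not GE's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op_type: str                                  # e.g. "Mul", "Add"
    inputs: list = field(default_factory=list)    # names of producer nodes


@dataclass
class ComputeGraph:
    nodes: dict = field(default_factory=dict)     # name -> Node, in insertion order

    def add(self, name, op_type, inputs=()):
        self.nodes[name] = Node(name, op_type, list(inputs))
        return name

    def edges(self):
        # One data edge per (producer, consumer) pair.
        return [(src, node.name) for node in self.nodes.values() for src in node.inputs]

    def topological_order(self):
        # Depth-first post-order: every node appears after all of its producers.
        order, visited = [], set()

        def visit(name):
            if name in visited:
                return
            visited.add(name)
            for src in self.nodes[name].inputs:
                visit(src)
            order.append(name)

        for name in self.nodes:
            visit(name)
        return order


g = ComputeGraph()
g.add("x", "Data")
g.add("scale", "Const")
g.add("bias", "Const")
g.add("mul", "Mul", ["x", "scale"])
g.add("add", "Add", ["mul", "bias"])
print(g.edges())
print(g.topological_order())
```

Storing producer names instead of pointers keeps the sketch short; a real graph engine would also track tensor shapes, dtypes, and operator attributes.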

Exporting the computation graph

Use torch.jit.trace to convert a PyTorch model into a computation graph:

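import torch
import torch_npu

# MyModel is your own torch.nn.Module
model = MyModel().npu().eval()
traced = torch.jit.trace(model, torch.randn(1, 3, 224, 224).npu())

# Export the GE computation graph.
# NOTE: save_ge_graph is the call used in this article; check that your
# torch_npu version provides it before relying on it.
torch.npu.save_ge_graph(traced, "model_graph.txt")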

Open model_graph.txt in Netron to see the full structure of the computation graph.

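2. Constant Folding

Constant folding is the most basic optimization: anything that can be computed at compile time is computed ahead of time.

Before optimization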

import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = torch.tensor(2.0)  # constant
        self.bias = torch.tensor(1.0)   # constant

    def forward(self, x):
        return x * self.scale + self.bias  # Mul and Add both run at runtime

Computation graph:

input ──→ [Mul] ──→ [Add] ──→ output
            ↑          ↑
          scale      bias

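After GE optimization

Here x is a dynamic input, so the Mul and Add cannot be evaluated away entirely. But when scale and bias are known at compile time, GE bakes them in as parameters and merges Mul + Add into a single operator:

// Pseudocode after optimization
output = x * 2.0 + 1.0;  // coefficients fixed at compile time

Computation graph: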

input ──→ [Scale] ──→ output

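The parameters of the Scale operator are fixed at compile time, so only one multiply-add is needed at runtime.

Trigger conditions

A subgraph can be folded into a constant when:

All of its inputs are constants (not dynamic tensors)
Its operators have no side effects (they do not modify global state)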

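# Nothing to fold: the only computation depends on the dynamic input x
def forward(self, x):
    scale = torch.tensor(2.0)
    return x * scale  # x is dynamic, so this cannot be precomputed

# Constant folding applies to a + b
def forward(self, x):
    a = torch.tensor(2.0)
    b = torch.tensor(3.0)
    c = a + b  # computed at compile time: c = 5.0
    return x * c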
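The two trigger conditions can be sketched as a small pass over a toy graph. This is only an illustration of the general technique, not GE's implementation: the node format `(name, op, inputs)`, the `OPS` table, and the function `fold_constants` are all hypothetical.

```python
import operator

# Hypothetical toy IR: each node is (name, op, inputs), listed in topological order.
# A "const" node stores its Python value in inputs[0]; an "input" node has no inputs.
OPS = {"add": operator.add, "mul": operator.mul}   # side-effect-free ops only


def fold_constants(nodes):
    """Return a new node list in which all-constant subgraphs are pre-evaluated."""
    consts = {}    # node name -> value known at compile time
    folded = []
    for name, op, inputs in nodes:
        if op == "const":
            consts[name] = inputs[0]
            folded.append((name, op, inputs))
        elif op in OPS and all(i in consts for i in inputs):
            # Every input is a constant: evaluate now and replace the node.
            value = OPS[op](*(consts[i] for i in inputs))
            consts[name] = value
            folded.append((name, "const", [value]))
        else:
            # Depends on a dynamic tensor: must stay in the runtime graph.
            folded.append((name, op, inputs))
    return folded


graph = [
    ("x", "input", []),
    ("a", "const", [2.0]),
    ("b", "const", [3.0]),
    ("c", "add", ["a", "b"]),   # all inputs constant, so this is folded
    ("y", "mul", ["x", "c"]),   # depends on x, so this is kept
]
folded_graph = fold_constants(graph)
for node in folded_graph:
    print(node)
```

The sketch leaves the now-unused constants `a` and `b` in the list; a real pass would follow up with dead-node removal.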

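3. Common Subexpression Elimination (CSE)

If the computation graph contains repeated computations, GE removes the duplicate subgraphs.

Before optimization

def forward(self, x):
    a = x + 1
    b = x + 1  # duplicate computation
    return a + b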

Computation graph:

x ──┬─→ [Add] ──→ a ──┐
    │     ↑           ├─→ [Add] ──→ output
    │     1           │
    └─→ [Add] ──→ b ──┘
          ↑
          1

The two Add operators that compute x + 1 are identical, so one of them is wasted work.

After GE optimization

def forward(self, x):
    a = x + 1
    return a + a  # reuse a

Computation graph:

x ──→ [Add] ──→ a ──┬─→ [Add] ──→ output
        ↑           │
        1           └─→ (reused as the second input)

x + 1 is computed only once, and its result is reused.

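Limits of CSE

# CSE does not apply (the operator is random)
def forward(self, x):
    a = torch.randn_like(x)  # different result on every call
    b = torch.randn_like(x)
    return a + b  # a cannot stand in for b

Random operators, Dropout, and other stateful operators are not merged by CSE.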
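A common way to implement CSE is to key each node by its operator and inputs, and to redirect later duplicates to the first occurrence. The sketch below does this on a toy node list and skips non-deterministic operators, matching the limitation described above. The node format, the `NON_DETERMINISTIC` set, and the function name are hypothetical; how GE implements the pass internally is not covered by this article.

```python
# Hypothetical toy IR: (name, op, inputs), listed in topological order.
NON_DETERMINISTIC = {"randn_like", "dropout"}   # never merged
LEAVES = {"input", "const"}                     # graph inputs and constants are kept as-is


def eliminate_common_subexpressions(nodes):
    seen = {}     # (op, inputs) -> name of the first node that computes it
    alias = {}    # removed node name -> surviving node name
    result = []
    for name, op, inputs in nodes:
        # Rewrite inputs so they point at surviving nodes.
        inputs = tuple(alias.get(i, i) for i in inputs)
        mergeable = op not in NON_DETERMINISTIC and op not in LEAVES
        key = (op, inputs)
        if mergeable and key in seen:
            alias[name] = seen[key]     # duplicate: drop it and remember the alias
            continue
        if mergeable:
            seen[key] = name
        result.append((name, op, list(inputs)))
    return result


deterministic = [
    ("x", "input", []),
    ("one", "const", []),
    ("a", "add", ["x", "one"]),
    ("b", "add", ["x", "one"]),     # same op, same inputs as a
    ("out", "add", ["a", "b"]),
]
random_graph = [
    ("x", "input", []),
    ("a", "randn_like", ["x"]),
    ("b", "randn_like", ["x"]),     # looks identical, but must not be merged
    ("out", "add", ["a", "b"]),
]
print(eliminate_common_subexpressions(deterministic))
print(eliminate_common_subexpressions(random_graph))
```

The key compares inputs in order, so `x + 1` and `1 + x` would not be merged here; a fuller pass would sort the inputs of commutative operators first.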

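4. Operator Fusion

Operator fusion is GE's most important optimization: it merges several small operators into one larger operator to cut down on device-memory reads and writes.

Conv + BN + ReLU fusion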

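import torch

# Original model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.conv = torch.nn.Conv2d(3, 64, 3)
        self.bn = torch.nn.BatchNorm2d(64)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.conv(x)   # writes device memory
        x = self.bn(x)     # reads and writes device memory
        x = self.relu(x)   # reads and writes device memory
        return x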

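Each of the three operators writes its result to device memory, and BN and ReLU each read their input back, so the intermediate results conv_out and bn_out both make a round trip through device memory.

After fusion

GE merges the three operators into a single ConvBNReLU operator: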

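// Pseudocode for the fused operator
// mean and std are the BN running statistics, fixed in eval mode (per-channel indexing omitted)
void ConvBNReLU(Tensor input, Tensor weight, Tensor bn_weight, Tensor bn_bias,
                float mean, float std, Tensor output) {
    // The whole computation stays in the UB (Unified Buffer); no intermediate result is written back to device memory
    for (int i = 0; i < output.size(); i++) {
        float conv_out = conv_compute(input, weight, i);
        float bn_out = (conv_out - mean) / std * bn_weight + bn_bias;
        output[i] = relu(bn_out);
    }
}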

Only the final output is written to device memory, once; intermediate results stay in the UB.

Fusion conditions

Fusion patterns supported by GE:

Fusion pattern	Condition	Performance gain
Conv + BN + ReLU	BN in eval mode	40%
MatMul + Add + ReLU	Add is a bias add	15%
FlashAttention	Q/K/V come from the same input	3x (attention part only)
LayerNorm + Dropout + Residual	Fixed Dropout ratio	20%
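The eval-mode condition on Conv + BN + ReLU has a simple arithmetic reason: in eval mode BatchNorm uses fixed running statistics, so it is an affine function that can be folded into the convolution's weight and bias ahead of time. In training mode the statistics depend on the current batch, so no such folding is possible. The sketch below checks the folding identity on a single-channel 1x1 "convolution" reduced to scalars. The function names are hypothetical, and whether GE folds BN exactly this way or fuses at the kernel level is an assumption not confirmed by this article.

```python
import math


def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold eval-mode BatchNorm into the conv weight and bias (single channel)."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def conv_bn_relu_unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    conv_out = w * x + b                                            # 1x1 conv, one channel
    bn_out = (conv_out - mean) / math.sqrt(var + eps) * gamma + beta
    return max(bn_out, 0.0)                                         # ReLU


def conv_bn_relu_fused(x, w_folded, b_folded):
    # One multiply-add plus ReLU, with no intermediate tensors.
    return max(w_folded * x + b_folded, 0.0)


params = dict(w=0.5, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_folded, b_folded = fold_bn_into_conv(**params)
for x in (-2.0, 0.0, 1.0, 3.0):
    reference = conv_bn_relu_unfused(x, **params)
    fused = conv_bn_relu_fused(x, w_folded, b_folded)
    print(x, reference, fused)
```

For real multi-channel convolutions the same formula is applied per output channel, with `scale` broadcast over each filter.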

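5. Memory Planning

GE analyzes the lifetime of every tensor and lets tensors whose lifetimes do not overlap share the same block of device memory.

Lifetime analysis

def forward(self, x):
    a = op1(x)   # a is live from here until op3 uses it
    b = op2(a)   # b is live from here until op4 uses it
    c = op3(a)   # last use of a
    d = op4(b)   # last use of b
    return d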

Timeline:

a: |----------|  (op1 → op3)
b:      |----------|  (op2 → op4)

The lifetimes of a and b overlap, so they cannot share the same memory.

Reuse example

def forward(self, x):
    a = op1(x)
    b = op2(a)  # last use of a; its memory can be released afterwards
    c = op3(b)  # c can reuse a's memory
    return c

Timeline:

a: |-----|
b:      |----------|
c:           |-----|  (reuses a's memory)
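The lifetime reasoning in both examples can be sketched as a two-step procedure: record the last step at which each tensor is read, then walk the operators in order, handing each output a free buffer when one is large enough and releasing inputs after their last use. This is a generic greedy planner written for illustration; the function `plan_memory` and its data layout are hypothetical, and GE's actual allocator may differ.

```python
def plan_memory(ops, sizes):
    """ops: list of (output, inputs) in execution order; sizes: tensor -> bytes.

    Returns (assignment, buffer_size): tensor -> buffer id, and the size of each buffer.
    """
    # Step 1: liveness. The last step that reads a tensor ends its lifetime.
    last_use = {}
    for step, (_, inputs) in enumerate(ops):
        for t in inputs:
            last_use[t] = step

    # Step 2: greedy assignment in execution order.
    free = []           # ids of buffers whose tensor is no longer live
    buffer_size = []    # buffer id -> size in bytes
    assignment = {}     # tensor name -> buffer id
    for step, (out, inputs) in enumerate(ops):
        reusable = [b for b in free if buffer_size[b] >= sizes[out]]
        if reusable:
            buf = min(reusable, key=lambda b: buffer_size[b])   # best fit
            free.remove(buf)
        else:
            buf = len(buffer_size)
            buffer_size.append(sizes[out])
        assignment[out] = buf
        # Release inputs only after the output is placed: an op's output must not
        # overwrite an input that the same op is still reading.
        for t in set(inputs):
            if t in assignment and last_use[t] == step:
                free.append(assignment[t])
    return assignment, buffer_size


sizes = {"a": 1024, "b": 1024, "c": 1024, "d": 1024}

# Overlapping lifetimes: a is still needed by op3 while b is live.
overlap_plan, _ = plan_memory(
    [("a", ["x"]), ("b", ["a"]), ("c", ["a"]), ("d", ["b"])], sizes)

# Chain: a dies at op2, so c can take over a's buffer.
chain_plan, chain_buffers = plan_memory(
    [("a", ["x"]), ("b", ["a"]), ("c", ["b"])], sizes)

print(overlap_plan)
print(chain_plan, chain_buffers)
```

In the chain example three tensors fit in two buffers, which is the saving the timeline above illustrates. The graph input `x` is left out of the plan because it is owned by the caller.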

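Viewing the memory planning result

# Variable name as given in this article; confirm it against the CANN documentation for your version
export GE_MEMORY_PLANNING_LOG=1
# --framework=5 selects ONNX; set SOC_VERSION to your target chip
atc --model=model.onnx --framework=5 --output=model --soc_version="${SOC_VERSION}"

The log shows the memory offset and size of each tensor, along with which tensors reuse memory.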

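6. Operator Scheduling (Kernel Selection)

The same operator can have several implementations, and GE picks the best one.

Cube vs Vector

Matrix multiplication has two implementations: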

// Cube implementation (suited to large matrices)
void MatMul_Cube(Tensor A, Tensor B, Tensor C) {
    // Hardware-accelerated on the Cube Unit
    cube_gemm(A, B, C);
}

// Vector implementation (suited to small matrices or unusual shapes)
// A is M x K, B is K x N, C is M x N
void MatMul_Vector(Tensor A, Tensor B, Tensor C) {
    // Computed in software on the Vector Unit
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < K; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

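GE chooses based on the matrix dimensions:

Matrix size	Choice	Reason
M, N, K >= 16 and K is a multiple of 16	Cube	Cube is efficient at this size
M < 16 or N < 16	Vector	Cube is inefficient for small matrices
K is not a multiple of 16	Vector	Cube requires aligned data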
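Read as a decision procedure, the table amounts to two checks. The sketch below encodes exactly those rules and nothing more; the function name `select_matmul_impl` and the `align` parameter are hypothetical, and GE's real selection logic may weigh other factors such as data type and layout.

```python
def select_matmul_impl(m, n, k, align=16):
    """Pick "cube" or "vector" for an (m x k) @ (k x n) matmul using the table's rules."""
    if m < align or n < align:
        return "vector"    # small matrices: the Cube Unit would be underused
    if k % align != 0:
        return "vector"    # the Cube Unit needs an aligned reduction dimension
    return "cube"


for shape in [(1024, 1024, 1024), (8, 1024, 1024), (64, 64, 100), (16, 16, 16)]:
    print(shape, select_matmul_impl(*shape))
```

One practical consequence is that padding K up to the next multiple of 16 can move a matmul from the Vector path to the Cube path, at the cost of some extra memory.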

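Forcing an implementation

# NOTE: set_op_impl is the call used in this article; check that your
# torch_npu version provides it before relying on it.

# Force the Cube implementation
torch.npu.set_op_impl("matmul", impl="cube")
output = torch.matmul(a, b)

# Force the Vector implementation
torch.npu.set_op_impl("matmul", impl="vector")
output = torch.matmul(a, b)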

7. Inspecting the Optimized Graph

Exporting the graph before and after optimization

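import torch
import torch_npu

# MyModel is your own torch.nn.Module
model = MyModel().npu().eval()

# Export the graph before optimization
torch.npu.save_ge_graph(model, "before_opt.txt")

# Run inference once to trigger optimization
model(torch.randn(1, 3, 224, 224).npu())

# Export the graph after optimization
torch.npu.save_ge_graph(model, "after_opt.txt")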

Compare the two files in Netron to see what the optimization changed:

Fewer nodes (fusion, elimination)
Fewer edges (reuse, elimination)
Lower memory usage
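Alongside the visual comparison, the node-count change can be summarized numerically. The sketch below assumes the operator types of each graph have already been extracted into plain Python lists; it does not parse the exported files, whose format is not described here. The function name `summarize_changes` and the sample lists are hypothetical.

```python
from collections import Counter


def summarize_changes(before_ops, after_ops):
    """Return (node count before, node count after, per-op-type change)."""
    before, after = Counter(before_ops), Counter(after_ops)
    delta = {
        op: after[op] - before[op]
        for op in sorted(set(before) | set(after))
        if after[op] != before[op]
    }
    return len(before_ops), len(after_ops), delta


# Hypothetical op lists for a tiny graph before and after fusion and CSE.
before_ops = ["Conv", "BatchNorm", "ReLU", "Add", "Add"]
after_ops = ["ConvBNReLU", "Add"]

n_before, n_after, delta = summarize_changes(before_ops, after_ops)
print(f"nodes: {n_before} -> {n_after}")
for op, change in delta.items():
    print(f"{op}: {change:+d}")
```

Negative entries show operators that were fused or eliminated, and positive entries show the fused operators that replaced them.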

References

GE optimization principles: https://www.hiascend.com/document/detail/zh/CANN/
Operator fusion rules: https://www.hiascend.com/document/detail/zh/CANN/
Netron visualization tool: https://netron.app/
ATC compilation parameters: https://www.hiascend.com/document/detail/zh/CANN/

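Summary

The GE graph optimizations covered here are: constant folding, which precomputes whatever is known at compile time; CSE, which removes duplicate computation; operator fusion, which cuts device-memory reads and writes; memory planning, which lets tensors share space; and kernel selection, which picks an implementation for each operator. Once you understand them, you can write models that cooperate with them: pull constants out, avoid repeated computation, and use standard operator combinations that trigger fusion. Viewing the graph in Netron before and after optimization gives a direct picture of what GE did. When tuning, node count and device-memory usage are two key indicators: as a rule of thumb, fewer nodes and a smaller memory footprint go with better model performance.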