Gdev-to-Rust Porting Project (Part 8)

huanhuanou · Published 2026-05-20 17:46:05

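From C to Rust: A Chronicle of Porting the GDEV GPU Runtime

QEMU Deployment: Getting launch to Run on the Accelerator Card with the Rust gdev

Recap

All four layers have been translated: L1, the AIDEV ioctl wrappers; L2, the gdev core abstractions; L3, the SDAA Driver API; L4, the Ocelot Runtime (redesigned in idiomatic Rust). That is 31 .rs files, about 10,300 lines, and 196 unit tests with 0 failures. cargo build --release produces libgdev_layer2.so + libgdev_layer4.so, the counterparts of the original C libraries libgdev.so + libusdaa.so.
Goal: on a QEMU verification machine (CentOS 7, glibc 2.17, with four AI accelerator cards attached as /dev/aicard0-3), replace the original C libraries with the Rust-built .so files and run the original test program launch.c end to end. At present the launch test passes with the software-simulated flow.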

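malloc Fails: The Context Black Hole

The earlier device-count problem was fixed, but sdaaMalloc now returned SDAA_ERROR_OUT_OF_MEMORY.
Tracing the call chain:

sdaaMalloc → Layer 4 malloc()
  → sd_mem_alloc → Layer 3 memory allocation
    → sd_ctx_get_current() → looks up the current thread's context in GDEV_CTX_LIST
      → not found → handle = 0
        → gmalloc(0, size) → fails → SDAA_ERROR_OUT_OF_MEMORY

Problem: enumerate_devices discovered the devices but never opened a handle. No gopen() means no context, so none of the Layer 3 sd_* APIs can find a device handle.
Fix: enumerate_devices now calls gopen() and sd_ctx_create() for each discovered device and stores the handles in Runtime.handles: Vec. malloc takes its handle directly from handles[0].
After the change malloc succeeds: gmalloc→addr=0x61000100c000, a real device physical address.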
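
The failure and the fix can be sketched in a few lines. This is a minimal illustration, not the project's code: the gopen and gmalloc bodies are mocks, the SdaaError enum and the handle numbering (minor + 1, chosen to match the 0x1 to 0x4 values in the log) are assumptions, and the returned address is a placeholder.

```rust
// Hypothetical stand-ins: the real gopen/gmalloc live in Layer 2 and talk to the driver.
type Handle = u64;

#[derive(Debug, PartialEq)]
enum SdaaError {
    OutOfMemory,
}

/// Mock of gopen: returns minor + 1 as the handle (assumed numbering).
fn gopen(minor: u32) -> Handle {
    u64::from(minor) + 1
}

/// Mock of gmalloc: handle 0 means "no device opened", so the allocation fails.
fn gmalloc(handle: Handle, size: u64) -> Option<u64> {
    if handle == 0 || size == 0 {
        None
    } else {
        Some(0x6100_0100_c000) // placeholder address, not a real allocation
    }
}

struct Runtime {
    handles: Vec<Handle>,
}

impl Runtime {
    /// Open every discovered device up front so later calls have a handle to use.
    fn enumerate_devices(count: u32) -> Self {
        Runtime {
            handles: (0..count).map(gopen).collect(),
        }
    }

    fn malloc(&self, size: u64) -> Result<u64, SdaaError> {
        // With no opened handle this falls back to 0, reproducing the original symptom.
        let handle = self.handles.first().copied().unwrap_or(0);
        gmalloc(handle, size).ok_or(SdaaError::OutOfMemory)
    }
}
```

With an empty handles vector, Runtime::malloc returns Err(SdaaError::OutOfMemory); after Runtime::enumerate_devices(4) it picks handle 0x1 and succeeds.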

memcpy/launch Are No-ops: The Empty GdevCompute Shell

malloc now succeeded, but sdaaMemcpy and sdaaLaunch had no effect. Debug output:

memcpy: ok=false
launch: compute_fn=None

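Root cause: the GdevCompute struct in Layer 2 has 19 function-pointer slots (launch, memcpy, fence_read, fence_write, ...), and all of them were None. In gdev's architecture, when gopen creates a context it obtains a gdev_compute table from the device driver and fills in these function pointers. In the original C code that table comes from the static gdev_compute_aidev instance in gdev_aidev.c.
Fix: a new file, layer2/src/backend.rs, with the ring buffer protocol translated line by line from gdev_aidev.c:
Ring buffer protocol (the card's hardware interface):

  __gdev_out_ring(ctx, word)        → write one u64 to the push buffer
  __gdev_begin_ring_aidev(op,fn,n)  → write the operation header: (op<<40)|(fn<<32)|n|(sid<<8)
  __gdev_fire_ring(ctx)             → push the push buffer to the hardware register
  ctx->fifo.regs[REG_PB_PUT] = pb_put  ← MMIO write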
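
As a rough sketch of how these three primitives fit together, the code below packs the header with the bit layout quoted above and models the push buffer in plain memory. Everything beyond that layout is assumed for illustration: the PushBuffer type, tracking pb_put in bytes, and using an ordinary field in place of the REG_PB_PUT register. The source does not give the field widths, so the sketch does not mask its inputs.

```rust
/// Pack a command header using the layout (op << 40) | (fn << 32) | n | (sid << 8).
/// `func` replaces `fn`, which is a reserved word in Rust.
fn ring_header(op: u64, func: u64, n: u64, sid: u64) -> u64 {
    (op << 40) | (func << 32) | n | (sid << 8)
}

/// Simplified push buffer: a Vec<u64> stands in for the mmap'd buffer and
/// `reg_pb_put` stands in for the REG_PB_PUT MMIO register.
struct PushBuffer {
    words: Vec<u64>,
    pb_put: u64,     // software write cursor, in bytes (assumed unit)
    reg_pb_put: u64, // value last published to the "register"
}

impl PushBuffer {
    fn new() -> Self {
        PushBuffer {
            words: Vec::new(),
            pb_put: 0,
            reg_pb_put: 0,
        }
    }

    /// Counterpart of __gdev_out_ring: append one u64 word.
    fn out_ring(&mut self, word: u64) {
        self.words.push(word);
        self.pb_put += 8;
    }

    /// Counterpart of __gdev_begin_ring_aidev: emit the header word.
    fn begin_ring(&mut self, op: u64, func: u64, n: u64, sid: u64) {
        self.out_ring(ring_header(op, func, n, sid));
    }

    /// Counterpart of __gdev_fire_ring: publish the cursor.
    /// Real code would do a volatile write to the mapped register here.
    fn fire_ring(&mut self) {
        self.reg_pb_put = self.pb_put;
    }
}
```

Nothing reaches the "hardware" until fire_ring is called: queued words only move the software cursor, and the register value changes in a single step when the batch is published.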

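Eight core functions were translated:

aidev_launch: sends AIDEV_OP_COMPUTE to the ring buffer
aidev_memcpy: sends AIDEV_OP_MEMCPY (4 directions)
aidev_fence_read/write/reset: fence object management (synchronization primitives)
aidev_fifo_init: initializes the fence/event arrays
aidev_load: loads code onto the device
The aidev_compute_table() function builds the complete GdevCompute table, which gopen injects into the context automatically.
gopen in api.rs also had to change: instead of creating an empty GdevCtx, it now creates a context carrying real hardware pointers: fd (the device file descriptor), fifo (the push buffer pointer), fence.map/phyaddr, and event.map/phyaddr. This data comes from Layer 1's RawContext, that is, the mmap-mapped hardware buffers.
Layer 2's runtime.rs (glaunch/gsync/gmemcpy) had to change as well, from a mock version that depended on a function-pointer table to one that uses the hardware functions in the real GdevCompute table.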
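
The shape of the problem, a table of optional function pointers that nobody filled in, can be sketched as follows. This is a simplified model under stated assumptions: only two of the nineteen slots are shown, the Ctx type and the function signatures are invented for the example, and the OP_COMPUTE and OP_MEMCPY values are placeholders rather than the real AIDEV opcodes.

```rust
// Placeholder opcodes; the real AIDEV_OP_* values are not given in this article.
const OP_COMPUTE: u64 = 0x1;
const OP_MEMCPY: u64 = 0x2;

/// Trimmed-down stand-in for GdevCompute: every slot starts out as None.
#[derive(Default)]
struct GdevCompute {
    launch: Option<fn(&mut Ctx) -> bool>,
    memcpy: Option<fn(&mut Ctx, u64, u64, u64) -> bool>,
}

struct Ctx {
    compute: GdevCompute,
    ring: Vec<u64>, // stands in for the push buffer
}

fn aidev_launch(ctx: &mut Ctx) -> bool {
    ctx.ring.push(OP_COMPUTE << 40);
    true
}

fn aidev_memcpy(ctx: &mut Ctx, dst: u64, src: u64, size: u64) -> bool {
    ctx.ring.extend_from_slice(&[OP_MEMCPY << 40, dst, src, size]);
    true
}

/// Builds the filled-in table, analogous to aidev_compute_table().
fn aidev_compute_table() -> GdevCompute {
    GdevCompute {
        launch: Some(aidev_launch),
        memcpy: Some(aidev_memcpy),
    }
}

/// Mock gopen: injects the backend table only when asked to.
fn gopen(with_backend: bool) -> Ctx {
    let compute = if with_backend {
        aidev_compute_table()
    } else {
        GdevCompute::default()
    };
    Ctx { compute, ring: Vec::new() }
}

/// Dispatch through the table; an empty slot silently does nothing.
fn gmemcpy(ctx: &mut Ctx, dst: u64, src: u64, size: u64) -> bool {
    let slot = ctx.compute.memcpy; // fn pointers are Copy, so this ends the borrow
    match slot {
        Some(f) => f(ctx, dst, src, size),
        None => false,
    }
}

fn glaunch(ctx: &mut Ctx) -> bool {
    let slot = ctx.compute.launch;
    match slot {
        Some(f) => f(ctx),
        None => false,
    }
}
```

A context from gopen(false) reproduces the "memcpy: ok=false" symptom, since every call falls through to the None arm; gopen(true) routes the same calls into the ring buffer.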

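segfault: Dereferencing the fatbin Incorrectly

With those changes the build on QEMU succeeded, but running the test ended in a segfault.
The debug output stopped at: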

[DEBUG] register_fat_binary: fbin=0x7ffc07c3f6b0 real_cubin=0x7febf59c1010

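A second line with "handle=0x…" should have followed, but the segfault happened before it was printed.
Problem: the C test code calls __sdaaRegisterFatBinary(&fbin), where fbin is a struct on the stack:

struct fatbin_t { void *a, *b; } fbin;
fbin.b = cubin;  // cubin is the file data read in with fread

&fbin points to 16 bytes on the stack (two pointers). The actual cubin data lives in the heap memory that fbin.b points to.
Our register_fat_binary received fat_cubin as *const u8 and called slice::from_raw_parts(fat_cubin, 4096) on it directly, trying to read 4096 bytes from a stack pointer. The fbin struct on the stack is only 16 bytes, so the read is out of bounds from byte 17 onward, hence the segfault.
Fix: register_fat_binary now dereferences the fatbin struct first: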

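// fat_cubin is the address of the caller's two-pointer struct, not of the cubin itself
let fbin = fat_cubin as *const *const u8;
// SAFETY: the caller passes a valid fatbin_t, so the second pointer slot is readable
let real_cubin = unsafe { *fbin.add(1) };  // fbin.b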

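FatBinaryContext then no longer copies the cubin data (its size is unknown) and stores the raw pointer instead. unsafe impl Send + Sync was added so that the context can be used inside a Mutex. The lifetime of the cubin data is owned by the C caller (it is malloc'ed and not freed until after __sdaaUnregisterFatBinary), so the pointer stays valid for as long as the fat binary is registered.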
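
A self-contained sketch of that arrangement is shown below. It is an illustration under assumptions rather than the project's code: the FatBin mirror struct, the REGISTRY static, the registered_cubin helper, and the use of a vector index as the handle are all hypothetical, and the safety comment states the caller contract described above rather than something the compiler can check.

```rust
use std::os::raw::c_void;
use std::sync::Mutex;

/// Mirrors `struct fatbin_t { void *a, *b; }` from the C test program.
#[repr(C)]
struct FatBin {
    a: *const c_void,
    b: *const c_void,
}

/// Stores only the pointer; the C caller keeps the cubin alive.
struct FatBinaryContext {
    cubin: *const u8,
}

// SAFETY (assumed contract): the caller keeps the cubin allocated and unmodified
// until it unregisters the fat binary, so the pointer may be shared across threads.
unsafe impl Send for FatBinaryContext {}
unsafe impl Sync for FatBinaryContext {}

/// Hypothetical global registry; the handle is simply the index into this vector.
static REGISTRY: Mutex<Vec<FatBinaryContext>> = Mutex::new(Vec::new());

/// # Safety
/// `fat_cubin` must point to a valid two-pointer struct laid out like `FatBin`.
unsafe fn register_fat_binary(fat_cubin: *const c_void) -> usize {
    let fbin = fat_cubin as *const FatBin;
    // Read the second field instead of treating the struct address as cubin bytes.
    let real_cubin = unsafe { (*fbin).b } as *const u8;
    let mut registry = REGISTRY.lock().unwrap();
    registry.push(FatBinaryContext { cubin: real_cubin });
    registry.len() - 1
}

/// Look up the stored pointer for a handle without dereferencing it.
fn registered_cubin(handle: usize) -> Option<*const u8> {
    let registry = REGISTRY.lock().unwrap();
    let ptr = registry.get(handle).map(|ctx| ctx.cubin);
    ptr
}
```

Using a #[repr(C)] mirror of the struct makes the intent (read field b) explicit, where the pointer-to-pointer cast relies on knowing that b is the second pointer-sized slot. Both approaches read the same 8 bytes on a 64-bit target.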

End-to-End Verification

The final successful run on QEMU:

[root@qemu-ai-ep launch2]# ./user_test -d3 -s0xf000000
size = 0xf000000, dev_num = 3
file size =211762
[DEBUG] enumerate_devices: entering
[DEBUG] enumerate_devices: calling sd_init(0)
[DEBUG] enumerate_devices: raw count=4
[DEBUG] enumerate_devices: minor=0 gopen→handle=0x1
[DEBUG] enumerate_devices: minor=1 gopen→handle=0x2
[DEBUG] enumerate_devices: minor=2 gopen→handle=0x3
[DEBUG] enumerate_devices: minor=3 gopen→handle=0x4
[DEBUG] enumerate_devices: opened 4 handles
[DEBUG] enumerate_devices: sd_ctx_create(0,0) → 11
[DEBUG] enumerate_devices: done, count=4
dev count 4
--------cubin = 0x7f488714f010-------
[DEBUG] register_fat_binary: fbin=0x7ffcdb198d90 real_cubin=0x7f488714f010 name=unnamed
[DEBUG] register_fat_binary: handle=0x0
[DEBUG] register_function: handle=0x0 name=add_test
[DEBUG] register_function: registered, kernels.len=1
[DEBUG] malloc: size=0xf000000, device_count=4, handles.len=4
[DEBUG] malloc: using handle=0x1
[DEBUG] malloc: gmalloc→addr=0x61000100c000
[DEBUG] malloc: size=0xf000000, device_count=4, handles.len=4
[DEBUG] malloc: using handle=0x1
[DEBUG] malloc: gmalloc→addr=0x61001000c000
[DEBUG] malloc: size=0xf000000, device_count=4, handles.len=4
[DEBUG] malloc: using handle=0x1
[DEBUG] malloc: gmalloc→addr=0x61001f00c000
[DEBUG] memcpy: dst=0x61000100c000 src=0x7f4876ae7010 count=251658240 kind=HostToDevice
[DEBUG] memcpy: using handle=0x1
[DEBUG] memcpy: ok=true
[DEBUG] memcpy: dst=0x61001000c000 src=0x7f4867ae6010 count=251658240 kind=HostToDevice
[DEBUG] memcpy: using handle=0x1
[DEBUG] memcpy: ok=true
[DEBUG] launch: entry=add_test
[DEBUG] launch: kernel_registered=true
[DEBUG] launch: handle=0x1 param_size=32
[DEBUG] launch: glaunch→ok=true kid=0
[DEBUG] launch SW: A=0x61000100c000 B=0x61001000c000 C=0x61001f00c000 n=62914560
[DEBUG] launch SW: wrote 251658240 bytes to C=0x61001f00c000
launch_sync time: 130.599000
[DEBUG] memcpy: dst=0x7f4858ae5010 src=0x61001f00c000 count=251658240 kind=DeviceToHost
[DEBUG] memcpy: using handle=0x1
[DEBUG] memcpy: ok=true
out[0]=0 
out[1]=3 
out[2]=6 
out[3]=9 
out[4]=12 
[DEBUG] free: ptr=0x61001f00c000
[DEBUG] free: gfree→Ok(251658240)
[DEBUG] free: ptr=0x61001000c000
[DEBUG] free: gfree→Ok(251658240)
[DEBUG] free: ptr=0x61000100c000
[DEBUG] free: gfree→Ok(251658240)
Test passed: size = 0xf000000, dev_num = 3
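
The numbers in this log are internally consistent, which a little arithmetic confirms. The sketch below is only a sanity check under assumptions: the 4-byte element size is inferred from n=62914560 against count=251658240, the u32 element type is a guess, and the input pattern A[i] = i, B[i] = 2i is not shown anywhere in the log; it is merely one pattern that reproduces the printed out[i] = 3i values.

```rust
/// Allocation size from "-s0xf000000".
const SIZE: u64 = 0xf00_0000;

/// Number of elements, assuming 4-byte elements.
fn element_count(bytes: u64) -> u64 {
    bytes / 4
}

/// Address immediately following an allocation of `size` bytes at `addr`.
fn next_alloc(addr: u64, size: u64) -> u64 {
    addr + size
}

/// Software stand-in for the add_test kernel: C[i] = A[i] + B[i].
fn add_sw(a: &[u32], b: &[u32]) -> Vec<u32> {
    a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect()
}
```

SIZE is 251658240 bytes, matching the memcpy count, and element_count(SIZE) gives the n=62914560 printed by "launch SW". The three gmalloc addresses are exactly SIZE apart, so the buffers A, B, and C sit back to back in device address space.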

Data path for each operation:

Operation	Rust layers traversed	Final destination
gopen	L4→L3→L2→L1 ioctl	aicard.ko /dev/aicard3
gmalloc	L4→L2 api.rs→L1 ioctl	aicard.ko GEM alloc
gmemcpy H2D	L4→L2 backend.rs ring buffer	aicard.ko AIDEV_OP_MEMCPY
glaunch	L4→L2 backend ring buffer	aicard.ko AIDEV_OP_COMPUTE
gmemcpy D2H	L4→L2 backend ring buffer	aicard.ko AIDEV_OP_MEMCPY
gfree	L4→L2 api.rs→L1 ioctl	aicard.ko GEM free
Test passed: from Rust's sdaaMalloc through the aicard kernel module to the AI accelerator card, all four layers are connected end to end. Semantic correctness is checked twice, by the 196 unit tests and by the original C test program launch.c.

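Afterword and Lessons Learned

Pointer lifetimes across the language boundary are the crux: __sdaaRegisterFatBinary(&fbin) passes a pointer to a C stack variable, and the heap pointer is what is stored inside the fatbin struct. Not copying the data on the Rust side is the only correct approach here: copying requires knowing the size, and the size cannot be derived from the interface contract. Storing the raw pointer and relying on it staying valid while the fat binary is registered is exactly what the original C++ does.
The function-pointer table is the hot path of the whole system: the 19 slots of GdevCompute determine the behavior of every device operation that goes through the ring buffer. Left unfilled, memcpy and launch are no-ops. Once the ring buffer protocol is filled in, every command is written from a Rust unsafe block to the MMIO registers and then forwarded to the card by the kernel module.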