The GPU Memory Subsystem as a Mirror of CPU Memory Management: A Structural Analogy

DeeplyMind


DeeplyMind · Published 2026-05-30 12:16:12

Abstract

This document presents a systematic analysis of the structural and functional parallels between the Linux kernel’s GPU memory management subsystem (TTM, GEM, GPUVM) and the classical CPU memory management (MM) subsystem. We argue that the GPU memory stack is not merely inspired by CPU MM—it is a re-derivation of the same fundamental principles under different hardware constraints. The analogy is not incidental; it is architecturally inevitable, deeply embedded in the kernel’s DRM subsystem through shared terminology, identical algorithmic patterns, and converging design trajectories.

1. Thesis

The Linux GPU memory management subsystem (TTM/GEM/GPUVM) constitutes a domain-specific re-implementation of classical CPU virtual memory concepts—virtual address spaces, demand paging, LRU-based eviction, and swap—adapted to the constraints of discrete accelerator memory hierarchies.

This is evidenced by:

Direct lexical borrowing (swap_storage, TTM_TT_FLAG_SWAPPED, ttm_tt_swapin())
Isomorphic data structure design (drm_gpuvm ↔ mm_struct, drm_gpuva ↔ vm_area_struct)
Identical algorithmic strategies (LRU eviction scanning, shrinker integration, fault-driven population)
Converging evolution (GPU fault handling moving toward CPU-style demand paging via SVM/HMM)

2. Argumentation

2.1 Address Space Management

CPU MM	GPU MM (DRM)	Structural Role
`mm_struct`	`struct drm_gpuvm`	Per-context virtual address space container
`vm_area_struct`	`struct drm_gpuva`	Contiguous VA region mapped to a backing object
VMA maple tree (red-black tree before Linux 6.1)	GPUVM interval rb-tree	Spatial indexing for fast lookup
`mmap()` / `munmap()`	`drm_gpuvm_sm_map()` / `drm_gpuvm_sm_unmap()`	User-facing VA space mutation
VMA split/merge on partial unmap	`drm_gpuva_op_remap` (split)	Maintaining VA space consistency

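The drm_gpuvm documentation, in wording that appears to date from early revisions of the GPU VA manager (the merged implementation tracks mappings in an interval rb-tree), states: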

“The DRM GPU VA Manager keeps track of a GPU’s virtual address space by using maple_tree structures… There should be one manager instance per GPU virtual address space.”

This plays the same role as mm_struct managing a process’s virtual address space. The kernel_alloc_node member of struct drm_gpuvm, which reserves a cutout of the GPU VA space for kernel use, mirrors the kernel-reserved range of a process’s address space.
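The split-on-partial-unmap row in the table above can be illustrated with a small user-space sketch. Everything here is hypothetical: struct toy_va and toy_unmap() are stand-ins invented for illustration, not the kernel's struct drm_gpuva or its split/merge code. The sketch only shows how removing the middle of a mapping leaves a left ("prev") and a right ("next") remainder, which is the situation a remap operation describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a mapped VA region; not struct drm_gpuva. */
struct toy_va {
    uint64_t addr;   /* start of the region */
    uint64_t range;  /* length in bytes */
    uint64_t offset; /* offset into the backing object */
};

/*
 * Remove [addr, addr + range) from mapping m. The surviving pieces
 * (0, 1 or 2 of them) are written to out[]; their count is returned.
 */
static int toy_unmap(const struct toy_va *m, uint64_t addr, uint64_t range,
                     struct toy_va out[2])
{
    uint64_t m_end = m->addr + m->range;
    uint64_t end = addr + range;
    int n = 0;

    assert(range > 0);

    /* No overlap: the mapping survives unchanged. */
    if (end <= m->addr || addr >= m_end) {
        out[n++] = *m;
        return n;
    }
    /* Left remainder ("prev"): keeps the original backing offset. */
    if (addr > m->addr) {
        out[n].addr = m->addr;
        out[n].range = addr - m->addr;
        out[n].offset = m->offset;
        n++;
    }
    /* Right remainder ("next"): backing offset advances past the hole. */
    if (end < m_end) {
        out[n].addr = end;
        out[n].range = m_end - end;
        out[n].offset = m->offset + (end - m->addr);
        n++;
    }
    return n;
}
```

Unmapping 0x2000..0x3000 from a mapping covering 0x1000..0x5000 yields two pieces; the CPU side performs the same bookkeeping when munmap() punches a hole in a VMA.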

2.2 Backing Storage and Placement

CPU MM	GPU MM (TTM)	Structural Role
Physical RAM (zones)	VRAM (`TTM_PL_VRAM`)	Fast, limited primary storage
Swap device	System memory (`TTM_PL_SYSTEM`)	Slower, larger overflow storage
`struct page`	`struct ttm_resource`	Unit of physical allocation tracking
Page frame allocation (buddy)	`gpu_buddy_alloc_blocks()`	Power-of-two physical allocator
NUMA node affinity	`ttm_place.mem_type`	Placement preference hierarchy

The TTM placement system (struct ttm_placement with an ordered array of struct ttm_place) is analogous to NUMA memory policies—expressing a preference hierarchy for where memory should physically reside.
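The idea of an ordered preference list can be sketched as follows. This is a simplified model under stated assumptions: the toy_* names are invented, and real TTM may evict other buffers to make room in a preferred placement rather than simply falling through to the next one as this sketch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory types, loosely modelled on TTM_PL_*. */
enum toy_mem_type { TOY_PL_SYSTEM, TOY_PL_TT, TOY_PL_VRAM, TOY_PL_COUNT };

/* Ordered list of acceptable placements, most preferred first. */
struct toy_placement {
    const enum toy_mem_type *places;
    size_t num_places;
};

/*
 * Return the first memory type in preference order that has at least
 * `size` bytes free, or -1 if none does.
 */
static int toy_pick_place(const struct toy_placement *p,
                          const uint64_t free_bytes[TOY_PL_COUNT],
                          uint64_t size)
{
    assert(p && p->places);

    for (size_t i = 0; i < p->num_places; i++) {
        enum toy_mem_type t = p->places[i];

        if (free_bytes[t] >= size)
            return (int)t;
    }
    return -1;
}
```

A caller that prefers VRAM but tolerates system memory lists VRAM first; the same shape appears in a NUMA policy that prefers a local node but permits remote ones.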

2.3 The Swap Analogy

This is the most explicit parallel, with TTM directly adopting CPU MM terminology:

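struct ttm_tt {
    struct page **pages;            /* backing pages in system memory */
#define TTM_TT_FLAG_SWAPPED  BIT(0) /* CPU swap terminology: contents live in swap_storage */
    uint32_t page_flags;            /* bitmask of TTM_TT_FLAG_* values */
    struct file *swap_storage;      /* shmem file backing, analogous to swap */
    ...
};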

The eviction-to-system-memory path in TTM is structurally identical to page swap-out:

CPU Swap-Out	GPU Eviction (TTM)
Select victim via LRU scan	Select BO via `ttm_resource_manager.lru[]`
Write page contents to swap device	Move BO contents to system memory / shmem (`swap_storage`)
Replace PTE with swap entry	Update `ttm_resource` placement; set `TTM_TT_FLAG_SWAPPED`
Free physical page frame	Free VRAM allocation

The swap-in / fault-in path:

CPU Swap-In	GPU Re-validation
Page fault triggers `do_swap_page()`	Command submission triggers `drm_gpuvm_validate()`
Read from swap, allocate page frame	Call `ttm_bo_validate()` → allocate VRAM, copy back
Install PTE pointing to new frame	Update GPU page table entry
Clear swap entry	`ttm_tt_swapin()` restores the pages and clears `TTM_TT_FLAG_SWAPPED`

The GEM VRAM documentation makes this explicit:

“If there’s no more space left in VRAM, inactive GEM objects can be moved to system memory.”

This sentence is the GPU equivalent of: “If there’s no more physical RAM, inactive pages can be moved to swap.”
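The evict and re-validate cycle from the two tables in this section can be modelled in a few lines. This is a toy sketch, not TTM code: the toy_* types and functions are hypothetical, and a single flag collapses what TTM treats as separate stages (placement in system memory and swap-out to shmem).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_where { TOY_IN_VRAM, TOY_IN_SYSTEM };

/* Hypothetical buffer object. */
struct toy_bo {
    uint64_t size;
    enum toy_where where;
    bool swapped;           /* plays the role of TTM_TT_FLAG_SWAPPED */
};

struct toy_vram {
    uint64_t free_bytes;
};

/* "Swap-out": move the BO out of VRAM and release its VRAM space. */
static void toy_evict(struct toy_vram *v, struct toy_bo *bo)
{
    assert(bo->size > 0);
    if (bo->where != TOY_IN_VRAM)
        return;
    bo->where = TOY_IN_SYSTEM;
    bo->swapped = true;
    v->free_bytes += bo->size;
}

/*
 * "Swap-in": make the BO resident in VRAM again before the GPU uses it.
 * Returns 0 on success, -1 if VRAM is too full (caller must evict first).
 */
static int toy_validate(struct toy_vram *v, struct toy_bo *bo)
{
    assert(bo->size > 0);
    if (bo->where == TOY_IN_VRAM)
        return 0;
    if (v->free_bytes < bo->size)
        return -1;
    v->free_bytes -= bo->size;
    bo->where = TOY_IN_VRAM;
    bo->swapped = false;
    return 0;
}
```

With 100 bytes of VRAM and two 60-byte objects, validating the second fails until the first is evicted, which is the same pressure-driven cycle a CPU goes through with RAM and swap.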

2.4 LRU-Based Eviction and Reclaim

Both subsystems use LRU (Least Recently Used) as the primary eviction policy:

CPU MM:

Active/inactive LRU lists per memory zone
kswapd daemon performs background reclaim
Shrinker callbacks for slab caches
lru_gen (multi-generational LRU) for improved aging

GPU MM (TTM/GEM):

ttm_resource_manager.lru[TTM_MAX_BO_PRIORITY] — priority-aware LRU per memory type
drm_gem_lru with drm_gem_lru_scan() — shrinker integration for GEM objects
ttm_pool_type.shrinker_list — TTM page pools registered as kernel shrinkers
drm_mm_scan_init() / drm_mm_scan_add_block() — LRU scan for contiguous eviction

The DRM MM documentation describes the eviction scan pattern:

“Eviction candidates are added using drm_mm_scan_add_block() until a suitable hole is found or there are no further evictable objects.”

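This follows the same pattern as the CPU MM’s reclaim path (shrink_inactive_list() and its callers), which scans candidates until enough memory has been freed. The difference is the stopping condition: drm_mm stops once a sufficiently large contiguous hole exists, whereas CPU reclaim stops once a page-count target is met.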
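A priority-aware LRU scan of the kind listed above might look like the sketch below. The data layout is an assumption made for brevity (fixed-size arrays instead of linked lists), and all toy_* names are hypothetical. It walks the lowest priority bucket first and, within a bucket, the least recently used entry first, stopping as soon as enough bytes have been freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PRIORITY 4
#define TOY_MAX_PER_LIST 8

/* One LRU list per priority; index 0 of each list is least recently used. */
struct toy_lru {
    uint64_t sizes[TOY_MAX_PRIORITY][TOY_MAX_PER_LIST];
    size_t count[TOY_MAX_PRIORITY];
};

/*
 * Evict entries until at least `needed` bytes are freed or nothing is
 * left to evict. Returns the number of bytes actually freed.
 */
static uint64_t toy_lru_evict(struct toy_lru *lru, uint64_t needed)
{
    uint64_t freed = 0;

    for (size_t prio = 0; prio < TOY_MAX_PRIORITY && freed < needed; prio++) {
        size_t taken = 0;

        assert(lru->count[prio] <= TOY_MAX_PER_LIST);

        /* Take victims from the cold end of this priority's list. */
        while (taken < lru->count[prio] && freed < needed)
            freed += lru->sizes[prio][taken++];

        /* Shift the survivors down so index 0 is again the oldest. */
        for (size_t i = taken; i < lru->count[prio]; i++)
            lru->sizes[prio][i - taken] = lru->sizes[prio][i];
        lru->count[prio] -= taken;
    }
    return freed;
}
```

The scan may free more than requested because victims are evicted whole, just as CPU reclaim frees whole pages.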

2.5 Demand Paging and Fault Handling

CPU MM	GPU MM	Structural Role
`handle_mm_fault()`	GEM fault handler (`vm_operations_struct.fault`)	On-demand page/BO population
Lazy allocation (allocate on first touch)	`ttm_tt_populate()` on first use	Defer physical allocation
`FAULT_FLAG_WRITE` → CoW	Pin on write / migrate on access	Access-type-specific handling

The GEM documentation states:

“Drivers are responsible for the actual physical pages allocation by calling shmem_read_mapping_page_gfp() for each page. Note that they can decide to allocate pages when initializing the GEM object, or to delay allocation until the memory is needed (for instance when a page fault occurs).”

This is textbook demand paging, transplanted into the GPU domain.
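Deferred allocation can be sketched in user space as follows. The toy_obj type and its functions are hypothetical and use calloc() in place of real page allocation; the point is only that creating the object allocates no backing pages, and a page comes into existence the first time an offset inside it is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical object whose backing pages are allocated lazily. */
struct toy_obj {
    size_t num_pages;
    void **pages;      /* a NULL slot means "not yet populated" */
    size_t populated;  /* number of pages allocated so far */
};

/* Create the object; only the page-pointer array is allocated here. */
static int toy_obj_init(struct toy_obj *o, size_t num_pages)
{
    o->pages = calloc(num_pages, sizeof(*o->pages));
    if (!o->pages)
        return -1;
    o->num_pages = num_pages;
    o->populated = 0;
    return 0;
}

/*
 * "Fault handler": return the page backing the given byte offset,
 * allocating it on first touch. Returns NULL for an out-of-range
 * offset or on allocation failure.
 */
static void *toy_obj_fault(struct toy_obj *o, size_t offset)
{
    size_t idx = offset / TOY_PAGE_SIZE;

    assert(o && o->pages);
    if (idx >= o->num_pages)
        return NULL;
    if (!o->pages[idx]) {
        o->pages[idx] = calloc(1, TOY_PAGE_SIZE);
        if (!o->pages[idx])
            return NULL;
        o->populated++;
    }
    return o->pages[idx];
}

/* Release every populated page and the pointer array. */
static void toy_obj_fini(struct toy_obj *o)
{
    for (size_t i = 0; i < o->num_pages; i++)
        free(o->pages[i]);
    free(o->pages);
    o->pages = NULL;
}
```

A four-page object touched at a single offset ends up with exactly one populated page; a second access to the same page returns the same pointer without allocating again.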

2.6 Madvise and Purgeability

CPU MM	GPU MM	Structural Role
`madvise(MADV_DONTNEED)`	`DRM_GEM_OBJECT_PURGEABLE` / `shmem.madv`	Hint: memory can be reclaimed
`madvise(MADV_WILLNEED)`	Prefetch operations (`DRM_GPUVA_OP_PREFETCH`)	Hint: memory will be needed soon
Page marked as clean → free without write-back	`ttm_backup_flags.purge` — free without backing up	Optimization: skip write-back
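The reclaim decision implied by the table can be written as a small pure function. The enum names are invented for illustration and do not correspond to kernel identifiers; the sketch assumes only two advice states and treats any in-use object as unreclaimable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical advice states, in the spirit of madvise() hints. */
enum toy_madv { TOY_WILLNEED, TOY_DONTNEED };

enum toy_reclaim {
    TOY_RECLAIM_SKIP,    /* object is busy: leave it alone */
    TOY_RECLAIM_BACKUP,  /* save contents first, then free the pages */
    TOY_RECLAIM_PURGE    /* contents are disposable: free without saving */
};

/* Decide what a shrinker-style pass should do with one object. */
static enum toy_reclaim toy_reclaim_action(bool in_use, enum toy_madv madv)
{
    assert(madv == TOY_WILLNEED || madv == TOY_DONTNEED);

    if (in_use)
        return TOY_RECLAIM_SKIP;
    return madv == TOY_DONTNEED ? TOY_RECLAIM_PURGE : TOY_RECLAIM_BACKUP;
}
```

The purge branch is the cheap one on both sides of the analogy, because it avoids the write-back that dominates the cost of reclaim.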

2.7 Hibernation as Full Swap-Out

TTM even handles system hibernation by treating it as a complete swap-out:

int ttm_device_prepare_hibernation(struct ttm_device *bdev);
// "move GTT BOs to shmem for hibernation"

This is the GPU equivalent of the CPU MM writing all active pages to swap during suspend-to-disk.

2.8 Convergence: GPU SVM and HMM

The analogy is not static—it is converging. Modern GPU drivers (AMD KFD SVM, Intel Xe SVM) now implement true shared virtual memory where:

GPU and CPU share the same virtual address space
GPU page faults are handled like CPU page faults
HMM (hmm_range_fault()) bridges the two worlds
Device-private pages (MEMORY_DEVICE_PRIVATE) appear as swap entries in CPU page tables

This convergence supports the thesis: the GPU memory subsystem was always solving the same problem as CPU MM, and HMM now couples the two directly.
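The way a non-present page table entry can stand for either a swapped-out page or a device-private page can be sketched with a toy encoding. The bit layout below is entirely hypothetical and does not match any real architecture or the kernel's swap entry format; it only illustrates that one "not present" entry carries a type field telling the fault path whether to read from swap or migrate the page back from device memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE layout: bit 0 = present, bits 1-5 = type, rest = offset. */
#define TOY_PTE_PRESENT  (1ull << 0)
#define TOY_TYPE_SHIFT   1
#define TOY_TYPE_MASK    0x1full
#define TOY_OFFSET_SHIFT 6

enum toy_entry_type { TOY_SWAP_DISK = 1, TOY_DEVICE_PRIVATE = 2 };
enum toy_fault_action { TOY_NO_FAULT, TOY_SWAP_IN, TOY_MIGRATE_TO_RAM };

/* Build a non-present entry of the given type. */
static uint64_t toy_make_entry(enum toy_entry_type type, uint64_t offset)
{
    assert((uint64_t)type <= TOY_TYPE_MASK);
    return ((uint64_t)type << TOY_TYPE_SHIFT) | (offset << TOY_OFFSET_SHIFT);
}

static bool toy_is_present(uint64_t pte)
{
    return (pte & TOY_PTE_PRESENT) != 0;
}

static enum toy_entry_type toy_get_type(uint64_t pte)
{
    return (enum toy_entry_type)((pte >> TOY_TYPE_SHIFT) & TOY_TYPE_MASK);
}

static uint64_t toy_entry_offset(uint64_t pte)
{
    return pte >> TOY_OFFSET_SHIFT;
}

/* What a CPU access to this entry would have to do. */
static enum toy_fault_action toy_cpu_touch(uint64_t pte)
{
    if (toy_is_present(pte))
        return TOY_NO_FAULT;
    if (toy_get_type(pte) == TOY_DEVICE_PRIVATE)
        return TOY_MIGRATE_TO_RAM;  /* pull the page back from the device */
    return TOY_SWAP_IN;             /* ordinary swap-in from disk */
}
```

From the CPU's point of view, a page living in device memory is handled through the same not-present fault path as a page living on disk, which is why the swap analogy extends naturally to HMM.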

3. Architectural Mapping (Complete)

┌─────────────────────────────────────────────────────────────────┐
│                    STRUCTURAL ISOMORPHISM                       │
├──────────────────────────┬──────────────────────────────────────┤
│       CPU MM             │          GPU MM (DRM/TTM)            │
├──────────────────────────┼──────────────────────────────────────┤
│ mm_struct                │ drm_gpuvm                            │
│ vm_area_struct           │ drm_gpuva                            │
│ struct page / folio      │ ttm_resource / drm_gem_object        │
│ Physical RAM             │ VRAM (TTM_PL_VRAM)                   │
│ Swap space               │ System memory (TTM_PL_SYSTEM)        │
│ Page tables (PGD→PTE)    │ GPU page tables (driver-specific)    │
│ Buddy allocator          │ gpu_buddy / drm_mm range allocator   │
│ LRU lists + kswapd       │ ttm_resource_manager.lru[] + shrinker│
│ do_swap_page()           │ ttm_tt_swapin()                      │
│ swap entry in PTE        │ TTM_TT_FLAG_SWAPPED                  │
│ madvise(MADV_DONTNEED)   │ DRM_GEM_OBJECT_PURGEABLE             │
│ handle_mm_fault()        │ ttm_tt_populate() / GEM fault handler│
│ migrate_pages()          │ ttm_bo_validate() (move between mems)│
│ mmu_notifier             │ drm_gpuvm_bo_evict() callbacks       │
│ /proc/pid/maps           │ debugfs GPU VA dump                  │
│ OOM killer               │ Eviction failure → -ENOMEM           │
└──────────────────────────┴──────────────────────────────────────┘

4. Why the Analogy Is Architecturally Inevitable

The convergence is not coincidental. Both subsystems solve the same abstract problem:

Given a processor with a virtual address space larger than its fast local memory, multiplex limited physical storage among competing consumers using indirection (page tables), lazy allocation (demand paging), and capacity management (eviction/swap).

The only differences are:

Granularity: CPU MM operates at page granularity (4 KB base pages, with 2 MB or 1 GB huge pages on x86-64); GPU MM often operates at buffer-object granularity (KB to GB).
Coherence model: CPU has hardware cache coherence; GPU requires explicit flush/invalidate or domain transitions.
Fault latency tolerance: CPU page faults stall a single thread; GPU “faults” (eviction+revalidation) are batched at submission time (though modern GPUs now support true page faults).
Multiplexing unit: CPU multiplexes among processes; GPU multiplexes among buffer objects (though GPUVM now provides per-process GPU VA spaces).

5. Documented Sources

5.1 Primary Kernel Documentation

DRM Memory Management — The canonical reference. TTM, GEM, GPUVM, DRM MM, Buddy Allocator all documented here. Contains the swap_storage, LRU, shrinker, and eviction APIs.
DRM GPUVM — Documents the GPU virtual address space manager with eviction tracking, split/merge, and validation.

5.2 In-Tree Source Code (self-documenting)

File	Relevant Analogy
`drivers/gpu/drm/ttm/ttm_tt.c`	`ttm_tt_swapin()`, `ttm_tt_swapout()`, `swap_storage`
`drivers/gpu/drm/ttm/ttm_bo.c`	LRU eviction, `ttm_bo_validate()`
`drivers/gpu/drm/ttm/ttm_pool.c`	Page pool with shrinker (mirrors slab shrinker)
`drivers/gpu/drm/drm_gpuvm.c`	GPU VA space management, eviction lists
`drivers/gpu/drm/drm_gem.c`	`drm_gem_lru_scan()` — shrinker for GEM objects
`include/drm/ttm/ttm_tt.h`	`TTM_TT_FLAG_SWAPPED`, `struct ttm_tt`

5.3 Conference Presentations and Articles

XDC (X.Org Developer’s Conference) — Christian König’s presentations on TTM rework explicitly discuss the memory hierarchy and eviction model.
LWN.net — Articles on DRM memory management, GPUVM (Danilo Krummrich’s series, 2023), and VM_BIND discuss GPU VA management in terms familiar to CPU MM developers.
“GEM - the Graphics Execution Manager” (LWN, 2008) — The foundational article establishing GEM’s design philosophy around shmem-backed objects.

5.4 The HMM Bridge

The HMM (Heterogeneous Memory Management) subsystem is the ultimate evidence of this analogy, as it literally bridges the two:

Device-private pages appear as swap-like entries in CPU page tables
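hmm_range_fault() walks the CPU page tables on behalf of a device, faulting pages in through handle_mm_fault() where needed, so the device can mirror the CPU mappings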
migrate_vma_*() extends migrate_pages() to device memory

6. Conclusion

The GPU memory management subsystem in Linux is not merely analogous to CPU memory management—it is a convergent re-derivation of the same solutions to the same fundamental problem of virtual memory management under physical scarcity. The evidence is:

Lexical: TTM uses CPU MM terminology (swap, swapin, populate, evict, LRU, shrinker).
Structural: drm_gpuvm/drm_gpuva mirror mm_struct/vm_area_struct in role and implementation.
Algorithmic: LRU-based eviction scanning, demand population, and shrinker integration follow identical patterns.
Evolutionary: The two subsystems are actively converging through HMM/SVM, with GPU drivers now participating directly in CPU MM’s page fault and migration infrastructure.

For kernel developers, understanding CPU MM provides a direct conceptual framework for understanding GPU memory management—and vice versa. The DRM subsystem’s memory management is best understood not as a novel design, but as classical virtual memory theory applied to a different class of processor.
