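rasdaemon Fault Codes and System Impact: Complete Reference Manual

Scope: all rasdaemon fault domains (MCE / EDAC / AER / ARM / CXL / extlog / non-standard / memory-failure / devlink / diskerror / signal / reri / erst)
Version baseline: current master branch (as of 2026-06-02)
Nature: error-code dictionary + system-impact analysis + source traceability (file:line)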


0. Overview: which events rasdaemon handles

rasdaemon is a passive observer and recorder of Linux RAS (Reliability, Availability, Serviceability) events.

Domain Trace event Main source files
MCE ras:mce_record ras-mce-handler.c + 16 mce-*.c decoders
EDAC/MC ras:mc_event ras-mc-handler.c, ras-page-isolation.c
PCIe AER ras:aer_event ras-aer-handler.c
ARM ras:arm_event ras-arm-handler.c
CXL 9 cxl:* events ras-cxl-handler.c
extlog ras:extlog_mem_event ras-extlog-handler.c
non-standard various CPER sections ras-non-standard-handler.c + 6 vendor decoders
memory-failure ras:memory_failure_event ras-memory-failure-handler.c
devlink devlink:devlink_health_report, net:net_dev_xmit_timeout ras-devlink-handler.c
diskerror block:block_rq_error ras-diskerror-handler.c
signal signal:signal_generate ras-signal-handler.c
reri riscv_reri_event ras-reri-handler.c
erst files: /sys/fs/pstore/erst/mce-erst* ras-erst.c

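rasdaemon does not panic the system, reset CPUs, kill processes, or offline pages by itself; those policies are all implemented by the kernel. rasdaemon only does the following:

  1. Parses events
  2. Prints to syslog / the console
  3. Writes to SQLite (if --enable-sqlite3)
  4. Reports to ABRT (if --enable-abrt-report)
  5. Triggers user scripts (AER_CE_TRIGGER / AER_UE_TRIGGER / MC_CE_TRIGGER / MC_UE_TRIGGER / MEM_FAIL_TRIGGER)
  6. Requests physical page retirement (only for EDAC CE / CXL DER CE-threshold events, by writing to /sys/devices/system/memory/{soft,hard}_offline_page)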

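1. MCE (Machine Check Exception): full error-code list

1.1 Generic MCA architectural error codes (shared by all Intel CPUs)

Definition: mce-intel.c:98-106 (mca_msg[])

MCA Code Name System impact
0x0000 No Error
0x0001 Unclassified unclassified error, usually an internal hardware fault
0x0002 Microcode ROM Parity Error microcode ROM parity error; a microcode update may be needed; usually UC+AR and triggers a package reset
0x0003 External Error MCE raised through an external pin; the source has to be traced in hardware
0x0004 FRC (Functional Redundancy Check) Error FRC master/checker mismatch; recoverable
0x0005 Internal Parity Error internal parity error (cache/tag/buffer); usually UC+PCC
0x0006 SMM Handler Code Access Violation SMM handler code access violation (a serious SMM vulnerability)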

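MCA prefix classification (mce-intel.c:221-286). The decoder tests which bit is the highest set bit of the MCA code (bit 4, 7, 8, 10 or 11), so the ranges below are given as the resulting code values:

Code range Category Meaning
0x0000..0x0006 the 7 entries above basic architectural errors
bit 12 set soft hint “corrected filtering: same region has more unreported errors”
0x000C..0x000F generic memory-hierarchy errors LL0…LL3 generic memory errors
0x0010..0x001F (top bit 4) TLB errors TT (Instruction/Data/Generic) + LL (L0…L3)
0x0100..0x01FF (top bit 8) cache-hierarchy errors TT/LL + RRRR (Read/Write/Instruction-Fetch/Prefetch/Eviction/Snoop, etc.)
0x0400..0x07FF (top bit 10) internal unclassified 0x0400 = Internal Timer error
0x0800..0x0FFF (top bit 11) bus/interconnect errors LL + PP (Local-CPU-originated/Responded/Third-party/Generic) + RRRR + II + T
0x0080..0x00FF (top bit 7) memory-controller errors handled by decode_memory_controller()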

AR state bits (status[55:56]) → arstate[] (mce-intel.c:114-119):

State bits Name Meaning System impact
S=0, AR=0 UCNA Uncorrected, No Action system continues, but the event must be logged
S=0, AR=1 AR Uncorrected, Action Required needs a reset/reboot
S=1, AR=0 SRAO Software Recoverable, Action Optional context may be recoverable
S=1, AR=1 SRAR Software Recoverable, Action Required RIPV=1, software can continue

Global status bits, decode_mci() (mce-intel.c:299-332):

Bit Name Meaning
bit 63 VAL register valid (otherwise MCE_INVALID)
bit 62 OVER error overflow (further errors were dropped)
bit 61 UC uncorrected (1=UC, 0=CE)
bit 60 EN error enabled
bit 57 PCC Processor Context Corrupt
bit 56 S Signaling (the error was reported through a machine-check exception)
bit 55 AR Action Required
bits 54-53 tracking color green = normal / yellow = “Large number of corrected cache errors” warning / res3
bit 44 Deferred (AMD) deferred error
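
The two tables above can be combined into a small decoder. The following is a minimal Python sketch, not rasdaemon's C code: the function name `decode_mci_status` and the returned dictionary keys are hypothetical, and it assumes the S/AR state is only meaningful when UC is set.

```python
# Bit positions as listed in the status-bit table above.
BIT_VAL, BIT_OVER, BIT_UC, BIT_EN = 63, 62, 61, 60
BIT_PCC, BIT_S, BIT_AR = 57, 56, 55

# Indexed by (S << 1) | AR, matching the arstate[] ordering shown above.
AR_STATES = ("UCNA", "AR", "SRAO", "SRAR")


def _bit(value: int, n: int) -> bool:
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mci_status(status: int) -> dict:
    """Decode the architectural flags of a 64-bit MCi_STATUS value.

    Hypothetical helper for illustration; key names are not from rasdaemon.
    """
    if not _bit(status, BIT_VAL):
        return {"valid": False}
    decoded = {
        "valid": True,
        "overflow": _bit(status, BIT_OVER),
        "uncorrected": _bit(status, BIT_UC),
        "enabled": _bit(status, BIT_EN),
        "pcc": _bit(status, BIT_PCC),
    }
    # Assumption: report the S/AR state only for uncorrected errors.
    if decoded["uncorrected"]:
        decoded["ar_state"] = AR_STATES[(status >> BIT_AR) & 0x3]
    return decoded
```

For example, a value with VAL, UC, S and AR set decodes to the SRAR state, while a value with VAL clear is reported as invalid without looking at the other bits.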

1.2 Special banks: thermal / timeout

mce-intel.c:152-192

Bank Condition Description System impact
128+0 (THERMAL) bit 0 set “Processor N heated above trip temperature. Throttling enabled” CPU throttled; performance impact; “please check system cooling”
128+0 (THERMAL) bit 0 clear “below trip temperature. Throttling disabled” back to normal
128+90 (TIMEOUT) “Timeout waiting for exception on other CPUs” a remote CPU hit a fatal MCE but did not report it fatal

1.3 Intel per-generation platform error codes

1.3.1 Nehalem internal errors (mce-intel-nehalem.c:75-83)

MSC Name Impact
0x00 No Error
0x03 Reset firmware did not complete boot failure
0x08 Received an invalid CMPD packet corrupted
0x0A Invalid Power Management Request invalid power-management request
0x0D Invalid S-state transition sleep state machine anomaly
0x11 VID controller does not match POC controller selected voltage controller mismatch
0x1A MSID from POC does not match CPU MSID platform ID mismatch

1.3.2 Sandy Bridge PCU bank 4 (mce-intel-sb.c:16-46)

MSC Name
0x0D MC_IMC_FORCE_SR_S3_TIMEOUT
0x0E MC_MC_CPD_UNCPD_ST_TIMEOUT
0x0F MC_PKGS_SAFE_WP_TIMEOUT
0x43 MC_PECI_MAILBOX_QUIESCE_TIMEOUT
0x5C MC_MORE_THAN_ONE_LT_AGENT
0x60…0x64 MC_INVALID_PKGS_REQ_PCH/QPI/RES/PKGC_RES_PCH/STATE_CONFIG
0x70…0x72 MC_WATCHDG_TIMEOUT_PKGC_SLAVE/MASTER/PKGS_MASTER
0x7A MC_HA_FAILSTS_CHANGE_DETECTED
0x81 MC_RECOVERABLE_DIE_THERMAL_TOO_HOT

1.3.3 Sandy Bridge / IVB memory controller, banks 8-11 (mce-intel-sb.c:54-62)

Bit mask Name
0x001 Address parity
0x002 HA Wrt buffer data parity
0x004 HA Wrt byte enable parity
0x008 Corrected patrol scrub
0x010 Uncorrected patrol scrub
0x020 Corrected spare
0x040 Uncorrected spare

IVB additionally has 0x080 “Corrected memory read error” / 0x100 “iMC, WDB, parity errors” (mce-intel-ivb.c:59-69).

1.3.4 Haswell QPI banks 5/20/21 (mce-intel-haswell.c:69-83)

MSC Name
0x02 QPI physical layer detected drift buffer alarm
0x03 QPI physical layer detected latency buffer rollover
0x10 QPI link layer detected control error from R3QPI
0x11 Rx entered LLR abort state on CRC error
0x12 Unsupported or undefined packet
0x13 QPI link layer control error
0x15 RBT used un-initialized value
0x20 QPI in-band reset but aborted initialization
0x21 Link failover data self healing
0x22 Phy detected in-band reset (no width change)
0x23 Link failover clock failover
0x30 Rx detected CRC error - successful LLR after Phy re-init
0x31 Rx detected CRC error - successful LLR without Phy re-init

1.3.5 Skylake UPI banks 5/12/19 (mce-intel-skylake-xeon.c:65-102)

MSC Name Class
0x00 UC Phy Initialization Failure UC
0x01 UC Phy detected drift buffer alarm UC
0x02 UC Phy detected latency buffer rollover UC
0x10 UC LL Rx detected CRC error: unsuccessful LLR UC
0x11 UC LL Rx unsupported or undefined packet UC
0x12 UC LL or Phy control error (further broken down in upi_0x12[]) UC
0x13 UC LL Rx parameter exchange exception UC
0x1F UC LL detected control error from link-mesh interface UC
0x20 COR Phy initialization abort CE
0x21 COR Phy reset CE
0x22 COR Phy lane failure, recovery in x8 width CE
0x23 COR Phy L0c error corrected without Phy reset CE
0x24 COR Phy L0c error triggering Phy Reset CE
0x25 COR Phy L0p exit error corrected with Phy reset CE
0x30 COR LL Rx detected CRC error - successful LLR without Phy Reinit CE
0x31 COR LL Rx detected CRC error - successful LLR with Phy Reinit CE

1.3.6 Skylake M2M banks 7/8 (mce-intel-skylake-xeon.c:145-155)

MSC Name
16 MscodDataRdErr
18 MscodPtlWrErr
19 MscodFullWrErr
20 MscodBgfErr
21 MscodTimeout
22 MscodParErr
23 MscodBucket1Err

1.3.7 Icelake iMC: 6 code pages (mce-intel-i10nm.c:155-246)

Main page 0 (imc_0[]):

MSC Name
0x01 Address parity error
0x02 Data parity error
0x03 Data ECC error
0x07 Transaction ID parity error
0x08 Corrected patrol scrub error
0x10 Uncorrected patrol scrub error
0x20 Corrected spare error
0x40 Uncorrected spare error
0x80 Corrected read error
0xA0 Uncorrected read error
0xC0 Uncorrected metadata

Page 2 (imc_2[]): DDR4/HBM command/address parity. Page 8 (imc_8[]): DDR-T scheduler/CMI/TME error set.

1.3.8 Granite Rapids MCCHAN (mce-intel-granite.c:107-157)

7 pages, covering banks 13-24:

Page Typical errors
0 Address Parity, CMI Wr data/BE/MAC parity, Patrol/Spare, Demand/Underfill Read errors, Poison read, Read 2LM MetaData
1 WDB Read Parity/ECC/BE, DDR Link Fail, Illegal opcode
2 DDR CAParity or WrCRC
4 Scheduler address parity
8 MC Internal Errors, MCTracker Address RF parity
32, 33 sCH1 copies of pages 0/1

1.3.9 Cross-generation: PCU MCA 0x402/0x403/0x406/0x407

Present on all Intel CPUs from Haswell onward:

MCA Name
0x402 PCU Internal Errors
0x403 Other UBOX / VCU Internal Errors
0x406 Intel TXT Errors
0x407 Other UBOX Internal Errors

1.4 AMD K8 northbridge extended errors (mce-amd-k8.c:71-91)

bank 4 (Northbridge) status[16:19]

ExtErr Name Severity System impact
0 RAM ECC error CE/UE DRAM data error
1 CRC error CE link CRC, retryable
2 Sync error UE link protocol error
3 Master abort UE no response on the bus
4 Target abort UE target device error
5 GART error CE graphics address remapping table (ignored on K8)
6 RMW error CE read-modify-write failure
7 Watchdog error UE link watchdog
8 RAM Chipkill ECC error CE multi-symbol DRAM ECC
9 DEV Error UE device error
10 Link Data Error CE HT link data
11 Link Protocol Error UE HT protocol error
12 NB Array Error CE northbridge array
13 DRAM Parity Error UE DRAM parity
14 Link Retry CE link retry
15 Table Walk Data Error UE page-table walk data
16 L3 Cache Data Error CE L3 data
17 L3 Cache Tag Error CE L3 tag
18 L3 Cache LRU Error CE L3 LRU bits

K8 generic patterns (decode_k8_generic_errcode()):

  • (ec & 0xfff0) == 0x0010 → TLB error (TT + Level)
  • (ec & 0xff00) == 0x0100 → Memory/Cache (memtrans + tx + level)
  • (ec & 0xf800) == 0x0800 → Bus (PP + TO + memtrans + level)

K8 status high bits: highbits[] includes bit 29 UC, bit 25 PCC, bit 14 CE, bit 13 UE, and bit 8 (error found by scrub).

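1.5 AMD SMCA error families (Zen and later)

SMCA uses smca_hwid_mcatypes[] (mce-amd-smca.c:730-804) to route 24 bank types to their tables. The tables below summarize every SMCA bank and its XEC range.

1.5.1 SMCA_LS / LS_V2: Load Store Unit

Type XEC range Representative errors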
LS v1 0…26 Load queue parity, Store queue parity, Miss address buffer payload parity, L1 TLB parity, DC Tag error 1-6, Internal error 1-2, Sys Read data error, DC data error 1-3 (poison consumption), L2 TLB parity, PDC parity, L2 fill data, SCB cache state/address/data error, WCB SystemReadDataError, Hardware Asserts
LS v2 0…23 as above + ECC/poison on probe/victim/load/RMW store; EMEM read ECC; TLB1/2/PWC/STQ/LDQ/MAB/SCB entry/WCB/SRB/EMEM data-mask parity; poisoned line in SCB

1.5.2 SMCA_IF: Instruction Fetch (XEC 0…18)

microtag probe port parity, IC microtag/full tag multi-hit, IC full tag/data array parity, PRQ parity, L0/L1/L2 ITLB parity, BPQ snoop parity thread 0/1, BP L1/L2 BTB multi-hit, L2 Cache Response Poison, System Read Data error, Hardware Assertion, L1/L2-TLB Multi-Hit, BSR Parity, CT MCE。

1.5.3 SMCA_L2_CACHE(XEC 0…5)

L2M Tag Multiple-Way-Hit, L2M Tag/State Array ECC, L2M Data Array ECC, Hardware Assert, SDP Read Response Parity, programmable state machine error。

1.5.4 SMCA_DE — Decoder(XEC 0…9)

μop cache tag/data array parity, IBB Register File, μop queue, instruction dispatch queue, fetch address FIFO, Patch RAM data/sequencer, μop fetch queue, Hardware Assertion MCA。

1.5.5 SMCA_EX — Execution(XEC 0…13)

Watchdog timeout, PRF parity, flag register parity, immediate displacement register parity, AG payload, EX payload, checkpoint queue, retire dispatch/status queue, scheduler queue, branch buffer queue, Hardware Assertion, Spec/Retire Map parity。

1.5.6 SMCA_FP — Floating Point(XEC 0…7)

PRF parity, Freelist parity, schedule queue, NSQ, retire queue, status register file, hardware assertion, physical K mask register file。

1.5.7 SMCA_L3_CACHE(XEC 0…9)

Shadow tag macro ECC/multi-way-hit, L3M tag ECC/multi-way-hit, L3M data ECC, SDP Parity from XI, L3 victim queue Data Fabric error, Hardware Assertion, XI WCB Parity Poison Creation, DSM action。

1.5.8 SMCA_CS / CS_V2 / CS_V2_QUIRK — Coherent Slave
Variant XEC Errors
CS v1 0…8 Illegal request, Address violation, Security violation, Illegal response, Unexpected response, Request/Probe Parity, Read Response Parity, Atomic request parity, Probe Filter ECC
CS v2 0…20 CS v1 + SDP read response no match/Unexpected RETRY, Counter over/underflow, Illegal/Address/Security violation on the no-data channel, Hardware Assert, Shadow Tag Protocol/ECC/Transaction Error
CS v2 QUIRK (Genoa erratum 1384) 0…17 CS v2 reordered, with the shadow-tag entries removed

1.5.9 SMCA_PIE (XEC 0…8)

Hardware assert, Register security violation, Link error, Poison data consumption, deferred error detected in DF, Watch Dog Timer, SRAM ECC in CNLI, Register access during DF Cstate, DSM Error。

1.5.10 SMCA_UMC / UMC_QUIRK / UMC_V2 — Unified Memory Controller
Variant XEC Errors
UMC 0…16 DRAM ECC, Data poison on DRAM, SDP parity, Advanced peripheral bus error, Command/address parity, Write data CRC, DCQ SRAM ECC, AES SRAM ECC, ECS Row Error, ECS Error, UMC Throttling, Read CRC, RFM SRAM ECC
UMC_QUIRK (Turin X3D) 0…15 DRAM On Die ECC, Data poison, SDP parity, Address/Command parity, HBM Write data parity, Consolidated SRAM ECC, Rdb SRAM ECC, Thermal throttling, HBM Read Data Parity, UMC FW Error, SRAM Parity, HBM CRC
UMC_V2 (HBM) 0…11 DRAM ECC, Data poison, SDP parity, Address/Command parity, Write data parity, DCQ SRAM ECC, Read data parity, Rdb SRAM ECC, RdRsp SRAM ECC, LM32 MP errors

1.5.11 SMCA_MA_LLC (XEC 0…6)

Counter over/underflow, Write Data Parity, Read Response Parity, Cache Tag ECC Macro 0/1, Cache Data ECC。

1.5.12 SMCA_PB(XEC 0)

Parameter Block RAM ECC。

1.5.13 SMCA_PSP / PSP_V2
Variant XEC Errors
PSP v1 0 PSP RAM ECC/parity
PSP v2 0…25 High/Low SRAM ECC/parity, Instruction Cache Bank 0/1 ECC/parity, Instruction Tag Ram 0/1 parity, Data Cache Bank 0…3 ECC/parity, Data Tag Bank 0…3 parity, Dirty Data Ram parity, TLB Bank 0/1 parity, System Hub Read Buffer ECC/parity, FUSE IP SRAM ECC/parity, PCRU FUSE SRAM ECC/parity, SIB SRAM parity, mpASP SECEMC, mpASP A5 Hang, SIB WDT

1.5.14 SMCA_SMU / SMU_V2

Variant XEC Errors
SMU v1 0 SMU RAM ECC/parity
SMU v2 0…11, 58…61 High/Low SRAM ECC, Data Cache/Tag Bank A/B, Instruction Cache/Tag Bank A/B, System Hub Read Buffer, PHY RAS ECC + GFX Sub-IP CE/fatal/poison/other

1.5.15 SMCA_MP5 (XEC 0…10)

High/Low SRAM ECC, Data Cache/Tag Bank A/B, Instruction Cache/Tag Bank A/B, Fuse SRAM ECC。

1.5.16 SMCA_MPDMA(XEC 0…50)

Main SRAM [31:0]/[63:32]/[95:64]/[127:96] bank ECC/parity;Data/Instruction Cache/Tag Bank A/B;System Hub Read Buffer ECC;MPDMA TVF DVSEC Memory ECC;TVF MMIO Mailbox0/1;Doorbell Memory;SDP Slave/Master Memory 0…7;SDP Watchdog Timer;PTE Command FIFO;Hub Data FIFO;Internal Data FIFO;Command Memory DMA/Internal。

1.5.17 SMCA_NBIO(XEC 0…5)

ECC/Parity, PCIE error, External SDP ErrEvent, SDP Egress Poison, Internal Poison, Internal system fatal。

1.5.18 SMCA_PCIE / PCIE_V2
Variant XEC Errors
PCIE 0…4 CCIX PER Message logging, CCIX Read/Write Response Non-Data Error, CCIX Read Response Data Error, CCIX Non-okay write response with data error
PCIE v2 0 SDP Data Parity Error

1.5.19 SMCA_XGMI_PCS (XEC 0…28)

Data Loss, Training, Flow Control Acknowledge, Rx Fifo Underflow/Overflow, CRC, BER Exceeded, Tx Vcid Data, Replay Buffer Parity, Data Parity, Replay Fifo Overflow/Underflow, Elastic Fifo Overflow, Deskew, Flow Control CRC, Data Startup Limit, FC Init Timeout, Recovery Timeout, Ready Serial Timeout/Attempt, Recovery Attempt/Relock Attempt, Replay Attempt, Sync Header, Tx/Rx Replay Timeout, LinkSub Tx/Rx Timeout, Rx CMD Pocket。

1.5.20 SMCA_XGMI_PHY / WAFL_PHY / GMI_PHY(XEC 0…3)

RAM ECC, ARC instruction buffer parity, ARC data buffer parity, PHY APB error。

1.5.21 SMCA_NBIF / SHUB(XEC 0…3)

Timeout from GMI, SRAM ECC, NTB Error Event, SDP Parity。

1.5.22 SMCA_SATA(XEC 0…7)

Port 0…7 parity errors.

1.5.23 SMCA_USB(XEC 0…5)

S0 RAM0/1/2, PHY RAM0/1 parity/ECC, AXI Slave Response errors.

1.5.24 SMCA_USR_DP(XEC 0…30)

Mst CMD/Rx FIFO/Deskew/Detect Timeout/FlowControl/DataValid FIFO, Mac LinkState, Deskew, Init Timeout/Attempt, Recovery Timeout/Attempt, Eye Training Timeout, Data Startup Limit, LS0 Exit, PLL powerState Update Timeout, Rx FIFO, Lcu, Conv CECC/UECC, Rx DataLoss, Replay CECC/UECC, CRC, BER Exceeded, FC Init Timeout/Attempt, Replay Timeout/Attempt, Replay Underflow/Overflow.

1.5.25 SMCA_USR_CP(XEC 0…12)

Packet Type, Rx FIFO, Deskew, Rx Detect Timeout, Data Parity, Data Loss, Lcu, HB1/HB2 Handshake Timeout, Clk Sleep/Wake Rsp Timeout, Reset Attack, Remote Link Fatal。

1.5.26 SMCA_GMI_PCS(XEC 0…31)

Data Loss, Training, Replay Parity, Rx Fifo Underflow/Overflow, CRC, BER Exceeded, Tx Fifo Underflow, Replay Buffer Parity, Tx Overflow, Replay Fifo Overflow/Underflow, Elastic Fifo Overflow, Deskew, Offline, Data Startup Limit, FC Init Timeout, Recovery Timeout, Ready Serial Timeout/Attempt, Recovery Attempt, Recovery Relock Attempt, Deskew Abort, Rx Buffer, Rx LFDS Fifo Overflow/Underflow, LinkSub Tx/Rx Timeout, Rx CMD Packet, LFDS Training/FC Init Timeout, Data Loss。

1.5.27 AMD generic error classification (decode_amd_errcode(), mce-amd.c:68-123)

Pattern Meaning Classification
UC + PCC Processor context corrupt System Fatal error
UC + RIPV Restartable Uncorrected, software restartable
UC only Containable Uncorrected, software containable
DEFERRED platform deferred error Deferred error, no action required
other Corrected Corrected error, no action required
POISON read of already-poisoned data Poison consumed
TCC Task context corrupt Task_context_corrupt
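
The first five rows of this table form a decision chain. The following minimal Python sketch follows the table, not the C source: `classify_amd` is a hypothetical name, the UC/PCC/Deferred bit positions are the ones listed in section 1.1, and RIPV is assumed to be bit 0 of the MCG_STATUS value that accompanies the record.

```python
MCI_STATUS_UC = 1 << 61        # uncorrected
MCI_STATUS_PCC = 1 << 57       # processor context corrupt
MCI_STATUS_DEFERRED = 1 << 44  # AMD deferred error
MCG_STATUS_RIPV = 1 << 0       # assumption: RIPV is bit 0 of MCG_STATUS


def classify_amd(status: int, mcgstatus: int) -> str:
    """Return the classification text for an AMD MCE record.

    Hypothetical helper mirroring the table rows; it does not model
    the POISON and TCC annotations.
    """
    if status & MCI_STATUS_UC:
        if status & MCI_STATUS_PCC:
            return "System Fatal error"
        if mcgstatus & MCG_STATUS_RIPV:
            return "Uncorrected, software restartable"
        return "Uncorrected, software containable"
    if status & MCI_STATUS_DEFERRED:
        return "Deferred error, no action required"
    return "Corrected error, no action required"
```

The order of the checks matters: PCC is tested before RIPV, so a record with both set is still reported as fatal.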

Generic pattern macros (mce-amd.c):

  • INT_ERROR(x) = ((x) & 0xF4FF) == 0x0400 → “Internal ‘reserved/hardware assert/reserved/reserved’”
  • TLB_ERROR(x) = ((x) & 0xFFF0) == 0x0010 → TLB tx/level
  • MEM_ERROR(x) = ((x) & 0xFF00) == 0x0100 → Memory mem-tx/tx/level
  • BUS_ERROR(x) = ((x) & 0xF800) == 0x0800 → Bus PP/TO/mem-tx/level
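
The four mask tests above translate directly into a classifier. This is a minimal Python sketch: the function name `classify_errcode` and the returned labels are hypothetical, and checking the internal-error pattern first is an assumption about ordering (some codes, such as 0x0C00, match both the INT and BUS masks).

```python
def classify_errcode(ec: int) -> str:
    """Classify a 16-bit AMD MCA error code with the four mask tests.

    Hypothetical helper; the labels are illustrative, not rasdaemon output.
    """
    ec &= 0xFFFF
    if (ec & 0xF4FF) == 0x0400:   # INT_ERROR
        return "internal"
    if (ec & 0xFFF0) == 0x0010:   # TLB_ERROR
        return "tlb"
    if (ec & 0xFF00) == 0x0100:   # MEM_ERROR
        return "memory"
    if (ec & 0xF800) == 0x0800:   # BUS_ERROR
        return "bus"
    return "unknown"
```

For instance, 0x0013 falls in the TLB class and 0x0135 in the memory class, while 0x0000 matches none of the masks.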

UMC location decoding (decode_smca_error()): Family 0x19, Model 0x90…0x9f → memory_die_id; otherwise channel = IPID >> 20, csrow = synd & 0x7.

1.6 Zhaoxin KH-50000 error codes (mce-zhaoxin-kh50000.c)

Bank type Sub-error source Code Name
CPU bank status[25:29] 0,1 Unknown
CPU bank 2 Machine hung error (fatal)
CPU bank 3 Undefined ucode address error
Cache (PL2 / LLC) status[24:25] / status[25:26] 0 Unknown
Cache 1 ECC single bit error for data part in the same line
Cache 2 ECC single bit error for different line
Cache 3 ECC multi bit error for data part
PCIe status[16:23] 0 Fatal
PCIe 1 Non-fatal
PCIe 2 Correctable
IOD ZDI / ZPI XEC 0…23 Receiver overflow (TL), FC protocol, Surprise down, DLL protocol, Replay timer timeout, REPLAY_NUM Rollover, Bad DLLP, Bad TLP, Receiver error (PHY), Phy training, Link-width down-mode (X32X24/X16X12/X8/X4/X2), Link-speed down-mode (GEN4/GEN3/GEN2)
CCD ZDI XEC 0…17 Receive overflow, PHY training, FC protocol, Surprise down, DLLM protocol, DLLM replay timeout, DLLM replay number rollover, Bad DLLP, Bad TLP, Gen2/3/4 unreliable, X2/X4/X8/X16X12/X32X24 unreliable
SVID status[24:31] 0 No error
SVID 1 SVID Resend fail
SVID 2 VRM Over current
SVID 3 VRM Over temp
SVID 4 VRM Parity
DRAM mca_err_code 0x1 DVAD Error
DRAM 0x5 Parity Error
DRAM mem_err_code 0 Generic undefined request
DRAM 1 Memory read
DRAM 2 Memory write
DRAM 3 Address/Command
DRAM 4 Memory scrubbing
DRAM 5 Data poison enable (DRAM, master normal read)
DRAM 6 Data poison enable (DRAM, patrol read)
DRAM 7 Key hit error
DRAM mem_specific 0 Unknown
DRAM 1 Single bit ECC
DRAM 2 Multiple bit ECC
DRAM 3 Command parity
DRAM 4 CRC
DRAM 5 Parity retry failed
DRAM 6 CRC retry failed
DRAM 7…15 DVAD decode (CPUIF CHA0/CHB0/CHA1/CHB1, MCUTRF, GMINT…)
HIF (CXL) status[16:23] 0 Unknown
HIF 1 HIF dvad error
HIF 2 SNT multi bit ecc
HIF 3 SNT single bit ecc
HIF 4 CXL decpoison uc
HIF 5 CXL decpoison ce
HIF 6 CXL parity

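1.7 System-impact classification of MCEs

Class Condition Actual system behavior
Cache (CE) RR=10, UC=0 transparently corrected; not noticeable to users; green tracking; repeated hits on the same line → yellow warning
Cache (UC) UC=1 #MC → the kernel kills the process or panics
TLB (UC + PCC) 0x001x context corrupted → panic
Bus / interconnect (UC) 0x08xx..0x0Fxx may corrupt the state of other cores; almost always fatal
Internal (UC) 0x0400 / 0x04xx unclassified internal silicon error; the AMD SMCA “Hardware Assert” spans many banks
Memory (CE) 0x008x..0x00Fx / 0x0100 single-bit DRAM correction, transparent
Memory (UE) UC=1 kernel MCE → SIGBUS / kill; may trigger a mirror failover
Deferred (AMD) bit 44 platform-deferred; only surfaces when consumed
Poison Consumed (AMD) bit 43 CPU read poisoned data → the kernel poisons the page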

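2. EDAC/MC (Memory Controller) errors

2.1 Error type enumeration (enum hw_event_mc_err_type, ras-events.h:78-84)

Enum Value Text Log level System impact rasdaemon behavior
HW_EVENT_ERR_CORRECTED 0 Corrected LOG_ERR single-bit DRAM correction, transparent; a CE storm is an early warning of a UC insert into SQLite + rate statistics + page PFA (optional) + trigger MC_CE_TRIGGER
HW_EVENT_ERR_UNCORRECTED 1 Uncorrected LOG_CRIT data corruption → MCE / SIGBUS insert into SQLite + trigger MC_UE_TRIGGER; no automatic retirement
HW_EVENT_ERR_DEFERRED 2 Deferred LOG_CRIT data corruption uncertain (possibly not consumed) insert into SQLite; no trigger script
HW_EVENT_ERR_FATAL 3 Fatal LOG_EMERG system integrity compromised insert into SQLite; no retirement
HW_EVENT_ERR_INFO 4 Info LOG_DEBUG no actual error (init/scrub) insert into SQLite, no other action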
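
The enum-to-text and enum-to-log-level columns can be captured in a small lookup. The sketch below is illustrative Python, not the C implementation: `McErrType` and `describe_mc_error` are hypothetical names, and the priorities use the constants from Python's standard `syslog` module (Unix only).

```python
import enum
import syslog


class McErrType(enum.IntEnum):
    """Values as listed in the enum table above."""
    CORRECTED = 0
    UNCORRECTED = 1
    DEFERRED = 2
    FATAL = 3
    INFO = 4


# Text and syslog priority per error type, copied from the table.
_MC_TABLE = {
    McErrType.CORRECTED: ("Corrected", syslog.LOG_ERR),
    McErrType.UNCORRECTED: ("Uncorrected", syslog.LOG_CRIT),
    McErrType.DEFERRED: ("Deferred", syslog.LOG_CRIT),
    McErrType.FATAL: ("Fatal", syslog.LOG_EMERG),
    McErrType.INFO: ("Info", syslog.LOG_DEBUG),
}


def describe_mc_error(err_type: int):
    """Return (text, syslog priority); raises ValueError for unknown values."""
    return _MC_TABLE[McErrType(err_type)]
```

Rejecting unknown values with an exception is a choice made for this sketch; how the real handler treats out-of-range values is not covered here.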

2.2 Location granularity

ras-mc-handler.c:240-264 parses the (top_layer, middle_layer, lower_layer) triple, which covers:

  • chip-select / row / bank / DIMM / channel
  • the exact meaning depends on the EDAC driver (i10nm, amd64, skx, sb-edac, etc.)

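2.3 Page / row retirement mechanism (ras-page-isolation.c)

Policy Default Trigger condition sysfs write
Page CE PFA soft offline CE count for a 4 KiB page ≥ PAGE_CE_THRESHOLD (default 50) within the PAGE_CE_REFRESH_CYCLE (24h) window /sys/devices/system/memory/soft_offline_page
Row CE PFA off CE count for a whole row ≥ ROW_CE_THRESHOLD (default 50) same as above
Mode soft `PAGE_CE_ACTION=off account

Key fact: rasdaemon does not call hwpoison directly; it only writes to sysfs, and the kernel's mm/memory-failure.c performs the actual page poisoning.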
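
The threshold-within-a-window idea can be sketched as follows. This is a simplified Python illustration, not a port of ras-page-isolation.c: `PageCeTracker`, its parameters, and the fixed-window reset are assumptions made for the example, and the sysfs writer and clock are injectable so the logic can be exercised without root privileges.

```python
import time

SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"
PAGE_SIZE = 4096  # 4 KiB pages, as in the table above


def write_sysfs(path: str, value: str) -> None:
    """Default writer: ask the kernel to act by writing to a sysfs file."""
    with open(path, "w") as f:
        f.write(value)


class PageCeTracker:
    """Hypothetical per-page CE counter with a fixed refresh window."""

    def __init__(self, threshold=50, cycle_s=24 * 3600,
                 writer=write_sysfs, clock=time.monotonic):
        self.threshold = threshold
        self.cycle_s = cycle_s
        self.writer = writer
        self.clock = clock
        self._pages = {}        # page address -> (window start, count)
        self.offlined = set()   # pages already requested offline

    def record_ce(self, phys_addr: int) -> bool:
        """Count one CE; return True if an offline request was issued."""
        page = phys_addr & ~(PAGE_SIZE - 1)
        if page in self.offlined:
            return False
        now = self.clock()
        start, count = self._pages.get(page, (now, 0))
        if now - start > self.cycle_s:
            start, count = now, 0   # window expired: start counting afresh
        count += 1
        self._pages[page] = (start, count)
        if count >= self.threshold:
            # Only a request: the kernel performs the actual soft offline.
            self.writer(SOFT_OFFLINE, f"{page:#x}")
            self.offlined.add(page)
            del self._pages[page]
            return True
        return False
```

In this sketch the only side effect is the sysfs write; migrating and poisoning the page is left to the kernel, which matches the division of labor described above.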


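3. PCIe AER error codes

ras-aer-handler.c:27-60 defines aer_cor_errors[] and aer_uncor_errors[] (same origin, same bit numbering). bitfield_msg() translates the status register into text.

3.1 Correctable Errors (CORR_ERR_STATUS)

Bit Name System impact AER handling
0 Receiver Error PHY-level error with recovery, link retrain; no data loss automatic
6 Bad TLP TLP CRC/ECRC error; dropped and replayed; throughput drops if frequent automatic replay
7 Bad DLLP DLLP CRC error; retried automatic retry
8 REPLAY_NUM Rollover repeated TLP replays; poor signal integrity automatic
12 Replay Timer Timeout TLP retransmission timed out; link retrain; brief stall automatic retrain
13 Advisory Non-Fatal non-fatal with vendor-specific semantics automatic
14 Corrected Internal Error device-internal ECC correction automatic
15 Header Log Overflow diagnostic information lost automatic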
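
Turning a status register into the names in this table is a plain bit scan. The following minimal Python sketch illustrates the idea; `decode_bits` and the `BIT<n>` fallback label are hypothetical and are not the output format of bitfield_msg().

```python
# Bit number -> name, copied from the correctable-error table above.
AER_COR_ERRORS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_bits(status: int, names: dict) -> list:
    """Return the names of all set bits, lowest bit first.

    Bits without a table entry are reported as 'BIT<n>' (an assumption
    made for this sketch).
    """
    found = []
    for bit in range(status.bit_length()):
        if (status >> bit) & 1:
            found.append(names.get(bit, f"BIT{bit}"))
    return found
```

A status of 0x41, for example, has bits 0 and 6 set and therefore decodes to a Receiver Error plus a Bad TLP. The same function works for the uncorrectable table if given that table's bit-to-name mapping.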

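3.2 Uncorrectable Non-Fatal (UNCORR_ERR_STATUS)

Bit Name System impact
4 Data Link Protocol abnormal L0→Recovery/Detect link transition; transactions retried; may hang
5 Surprise Link Down device hot-removed or physically disconnected; all in-flight transactions aborted; downstream devices disappear
12 Poisoned TLP TLP carries the EP bit (already corrupted); DMA writes report failure to the requester; read completion data is dropped; possible EIO / page poisoning
13 Flow Control Protocol receiver FC violation; retrain needed; may stall
14 Completion Timeout requester timed out waiting for a completion; UR returned; device may hang
15 Completer Abort target cannot complete; CA completion; caller gets EIO
16 Unexpected Completion completion does not match any outstanding request; switch / endpoint bug
17 Receiver Overflow receiver buffer overflow; FCP / DLLP error; possibly fatal
18 Malformed TLP malformed TLP structure; transaction fails
19 ECRC end-to-end CRC failure; in-flight data corrupted; strong indication of a PHY problem
20 Unsupported Request device does not support the request type; UR completion
21 ACS Violation Access Control Services blocked a P2P transaction; security feature triggered
22 Uncorrected Internal device-internal uncorrectable error; device may go silent
23 MC Blocked TLP multicast TLP blocked by MC rules; functionally a no-op
24 AtomicOp Egress Blocked atomic operation blocked at egress; request fails
25 TLP Prefix Blocked vendor prefix dropped; request fails
26 Poisoned TLP Egress Blocked switch refuses to forward a poisoned TLP; requester receives UR/CA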

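3.3 Uncorrectable Fatal

rasdaemon does not maintain a separate fatal table; it uses the same aer_uncor_errors[] bit names as for non-fatal errors. Whether an uncorrectable error is fatal is decided on the kernel side, from the device's Uncorrectable Error Severity register.

Typical fatal categories (PCIe specification): Training Error, Link Below Speed, Data Link Protocol Error, Surprise Down, Receiver Overflow, Uncorrected Internal.

3.4 rasdaemon AER behavior

  1. Log to syslog (severity mapping: CE→LOG_ERR, UE Non-Fatal→LOG_CRIT, UE Fatal→LOG_EMERG)
  2. Decode the status bits, plus the TLP header if present
  3. Look up vendor/device names via libpci
  4. Write to SQLite
  5. ABRT report
  6. Forward to OpenBMC Unified SEL (IPMI OEM SEL)
  7. Ampere BMC OEM SEL
  8. Trigger the AER_CE_TRIGGER / AER_UE_TRIGGER user scripts

Note: rasdaemon does not trigger AER recovery or a slot reset; that is the job of the kernel AER driver (drivers/pci/pcie/aer.c in current kernels, drivers/pci/pcie/aer/aerdrv.c in older trees).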


4. ARM error codes

rasdaemon only consumes the UEFI 2.9 § N.2.4.4 Processor Error Section (PEI); it does not handle the dedicated APEI-GHES sections for SMMU / GIC / CCI / CCN.

4.1 Error type bits (ras-arm-handler.c:36-40)

Name Bit Category
ARM_CACHE_ERROR BIT(1) cache error
ARM_TLB_ERROR BIT(2) TLB error
ARM_BUS_ERROR BIT(3) bus error
ARM_VENDOR_ERROR BIT(4) vendor-defined (bypasses standard decoding)

4.2 Syndrome fields (PEI error_info, 64-bit)

Field Bits Values
Transaction type 16-17 Instruction / Data Access / Generic
Operation type 18-21 Cache = 11 kinds / TLB = 9 kinds / Bus = 7 kinds
Level 22-24 0-7 (cache level / TLB level / bus affinity level)
Proc context corrupt 25 bool
Corrected 26 bool
Precise PC 27 bool
Restartable PC 28 bool
Participation type 29-30 Local-originated/Responded/Observed/Generic
Time-out 31 bool
Address space 32-33 External / Internal / Unknown / Device Memory Access
Mem attributes 34-42 9-bit raw value
Access mode 43 Normal / Secure
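
Extracting these fields is a matter of shifting and masking. The sketch below is a minimal Python illustration that follows the bit ranges in the table as written; `decode_error_info` and its key names are hypothetical, and it ignores the fact that which fields are meaningful depends on the error type and on validity bits.

```python
def _field(value: int, lo: int, hi: int) -> int:
    """Extract the inclusive bit range [lo, hi] from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_error_info(info: int) -> dict:
    """Split a 64-bit PEI error_info value into the fields listed above.

    Hypothetical helper: raw numeric values are returned; mapping them
    to names depends on the error type (cache / TLB / bus).
    """
    return {
        "transaction_type": _field(info, 16, 17),
        "operation_type": _field(info, 18, 21),
        "level": _field(info, 22, 24),
        "proc_context_corrupt": bool(_field(info, 25, 25)),
        "corrected": bool(_field(info, 26, 26)),
        "precise_pc": bool(_field(info, 27, 27)),
        "restartable_pc": bool(_field(info, 28, 28)),
        "participation_type": _field(info, 29, 30),
        "timeout": bool(_field(info, 31, 31)),
        "address_space": _field(info, 32, 33),
        "mem_attributes": _field(info, 34, 42),
        "access_mode": _field(info, 43, 43),
    }
```

Keeping the bit ranges in one helper makes it easy to compare the code against the table row by row.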

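4.3 Flags

Flag Bit Meaning
First error 0 first of multiple errors
Last error 1 last of multiple errors
Propagated 2 error propagated from its source
Overflow 3 overflow

4.4 Severity (GHES_SEV_*, ras-events.h:94-99)

Value Name Text System impact
0 GHES_SEV_NO Informational no impact
1 GHES_SEV_CORRECTED Corrected corrected; CPU error count++ (HAVE_CPU_FAULT_ISOLATION)
2 GHES_SEV_RECOVERABLE Recoverable recovered; same as above
3 GHES_SEV_PANIC Fatal fatal; logged only

4.5 Vendor-specific decoding

Only the Ampere path is enabled: decode_amp_payload0_err_regs() (non-standard-ampere.c, guarded by HAVE_AMP_NS_DECODE). Other vendor blobs go to display_raw_data() and are printed as raw hexadecimal.

4.6 Behavior and limitations

  • Writes SQLite and ABRT records
  • With HAVE_CPU_FAULT_ISOLATION and sev ∈ {CORRECTED, RECOVERABLE}: calls count_errors() and feeds the count to ras_record_cpu_error()
  • Does not panic, kill processes, isolate pages, or offline CPUs on its own; all of that is done by the kernel before rasdaemon sees the event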

5. CXL error codes

5.1 Event severity (ras-cxl-handler.c:492-498, CXL 3.0 §8.2.9.2.2 Table 8-49)

Name Value
CXL_EVENT_TYPE_INFO 0x00
CXL_EVENT_TYPE_WARN 0x01
CXL_EVENT_TYPE_FAIL 0x02
CXL_EVENT_TYPE_FATAL 0x03

5.2 AER Uncorrectable (ras-cxl-handler.c:274-317)

Bit Name Component System impact
0 CXL_AER_UE_CACHE_DATA_PARITY CXL.cache link link data parity error; may poison
1 CXL_AER_UE_CACHE_ADDR_PARITY link misaddressing
2 CXL_AER_UE_CACHE_BE_PARITY link partial-write corruption
3 CXL_AER_UE_CACHE_DATA_ECC device cache uncorrectable ECC; cacheline poison propagates to the host
4 CXL_AER_UE_MEM_DATA_PARITY CXL.mem path data parity error
5 CXL_AER_UE_MEM_ADDR_PARITY CXL.mem path misdirected access; cross-host contamination
6 CXL_AER_UE_MEM_BE_PARITY CXL.mem path partial-write corruption
7 CXL_AER_UE_MEM_DATA_ECC CXL.mem uncorrectable ECC; permanent data loss; host receives poison
8 CXL_AER_UE_REINIT_THRESH link link re-initialization threshold; device momentarily unavailable
9 CXL_AER_UE_RSVD_ENCODE link protocol-level error
10 CXL_AER_UE_POISON device peer poison received from the peer; may cause a host MCE when consumed
11 CXL_AER_UE_RECV_OVERFLOW link receiver buffer overflow; flits dropped
14 CXL_AER_UE_INTERNAL_ERR vendor-defined severity is vendor-defined
15 CXL_AER_UE_IDE_TX_ERR IDE transmitter link integrity/confidentiality compromised
16 CXL_AER_UE_IDE_RX_ERR IDE receiver authentication/decryption failure

5.3 AER Correctable (ras-cxl-handler.c:290-330)

Bit Name Component System impact
0 CXL_AER_CE_CACHE_DATA_ECC device cache cache ECC corrected; monitor the rate
1 CXL_AER_CE_MEM_DATA_ECC CXL.mem memory ECC corrected; trend warning
2 CXL_AER_CE_CRC_THRESH link CRC threshold; signal-integrity alarm
3 CXL_AER_CE_RETRY_THRESH link retry threshold; latency/performance impact
4 CXL_AER_CE_CACHE_POISON device cache cache poison received
5 CXL_AER_CE_MEM_POISON CXL.mem memory poison received
6 CXL_AER_CE_PHYS_LAYER_ERR PHY peer PHY error

5.4 General Media Event Record (GMER, CXL 3.1 §8.2.9.2.1.1 Table 8-45)

memory_event_type:

Value Name
0 ECC Error
1 Invalid Address
2 Data Path Error
3 TE State Violation
4 Scrub Media ECC Error
5 Adv Prog CME Counter Expiration
6 CKID Violation

memory_event_sub_type (0-5): Not Reported / Internal Datapath / Media Link Cmd/CTL/Dat Training / Media Link CRC

transaction_type (0-8): Unknown / Host Read / Host Write / Host Scan Media / Host Inject Poison / Internal Media Scrub / Internal Media Management / Internal Media Error Check Scrub / Media Initialization

Descriptor flags: UNCORRECTABLE (bit 0), THRESHOLD (bit 1), POISON_LIST_OVERFLOW (bit 2)
DPA flags: VOLATILE (bit 0), NOT_REPAIRABLE (bit 1)
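
Flag words like these map naturally onto Python's `enum.IntFlag`. The sketch below is illustrative only: the class names `GmerDescriptor` and `DpaFlags` and the helper `flag_names` are hypothetical, while the bit assignments are the ones listed above.

```python
import enum


class GmerDescriptor(enum.IntFlag):
    """Descriptor flags of a General Media Event Record."""
    UNCORRECTABLE = 1 << 0
    THRESHOLD = 1 << 1
    POISON_LIST_OVERFLOW = 1 << 2


class DpaFlags(enum.IntFlag):
    """Flags carried in the low bits of the DPA field."""
    VOLATILE = 1 << 0
    NOT_REPAIRABLE = 1 << 1


def flag_names(value: int, flag_type) -> list:
    """Return the names of the flags set in value, in definition order."""
    return [member.name for member in flag_type if value & member.value]
```

With this helper, a descriptor value of 0b101 reports UNCORRECTABLE and POISON_LIST_OVERFLOW, and bits that have no defined flag are simply ignored.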

5.5 DRAM Event Record (DER, CXL 3.1 §8.2.9.2.1.2 Table 8-46)

memory_event_type: 0=Media ECC / 1=Scrub Media ECC / 2=Invalid Address / 3=Data Path Error / 4=TE State Violation / 5=Adv Prog CME Counter Expiration / 6=CKID Violation

A DER carries the full geometry: channel / sub_channel / rank / bank_group / bank / row / column / nibble_mask / correction_mask.

The DER threshold event is the only CXL path in rasdaemon that triggers an automatic page offline (ras-cxl-handler.c:1244-1249, ras_hw_threshold_pageoffline(hpa)); it requires HAVE_MEMORY_CE_PFA.

5.6 Memory Module Event Record (MMER, CXL 3.1 §8.2.9.2.1.3 Table 8-47)

event_type (0-8): Health Status / Media Status / Life Used / Temperature / Data Path / LSA / Unrecoverable Internal Sideband Bus / Memory Media FRU / Power Management Fault

event_sub_type (0-3): Not Reported / Invalid Config Data / Unsupported Config Data / Unsupported Memory Media FRU

health_status flags: MAINTENANCE_NEEDED / PERFORMANCE_DEGRADED / REPLACEMENT_NEEDED / MEM_CAPACITY_DEGRADED
media_status (0-7): Normal / Not Ready / Write Persistency Lost / All Data Lost / write persistency will be lost on power loss / Imminent / all data about to be lost / All Data Loss Imminent

5.7 Memory Sparing Event Record (MSER, CXL 3.2 §8.2.10.2.1.4 Table 8-60)

flags: QUERY_RESOURCES(BIT0) / HARD_SPARING(BIT1) / DEVICE_INITIATED(BIT2)

Note: the MSER handler does not persist to SQLite (it has no #ifdef HAVE_SQLITE3 block); this is a known omission.

5.8 Common event header / Poison List

  • Common Event Record Flags (per event): PERMANENT(2) / MAINT_NEEDED(3) / PERF_DEGRADED(4) / HW_REPLACE(5) / MAINT_OP_SUB_CLASS_VALID(6) / LD_ID_VALID(7) / HEAD_ID_VALID(8)
  • Poison List Event (cxl_poison): Source ∈ {Unknown(0), External(1), Internal(2), Injected(3), Vendor(7)}; Flags MORE/OVERFLOW/SCANNING
  • Event Record Overflow (cxl_overflow): log_type ∈ {Info, Warn, Failure, Fatal}; N events were lost, leaving a blind spot
  • Generic Event Record: 80 bytes of raw data; hdr_uuid identifies the decoder

5.9 CXL discovery / topology

  • Does not walk /sys/bus/cxl/
  • Does not keep an in-memory component cache
  • Relies entirely on the memdev / host / serial / region / region_uuid / comp_id fields that the kernel fills into the trace payload

6. extlog (Extended Log) error types

ras-extlog-handler.c err_type():

etype Name Class System impact
0 unknown unclassified
1 no error informational
2 single-bit ECC CE corrected
3 multi-bit ECC UE data corruption
4 single-symbol chipkill ECC CE whole symbol corrected
5 multi-symbol chipkill ECC UE multiple symbols corrupted; page offline / DIMM replacement
6 master abort UE bus master abort
7 target abort UE bus target abort
8 parity error UE parity error
9 watchdog timeout UE memory-subsystem watchdog timeout
10 invalid address UE access to an invalid address
11 mirror Broken UE mirror pair broken → fail over to the backup
12 memory sparing INFO spare rank activated
13 scrub corrected error CE corrected by patrol scrub
14 scrub uncorrected error UE UE found by patrol scrub
15 physical memory map-out event INFO physical page mapped out

err_severity(): recoverable (sev=0, LOG_CRIT) / fatal (sev=1, LOG_EMERG) / corrected (sev=2, LOG_ERR) / informational (sev=3, LOG_INFO)


7. Non-Standard (vendor-defined) decoding

7.1 Generic GHES severity mapping

ras-non-standard-handler.c:

  • GHES_SEV_NO → Informational
  • GHES_SEV_CORRECTED → Corrected
  • GHES_SEV_RECOVERABLE → Recoverable
  • GHES_SEV_PANIC → Fatal

7.2 HiSilicon HIP08

UUID 1f8161e1-... (Type1) / 45534ea6-... (Type2) / b2889fc9-... (PCIe Local)

OEM Type-1 module_id:

id Name Submodules
0 MN (Miscellaneous Node)
1 PLL TB_PLL0-3 / TA_PLL0-3 / NIMBUS_PLL0-4
2 SLLC TB_SLLC0-2 / TA_SLLC0-2 / NIMBUS_SLLC0-1
3 AA
4 SIOE TB_SIOE0-3 / TA_SIOE0-3 / NIMBUS_SIOE0-1
5 POE TB_POE / TA_POE
8 DISP HAC / PCIE / IO_MGMT / NETWORK
9 LPC
13 GIC
14 RDE
15 SAS SAS0/1
16 SATA
17 USB

OEM Type-2 module_id:

id Name
0 SMMU (HAC/PCIE/MGMT/NIC)
1 HHA (Hydra Home Agent) — TB/TA_HHA0-1
2 PA (Proxy Agent)
3 HLLC — HLLC0-2
4 DDRC (DDR Controller) — TB/TA_DDRC0-3
5 L3T (L3 Tag) — TB/TA_PARTITION0-7
6 L3D (L3 Data) — TB/TA_BANK0-3

PCIe Local sub_module_id:

id Name
0 AP (Application Layer)
1 TL (Transaction Layer)
2 MAC
3 DL (Data Link Layer)
4 SDI

7.3 HiSilicon Common (Kunpeng916/920/930)

UUID c8b328a8-..., with about 50 module names (MN/PLL/SLLC/AA/SIOE/POE/CPA/DISP/GIC/ITS/AVSBUS/CS/PPU/SMMU/PA/HLLC/DDRC/L3TAG/L3DATA/PCS/HHA/PCIe Local/SAS/SATA/NIC/RoCE/USB/ZIP/HPRE/SEC/RDE/MEE/L4D/Tsensor/ROH/BTC/HILINK/STARS/SDMA/UC/HBMC/PMC/SCHE/ASMB_DFS/ASMB_NTU/UB/UMMU/PCU/UCMI/DJTAGM/CFGBUS/MPU/CRG)

Severity: NFE=0 (recoverable) / FE=1 (fatal) / CE=2 (corrected) / NONE=3

7.4 Ampere

UUID e8ed898d-...

type Name Sub-errors
0 CPM Snoop-Logic, ARMv8 Core 0/1
1 MCU (Memory Controller Unit) ERR0-6, Link Error
2 MESH Cross Point, Home Node IO/Memory, CCIX Node
3 2P Link Altra
4 2P Link Altra Max ERR0-3
5 GIC ERR0-12, ITS 0-7
6 SMMU TBU0-9, TCU
7 PCIe AER
8 PCIe RASDP
9 OCM (On-Chip Memory) ERR0-2
10 SMPRO ERR0/ERR1/MPA_ERR
11 PMPRO ERR0/ERR1/MPA_ERR
12 ATF FW EL3, SPM, Secure Partition
13 SMPRO FW
14 PMPRO FW
63 BERT Boot Error Record Table

Payload types: 0=ARMv8 RAS (APEI/BMC) / 1=PCIe AER / 2=PCIe RASDP / 3=Firmware-Specific (ATF/SMpro/PMpro/BERT)

7.5 NVIDIA

UUID 6d5244f2-... (added in a recent commit)

Fields decoded by nvidia_ns_decode(): signature[16], error_type, error_instance, severity, socket, number_regs, instance_base, regs[] (addr/value pairs).

7.6 Jaguar Micro (Corsica1.0)

5 UUIDs + 15 subsystems:

subsystem_id Name Modules
0 AP/CSUB CORE
1 CMN MXP, HNI, HNF, SBSX, CCG, HND
2 DDRH DDRCtrl, DDRPHY, SRAM
3 DDRV DDRCtrl, DDRPHY, SRAM
4 GIC GICIP, GICSRAM
5 IOSUB SMMU(TBU/TCU), NIC450, OTHER(RAM)
6 SCP SRAM, WDT, PLL
7 MCP SRAM, WDT
8 IMU0 SRAM, WDT
9 DPE EPG, PIPE, EMEP, IMEP, EPAE, IPAE, ETH, TPG, MIG, HIG, DPETOP, SMMU
10 RPE TOP, TXP_RXP, SMMU
11 PSUB PCIE0(RAS0/RAS1), UP_MIX, PCIE1, PTOP, N2IF, VPE0/1_RAS, X2RC/X16RC_SMMU, SDMA_SMMU
12 HAC SRAM, SMMU
13 TCM SRAM, SMMU, IP
14 IMU1 SRAM, WDT

Severity: 0=recoverable (NFE) / 1=fatal (FE) / 2=corrected (CE) / 3=none

7.7 Yitian (Alibaba T-Head)

UUID a6980811-...

YITIAN_RAS_TYPE_DDR=0x50: DDR ECC register dump (ECCCFG0/1, ECCSTAT, ECCERRCNT, ECCCADDR0/1, ECCCSYN0-2, ECCUADDR0/1, ECCUSYN0-2, ECCBITMASK0-2, ADVECCSTAT, ECCAPSTAT, ECCCDATA0/1, ECCUDATA0/1, ECCSYMBOL, ECCERRCNTCTL/STAT, ECCERRCNT0/1, RESERVED0-2)


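8. memory-failure error codes

ras-memory-failure-handler.c. Page types (from the kernel's enum mf_action_page_type):

Value Name System impact
0 MF_MSG_KERNEL poison hit a reserved kernel page; usually panic
1 MF_MSG_KERNEL_HIGH_ORDER high-order kernel allocation page; usually fatal
2 MF_MSG_SLAB kernel object slab; usually fatal
3 MF_MSG_DIFFERENT_COMPOUND compound page changed under lock; retry/abort
4 MF_MSG_HUGE in-use hugepage; migrate / kill consumers
5 MF_MSG_FREE_HUGE free hugepage; offlined
6 MF_MSG_UNMAP_FAILED cannot unmap; page stays poisoned
7 MF_MSG_DIRTY_SWAPCACHE dirty swap cache; possible data loss; task killed
8 MF_MSG_CLEAN_SWAPCACHE clean swap cache; dropped and reloaded
9 MF_MSG_DIRTY_MLOCKED_LRU dirty mlocked; task killed
10 MF_MSG_CLEAN_MLOCKED_LRU clean mlocked; dropped and reloaded
11 MF_MSG_DIRTY_UNEVICTABLE_LRU dirty unevictable; task killed
12 MF_MSG_CLEAN_UNEVICTABLE_LRU clean unevictable; dropped
13 MF_MSG_DIRTY_LRU dirty LRU; task killed; possible data loss
14 MF_MSG_CLEAN_LRU clean LRU; dropped and reloaded
15 MF_MSG_TRUNCATED_LRU truncated LRU; recovered
16 MF_MSG_BUDDY free buddy page; removed from the free list
17 MF_MSG_DAX DAX (pmem / CXL persistent); application gets SIGBUS
18 MF_MSG_UNSPLIT_THP THP split failed; task killed
19 MF_MSG_UNKNOWN unknown; logged only

action_result:

Value Name Meaning
0 MF_IGNORED cannot be handled; ignored
1 MF_FAILED handling failed; may need a panic
2 MF_DELAYED handling deferred
3 MF_RECOVERED successfully recovered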

9. devlink health events

ras-devlink-handler.c:

Event Fields Typical system impact
net:net_dev_xmit_timeout driver, name, queue NIC TX queue hang; triggers a NIC reset
devlink:devlink_health_report bus_name, dev_name, driver_name, reporter_name, msg the driver reports a health/RAS event through a devlink reporter; reporter_name identifies which reporter fired (e.g. mlx5 tx/fw/hw_err)

There are no fixed error codes; the reporter content is defined by the driver.


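10. diskerror error codes

ras-diskerror-handler.c consumes block:block_rq_error.

| errno | Name | System impact |
|---|---|---|
| -EOPNOTSUPP | operation not supported | The block device does not support the operation |
| -ETIMEDOUT | timeout | I/O timeout; retry or replace the disk |
| -ENOSPC | critical space allocation | Thin-provisioned space exhausted |
| -ENOLINK | recoverable transport | SAS/FC link error, recoverable |
| -EREMOTEIO | critical target | Critical SCSI target error; failover |
| -EBADE | critical nexus | Critical I_T nexus error |
| -ENODATA | critical medium | Disk medium error (bad sector); remap; replace the disk |
| -EILSEQ | protection | T10 PI / DIF protection error |
| -ENOMEM | kernel resource | Kernel allocator failure |
| -EBUSY | device resource | Device resources exhausted |
| -EAGAIN | nonblocking retry | Non-blocking retry |
| -EREMCHG | dm internal retry | device-mapper internal retry |
| -EIO | I/O error | Generic I/O failure |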
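The table above is essentially a lookup from a negative errno to a block-layer status string. A minimal sketch of that lookup follows; the numeric values are the generic Linux errno numbers (some architectures use different ones), and `describe_blk_error` is a hypothetical name used only for illustration.

```python
# Negative errno -> (symbolic name, status string), following the table above.
# Assumption: generic Linux errno numbering (e.g. x86, arm64, riscv).
BLK_ERRORS = {
    -95: ("EOPNOTSUPP", "operation not supported"),
    -110: ("ETIMEDOUT", "timeout"),
    -28: ("ENOSPC", "critical space allocation"),
    -67: ("ENOLINK", "recoverable transport"),
    -121: ("EREMOTEIO", "critical target"),
    -52: ("EBADE", "critical nexus"),
    -61: ("ENODATA", "critical medium"),
    -84: ("EILSEQ", "protection"),
    -12: ("ENOMEM", "kernel resource"),
    -16: ("EBUSY", "device resource"),
    -11: ("EAGAIN", "nonblocking retry"),
    -78: ("EREMCHG", "dm internal retry"),
    -5: ("EIO", "I/O error"),
}


def describe_blk_error(err):
    """Hypothetical helper: format a block error as 'NAME (errno): text'."""
    name, text = BLK_ERRORS.get(err, ("?", "unknown block error"))
    return f"{name} ({err}): {text}"
```

A medium error would then print as `describe_blk_error(-61)`, which is the case that most often points to a failing disk.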

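11. signal error codes (ras-signal-handler.c)

SIGBUS codes:

| Code | Name | Meaning |
|---|---|---|
| 1 | BUS_ADRALN | Invalid address alignment |
| 2 | BUS_ADRERR | Nonexistent physical address |
| 3 | BUS_OBJERR | Object-specific hardware error |
| 4 | BUS_MCEERR_AR | Hardware memory error already consumed (action required); process killed and page taken offline |
| 5 | BUS_MCEERR_AO | Hardware memory error detected but not yet consumed; recovery is optional |

signal:signal_generate results:

| Value | Name | Meaning |
|---|---|---|
| 0 | TRACE_SIGNAL_DELIVERED | Delivered |
| 1 | TRACE_SIGNAL_IGNORED | Ignored |
| 2 | TRACE_SIGNAL_ALREADY_PENDING | Already pending, no-op |
| 3 | TRACE_SIGNAL_OVERFLOW_FAIL | Queue full |
| 4 | TRACE_SIGNAL_LOSE_INFO | siginfo lost |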
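To show how the two tables above are typically read together, the following sketch decodes a SIGBUS code and a signal_generate result and flags the two machine-check-related codes (4 and 5). `describe_sigbus` is a hypothetical helper written for this illustration, not part of rasdaemon.

```python
# SIGBUS si_code values and signal_generate results, as listed above.
SIGBUS_CODES = {
    1: "BUS_ADRALN",
    2: "BUS_ADRERR",
    3: "BUS_OBJERR",
    4: "BUS_MCEERR_AR",
    5: "BUS_MCEERR_AO",
}
SIGNAL_RESULTS = [
    "TRACE_SIGNAL_DELIVERED",
    "TRACE_SIGNAL_IGNORED",
    "TRACE_SIGNAL_ALREADY_PENDING",
    "TRACE_SIGNAL_OVERFLOW_FAIL",
    "TRACE_SIGNAL_LOSE_INFO",
]


def describe_sigbus(code, result):
    """Return (code name, result name, is_hardware_memory_error)."""
    name = SIGBUS_CODES.get(code, f"unknown({code})")
    if 0 <= result < len(SIGNAL_RESULTS):
        outcome = SIGNAL_RESULTS[result]
    else:
        outcome = f"unknown({result})"
    # Codes 4 and 5 are the ones raised for hardware memory errors.
    return name, outcome, code in (4, 5)
```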

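12. reri (RISC-V RAS Error Report Register Interface)

RERI_EC_* error codes in ras-reri-handler.h:

| Code | Name | Class | System impact |
|---|---|---|---|
| 0 | RERI_EC_NONE | | |
| 1 | RERI_EC_OUE | Unknown | Unspecified error |
| 2 | RERI_EC_CDA | Cache | Corrupted data access |
| 3 | RERI_EC_CBA | Cache | Cache block data error |
| 4 | RERI_EC_CSD | Cache | Detected by cache scrubbing |
| 5 | RERI_EC_CAS | Cache | Cache address/state error |
| 6 | RERI_EC_CUE | Cache | Unspecified cache error |
| 7 | RERI_EC_SDC | Microarchitecture | Snoop/directory address or control state error |
| 8 | RERI_EC_SUE | Unknown | Unspecified snoop/directory error |
| 9 | RERI_EC_TPD | TLB | TLB / page-table cache data error |
| 10 | RERI_EC_TPA | TLB | TLB / page-table address or control state error |
| 11 | RERI_EC_TPU | TLB | Unknown TLB / page-table error |
| 12 | RERI_EC_HSE | Microarchitecture | Hart state error |
| 13 | RERI_EC_ICS | Unknown | Interrupt controller state error |
| 14 | RERI_EC_ITD | Microarchitecture | Interconnect data error |
| 15 | RERI_EC_ITO | Microarchitecture | Other interconnect error |
| 16 | RERI_EC_IWE | Microarchitecture | Internal watchdog error |
| 17 | RERI_EC_IDE | Microarchitecture | Internal datapath / memory / execution unit error |
| 18 | RERI_EC_SBE | Bus | System memory command/address bus error |
| 19 | RERI_EC_SMU | Microarchitecture | Unspecified system memory error |
| 20 | RERI_EC_SMD | Microarchitecture | System memory data error |
| 21 | RERI_EC_SMS | Microarchitecture | Detected by system memory scrubbing |
| 22 | RERI_EC_PIO | Microarchitecture | Protocol error: illegal I/O |
| 23 | RERI_EC_PUS | Microarchitecture | Protocol error: unexpected state |
| 24 | RERI_EC_PTO | Microarchitecture | Protocol error: timeout |
| 25 | RERI_EC_SIC | Microarchitecture | System internal controller error |
| 26 | RERI_EC_DPU | Unknown | Deferred error passthrough not supported |
| 27 | RERI_EC_PCX | Unknown | Error detected by PCI/CXL |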

Transaction types (TT): 0 = Unspecified / 1 = Custom / 4 = Explicit Read / 5 = Explicit Write / 6 = Implicit Read / 7 = Implicit Write

Address info types (AIT): 0 = None / 1 = SPA (Supervisor Physical) / 2 = GPA / 3 = VA

Source types: 0 = CPU / 1 = IOMMU / 2 = Unknown

Severity derivation: UEC→FATAL, UED→RECOVERABLE, CE→CORRECTED, otherwise INFORMATIONAL

Behavior: a FATAL or RECOVERABLE error from a CPU source triggers ras_record_cpu_error(hart_id) (requires HAVE_CPU_FAULT_ISOLATION); events of severity RECOVERABLE or higher are also reported to ABRT.
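The severity derivation and follow-up behavior described above can be sketched as two small functions. This is an illustration under stated assumptions: the three flags are modeled as plain booleans checked in the order UEC, UED, CE, and the action names returned are labels chosen here, not identifiers guaranteed to exist in rasdaemon (only `ras_record_cpu_error` is named in the text).

```python
def reri_severity(uec, ued, ce):
    """Derive severity from the UEC/UED/CE flags (checked in that order)."""
    if uec:
        return "FATAL"
    if ued:
        return "RECOVERABLE"
    if ce:
        return "CORRECTED"
    return "INFORMATIONAL"


def reri_actions(source, severity):
    """List the follow-up actions described in the text for one event.

    `source` is one of "CPU", "IOMMU", "Unknown". The returned strings are
    illustrative labels only.
    """
    actions = []
    serious = severity in ("FATAL", "RECOVERABLE")
    if source == "CPU" and serious:
        # Needs HAVE_CPU_FAULT_ISOLATION according to the text.
        actions.append("ras_record_cpu_error")
    if serious:
        actions.append("abrt_report")
    return actions
```

With these definitions, an uncorrected-deferred error from an IOMMU source is RECOVERABLE and would only be reported to ABRT, while the same error from a CPU source would additionally be counted against the hart.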


13. erst (APEI ERST): MCE replay

ras-erst.c:

  • Consumes the /sys/fs/pstore/erst/mce-erst* files
  • Decodes them through the existing MCE handlers (Intel parse_intel_event, AMD K8 parse_amd_k8_event, AMD SMCA parse_amd_smca_event)
  • Emits a synthetic mce_erst_record event
  • Deletes the files when ERST_DELETE=1
  • Does not support other APEI ERST record types such as CPER generic or AER
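The file selection and the delete switch from the list above can be approximated as follows. This is only a sketch: it assumes ERST_DELETE is visible as an environment variable, and the helper names `select_mce_erst` and `should_delete` are hypothetical.

```python
import fnmatch
import os


def select_mce_erst(names):
    """Keep only the file names that match the mce-erst* pattern, sorted."""
    return sorted(n for n in names if fnmatch.fnmatch(n, "mce-erst*"))


def should_delete(env=os.environ):
    """Return True when ERST_DELETE=1 is set (assumed to be an env variable)."""
    return env.get("ERST_DELETE") == "1"
```

On a real system the input would come from something like `os.listdir("/sys/fs/pstore/erst")`; other pstore record types in the same directory are filtered out, which mirrors the statement that only MCE records are replayed.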

14. Global syslog severity mapping

| Event class | Log level |
|---|---|
| MCE Uncorrected / Deferred | LOG_CRIT |
| MCE Fatal | LOG_CRIT |
| AER Corrected | LOG_ERR |
| AER Uncorrected Non-Fatal | LOG_CRIT |
| AER Uncorrected Fatal | LOG_EMERG |
| MC Corrected | LOG_ERR |
| MC Uncorrected / Deferred | LOG_CRIT |
| MC Fatal | LOG_EMERG |
| MC Info | LOG_DEBUG |
| extlog recoverable | LOG_CRIT |
| extlog fatal | LOG_EMERG |
| extlog corrected | LOG_ERR |
| extlog informational | LOG_INFO |
| CXL Poison | LOG_ERR |
| CXL AER UE | LOG_CRIT |
| memory-failure | LOGLEVEL_ALERT |
| diskerror | LOG_ERR |
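One practical use of this mapping is deciding which event classes would pass a given syslog threshold. The sketch below encodes the table with the standard syslog priority numbers (0 = emergency, 7 = debug); treating LOGLEVEL_ALERT as equivalent to LOG_ALERT is an assumption, and `events_at_least` is a hypothetical helper.

```python
# Standard syslog priorities: a lower number means a more severe message.
SYSLOG_PRIORITY = {
    "LOG_EMERG": 0, "LOG_ALERT": 1, "LOG_CRIT": 2, "LOG_ERR": 3,
    "LOG_WARNING": 4, "LOG_NOTICE": 5, "LOG_INFO": 6, "LOG_DEBUG": 7,
}

# Event class -> log level, copied from the table above.
EVENT_LEVEL = {
    "MCE Uncorrected / Deferred": "LOG_CRIT",
    "MCE Fatal": "LOG_CRIT",
    "AER Corrected": "LOG_ERR",
    "AER Uncorrected Non-Fatal": "LOG_CRIT",
    "AER Uncorrected Fatal": "LOG_EMERG",
    "MC Corrected": "LOG_ERR",
    "MC Uncorrected / Deferred": "LOG_CRIT",
    "MC Fatal": "LOG_EMERG",
    "MC Info": "LOG_DEBUG",
    "extlog recoverable": "LOG_CRIT",
    "extlog fatal": "LOG_EMERG",
    "extlog corrected": "LOG_ERR",
    "extlog informational": "LOG_INFO",
    "CXL Poison": "LOG_ERR",
    "CXL AER UE": "LOG_CRIT",
    "memory-failure": "LOG_ALERT",  # listed as LOGLEVEL_ALERT in the table
    "diskerror": "LOG_ERR",
}


def events_at_least(level):
    """Return the event classes logged at `level` or more severe, sorted."""
    limit = SYSLOG_PRIORITY[level]
    return sorted(
        event for event, lvl in EVENT_LEVEL.items()
        if SYSLOG_PRIORITY[lvl] <= limit
    )
```

A log pipeline that only forwards emergency messages would, under this table, see just the three classes returned by `events_at_least("LOG_EMERG")`.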

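15. Summary: "what it is" vs. "what it does"

| Category | rasdaemon records | rasdaemon automatic action | Actual kernel action |
|---|---|---|---|
| MCE CE | | | Corrected automatically |
| MCE UC | | | #MC handler (kill process / panic) |
| MC CE | | Page/row PFA → sysfs write | soft/hard_offline_page |
| MC UE/Fatal | | Triggers the MC_UE_TRIGGER script | MCE / SIGBUS |
| AER CE | | Triggers the AER_CE_TRIGGER script | AER automatic retrain |
| AER UE | | Triggers the AER_UE_TRIGGER script | AER reset / hot-plug handling |
| CXL Poison | | | Kernel error handling |
| CXL DER CE-threshold | | ras_hw_threshold_pageoffline() | Soft offline |
| ARM | | CPU error counting (optional) | GHES handling (already done) |
| extlog | | | Already done |
| memory-failure | | Triggers the MEM_FAIL_TRIGGER script | hwpoison completed |
| devlink | | | Handled inside the driver |
| diskerror | | | Block-device retry / upper-layer handling |
| signal | SIGBUS 4/5 | | Kill task / deliver signal |
| reri | | CPU error counting (optional) | Kernel handling (already done) |
| erst | | Deletes the file when ERST_DELETE=1 | pstore replay at boot |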

16. Key file index

| File | Lines | Purpose |
|---|---|---|
| ras-mce-handler.c | 636+ | MCE dispatcher, report_mce_event, ras_offline_mce_event |
| mce-intel.c | 332+ | Generic Intel MCA decoding + AR status + memory controller |
| mce-intel-{nehalem,sb,ivb,haswell,broadwell-de,broadwell-epex,dunnington,knl,skylake-xeon,i10nm,granite,tulsa,p4-p6}.c | 16 files | Platform-specific PCU/QPI/UPI/M2M/iMC error codes |
| mce-amd.c | 124+ | AMD generic + decode_amd_errcode |
| mce-amd-k8.c | 252+ | K8 northbridge extended errors |
| mce-amd-smca.c | 998+ | Full SMCA bank table (LS/IF/L2/DE/EX/FP/L3/CS/PIE/UMC/MA_LLC/PB/PSP/SMU/MP5/MPDMA/NBIO/PCIE/XGMI_PCS/XGMI_PHY/NBIF/SATA/USB/USR_DP/USR_CP/GMI_PCS) |
| mce-zhaoxin-kh50000.c | 400+ | All Zhaoxin KH-50000 error codes |
| ras-mc-handler.c | 348+ | 5 EDAC error types + PFA trigger |
| ras-page-isolation.c | 850+ | Page/row PFA + sysfs writes |
| ras-aer-handler.c | 349+ | Decoding of 24+ PCIe AER bits |
| ras-arm-handler.c | 600+ | ARM PEI decoding |
| ras-cxl-handler.c | 1674+ | 9 CXL events + 30+ AER sub-codes + 5 record classes |
| ras-extlog-handler.c | | 16 extlog err_type values + 4 severities |
| ras-non-standard-handler.c | | CPER section dispatch |
| non-standard-{hisi_hip08,hisilicon,ampere,nvidia,jaguarmicro,yitian}.c | | 6 vendor decoders |
| ras-memory-failure-handler.c | 229+ | 20 page types + 4 results |
| ras-devlink-handler.c | | 2 devlink events |
| ras-diskerror-handler.c | | 13 errno values |
| ras-signal-handler.c | | 5 SIGBUS codes + 5 signal results |
| ras-reri-handler.c | | 27 RERI_EC_* codes |
| ras-erst.c | | MCE ERST replay |
| ras-events.c | 1249+ | Registration and dispatch of all events |
| ras-events.h | | All enums (severity, page_type, event type) |
| ras-record.c | | All SQLite schemas |

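17. Debugging and quick-lookup tips

| What you want to know | Where to look |
|---|---|
| MCE error code → string | mce-error.c tool (util/) |
| CXL AER detailed bits | util/ras-mc-ctl.in:1205-1357 |
| SQLite schema | ras-record.c |
| Severity mapping | loglevel_str[] arrays in *-handler.c |
| Trigger scripts | trigger.c + ras-*-handler.c *_trigger_setup |
| Page retirement | ras-page-isolation.c + sysfs soft_offline_page |
| ERST replay | /sys/fs/pstore/erst/ + ras-erst.c |
| ABRT reports | ras-report.c |
| OpenBMC SEL | unified-sel.c |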
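When the SQLite backend is enabled, a quick way to see which record types a database actually contains is to list its tables rather than assume a schema. The sketch below does only that through the generic `sqlite_master` catalog; `list_tables` is a hypothetical helper, and the database path mentioned afterwards is an assumption that may differ per distribution.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

A typical call would open the rasdaemon database read-only, for example `sqlite3.connect("file:/var/lib/rasdaemon/ras-mc_event.db?mode=ro", uri=True)` (path assumed), pass the connection to `list_tables`, and then consult ras-record.c for the columns of each table.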