rasdaemon-fault-codes
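rasdaemon Fault Codes and System Impact: Complete Reference
Scope: all rasdaemon fault domains (MCE / EDAC / AER / ARM / CXL / extlog / non-standard / memory-failure / devlink / diskerror / signal / reri / erst)
Version baseline: current master branch (as of 2026-06-02)
Nature: error-code dictionary + system-impact analysis + source provenance (file:line)
0. Overview: which events rasdaemon handles
rasdaemon is a passive observer and recorder of Linux RAS (Reliability, Availability, Serviceability) events: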
| Domain | Trace event | Main source files |
|---|---|---|
| MCE | ras:mce_record | ras-mce-handler.c + 16 mce-*.c decoders |
| EDAC/MC | ras:mc_event | ras-mc-handler.c, ras-page-isolation.c |
| PCIe AER | ras:aer_event | ras-aer-handler.c |
| ARM | ras:arm_event | ras-arm-handler.c |
| CXL | 9 cxl:* events | ras-cxl-handler.c |
| extlog | ras:extlog_mem_event | ras-extlog-handler.c |
| non-standard | various CPER sections | ras-non-standard-handler.c + 6 vendor decoders |
| memory-failure | ras:memory_failure_event | ras-memory-failure-handler.c |
| devlink | devlink:devlink_health_report, net:net_dev_xmit_timeout | ras-devlink-handler.c |
| diskerror | block:block_rq_error | ras-diskerror-handler.c |
| signal | signal:signal_generate | ras-signal-handler.c |
| reri | riscv_reri_event | ras-reri-handler.c |
| erst | files: /sys/fs/pstore/erst/mce-erst* | ras-erst.c |
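rasdaemon does not itself panic the system, reset a CPU, kill a process, or poison a page; those policies are decided entirely by the kernel. rasdaemon only:
- parses events
- logs to syslog / the console
- writes to SQLite (if built with --enable-sqlite3)
- reports to ABRT (if built with --enable-abrt-report)
- triggers user scripts (AER_CE_TRIGGER / AER_UE_TRIGGER / MC_CE_TRIGGER / MC_UE_TRIGGER / MEM_FAIL_TRIGGER)
- requests physical page retirement (only for EDAC CE / CXL DER CE-threshold events, by writing /sys/devices/system/memory/{soft,hard}_offline_page)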
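1. MCE (Machine Check Exception): complete error codes
1.1 Generic MCA architectural error codes (shared by all Intel CPUs)
Defined in mce-intel.c:98-106 (mca_msg[])
| MCA Code | Name | System impact |
|---|---|---|
| 0x0000 | No Error | None |
| 0x0001 | Unclassified | Unclassified error, usually an internal hardware fault |
| 0x0002 | Microcode ROM Parity Error | Parity error in the microcode ROM; a microcode update may be needed; usually UC+AR and triggers a package reset |
| 0x0003 | External Error | MCE raised through an external pin; the source has to be traced in hardware |
| 0x0004 | FRC (Functional Redundancy Check) Error | Master/checker mismatch in FRC mode; recoverable |
| 0x0005 | Internal Parity Error | Internal parity error (cache/tag/buffer); usually UC+PCC |
| 0x0006 | SMM Handler Code Access Violation | SMM handler code access violation (a serious SMM vulnerability) |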
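MCA compound-code classification (mce-intel.c:221-286). Except for the first three rows, a class is selected by which bit is the highest set bit of the 16-bit MCA code (the "prefix" bit):
| Pattern | Class | Meaning |
|---|---|---|
| 0x0000..0x0006 | The 7 entries above | Basic architectural errors |
| bit 12 set | Soft hint | "corrected filtering: same region has more unreported errors" |
| 0x000C..0x000F | Generic memory hierarchy error | LL0…LL3 generic memory error |
| 0x001x (prefix bit 4) | TLB error | TT (Instruction/Data/Generic) + LL (L0…L3) |
| 0x01xx (prefix bit 8) | Cache hierarchy error | TT/LL + RRRR (Read/Write/Instruction-Fetch/Prefetch/Eviction/Snoop, etc.) |
| 0x04xx..0x07xx (prefix bit 10) | Internal unclassified | 0x400 = Internal Timer error |
| 0x08xx..0x0Fxx (prefix bit 11) | Bus/interconnect error | LL + PP (Local-CPU-originated/Responded/Third-party/Generic) + RRRR + II + T |
| 0x008x..0x00Fx (prefix bit 7) | Memory controller error | Handled by decode_memory_controller() |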
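The classification can be sketched in Python as follows. This is a minimal illustration, assuming the compound error-code bit patterns from the Intel SDM (TLB = 0x001x, cache = 0x01xx, internal = 0x04xx, bus = 0x08xx and above, memory controller = 0x008x..0x00Fx); `classify_mca` is a hypothetical helper, not a rasdaemon function, and the branch order is an assumption.

```python
SIMPLE_CODES = [
    "No Error", "Unclassified", "Microcode ROM Parity Error",
    "External Error", "FRC Error", "Internal Parity Error",
    "SMM Handler Code Access Violation",
]

def classify_mca(mca: int) -> str:
    """Classify a 16-bit MCA error code (hypothetical helper)."""
    note = ""
    if mca & (1 << 12):              # corrected-filtering hint bit
        note = " (corrected filtering)"
        mca &= ~(1 << 12)
    if mca < len(SIMPLE_CODES):      # 0x0000..0x0006
        kind = SIMPLE_CODES[mca]
    elif mca >> 2 == 3:              # 0x000C..0x000F
        kind = f"Generic memory hierarchy error, level {mca & 3}"
    elif mca >> 4 == 1:              # highest set bit is bit 4
        kind = "TLB error"
    elif mca >> 8 == 1:              # highest set bit is bit 8
        kind = "Cache hierarchy error"
    elif mca >> 10 == 1:             # highest set bit is bit 10
        kind = "Internal Timer error" if mca == 0x400 else "Internal unclassified error"
    elif mca >> 11 == 1:             # highest set bit is bit 11
        kind = "Bus/interconnect error"
    elif mca >> 7 == 1:              # highest set bit is bit 7
        kind = "Memory controller error"
    else:
        kind = "Unknown"
    return kind + note
```

For example, `classify_mca(0x0135)` falls into the cache branch because bit 8 is its highest set bit, while `classify_mca(0x0090)` is routed to the memory controller branch.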
AR state bits (status[55:56]) → arstate[] (mce-intel.c:114-119):
| State bits | Name | Meaning | System impact |
|---|---|---|---|
| S=0, AR=0 | UCNA | Uncorrected, No Action | System continues, but the error should be logged |
| S=0, AR=1 | AR | Uncorrected, Action Required | Reset/reboot required |
| S=1, AR=0 | SRAO | Software Recoverable, Action Optional | Context may be recoverable |
| S=1, AR=1 | SRAR | Software Recoverable, Action Required | RIPV=1; software can continue |
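A minimal sketch of the lookup described by this table, assuming the two bits are used directly as an index with AR (bit 55) as the low bit and S (bit 56) as the high bit. `ARSTATE` and `ar_state` are illustrative names; the real decoder applies this only to uncorrected errors.

```python
# Index = (S << 1) | AR, taken from status bits 56 and 55.
ARSTATE = ["UCNA", "AR", "SRAO", "SRAR"]

def ar_state(status: int) -> str:
    """Return the AR-state name for a 64-bit MCi_STATUS value."""
    return ARSTATE[(status >> 55) & 0x3]
```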
Global status bits (decode_mci(), mce-intel.c:299-332):
| Status bit | Name | Meaning |
|---|---|---|
| bit 63 | VAL | Register valid (otherwise MCE_INVALID) |
| bit 62 | OVER | Error overflow (further errors were dropped) |
| bit 61 | UC | Uncorrected (1=UC, 0=CE) |
| bit 60 | EN | Error enabled |
| bit 57 | PCC | Processor Context Corrupt |
| bit 56 | S | Signaling (an uncorrected recoverable error was signaled through a machine check) |
| bit 55 | AR | Action Required |
| bit 54-53 | Tracking color | green = normal / yellow = "Large number of corrected cache errors" warning / res3 |
| bit 44 | Deferred (AMD) | Deferred error |
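The bit layout above can be exercised with a short Python sketch. It assumes the bit positions in the table; the value 0 of the tracking field is labelled "no tracking" following the Intel SDM, and `decode_status` is a hypothetical helper rather than rasdaemon's decode_mci().

```python
STATUS_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
               57: "PCC", 56: "S", 55: "AR"}
TRACKING = ["no tracking", "green", "yellow", "res3"]  # bits 54:53

def decode_status(status: int) -> dict:
    """Decode the global flag bits of a 64-bit MCi_STATUS value."""
    if not (status >> 63) & 1:
        return {"valid": False}          # VAL clear: nothing else is meaningful
    flags = [name for bit, name in STATUS_BITS.items() if (status >> bit) & 1]
    return {"valid": True,
            "flags": flags,
            "tracking": TRACKING[(status >> 53) & 0x3]}
```

A status word with VAL, UC and PCC set decodes to the flag list `["VAL", "UC", "PCC"]`, which is the combination the impact tables treat as fatal.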
1.2 Special banks: thermal / timeout
mce-intel.c:152-192:
| Bank | Error | Description | System impact |
|---|---|---|---|
| 128+0 (THERMAL) | bit 0 set | "Processor N heated above trip temperature. Throttling enabled" | CPU is throttled; performance impact; check system cooling |
| 128+0 (THERMAL) | bit 0 clear | "below trip temperature. Throttling disabled" | Back to normal |
| 128+90 (TIMEOUT) | "Timeout waiting for exception on other CPUs" | A remote CPU hit a fatal MCE but did not report it | Fatal |
1.3 Intel platform error codes by generation
1.3.1 Nehalem internal errors (mce-intel-nehalem.c:75-83)
| MSC | Name | Impact |
|---|---|---|
| 0x00 | No Error | None |
| 0x03 | Reset firmware did not complete | Boot failure |
| 0x08 | Received an invalid CMPD | Corrupted packet |
| 0x0A | Invalid Power Management Request | Illegal power-management request |
| 0x0D | Invalid S-state transition | Sleep state machine anomaly |
| 0x11 | VID controller does not match POC controller selected | Voltage controller mismatch |
| 0x1A | MSID from POC does not match CPU MSID | Platform ID mismatch |
1.3.2 Sandy Bridge PCU bank 4 (mce-intel-sb.c:16-46)
| MSC | Name |
|---|---|
| 0x0D | MC_IMC_FORCE_SR_S3_TIMEOUT |
| 0x0E | MC_MC_CPD_UNCPD_ST_TIMEOUT |
| 0x0F | MC_PKGS_SAFE_WP_TIMEOUT |
| 0x43 | MC_PECI_MAILBOX_QUIESCE_TIMEOUT |
| 0x5C | MC_MORE_THAN_ONE_LT_AGENT |
| 0x60…0x64 | MC_INVALID_PKGS_REQ_PCH/QPI/RES/PKGC_RES_PCH/STATE_CONFIG |
| 0x70…0x72 | MC_WATCHDG_TIMEOUT_PKGC_SLAVE/MASTER/PKGS_MASTER |
| 0x7A | MC_HA_FAILSTS_CHANGE_DETECTED |
| 0x81 | MC_RECOVERABLE_DIE_THERMAL_TOO_HOT |
1.3.3 Sandy Bridge / IVB memory controller banks 8-11 (mce-intel-sb.c:54-62)
| Bit | Name |
|---|---|
| 0x001 | Address parity |
| 0x002 | HA Wrt buffer data parity |
| 0x004 | HA Wrt byte enable parity |
| 0x008 | Corrected patrol scrub |
| 0x010 | Uncorrected patrol scrub |
| 0x020 | Corrected spare |
| 0x040 | Uncorrected spare |
IVB additionally defines 0x080 "Corrected memory read error" and 0x100 "iMC, WDB, parity errors" (mce-intel-ivb.c:59-69).
1.3.4 Haswell QPI banks 5/20/21 (mce-intel-haswell.c:69-83)
| MSC | Name |
|---|---|
| 0x02 | QPI physical layer detected drift buffer alarm |
| 0x03 | QPI physical layer detected latency buffer rollover |
| 0x10 | QPI link layer detected control error from R3QPI |
| 0x11 | Rx entered LLR abort state on CRC error |
| 0x12 | Unsupported or undefined packet |
| 0x13 | QPI link layer control error |
| 0x15 | RBT used un-initialized value |
| 0x20 | QPI in-band reset but aborted initialization |
| 0x21 | Link failover data self healing |
| 0x22 | Phy detected in-band reset (no width change) |
| 0x23 | Link failover clock failover |
| 0x30 | Rx detected CRC error - successful LLR after Phy re-init |
| 0x31 | Rx detected CRC error - successful LLR without Phy re-init |
1.3.5 Skylake UPI banks 5/12/19 (mce-intel-skylake-xeon.c:65-102)
| MSC | Name | Class |
|---|---|---|
| 0x00 | UC Phy Initialization Failure | UC |
| 0x01 | UC Phy detected drift buffer alarm | UC |
| 0x02 | UC Phy detected latency buffer rollover | UC |
| 0x10 | UC LL Rx detected CRC error: unsuccessful LLR | UC |
| 0x11 | UC LL Rx unsupported or undefined packet | UC |
| 0x12 | UC LL or Phy control error (further broken down in upi_0x12[]) | UC |
| 0x13 | UC LL Rx parameter exchange exception | UC |
| 0x1F | UC LL detected control error from link-mesh interface | UC |
| 0x20 | COR Phy initialization abort | CE |
| 0x21 | COR Phy reset | CE |
| 0x22 | COR Phy lane failure, recovery in x8 width | CE |
| 0x23 | COR Phy L0c error corrected without Phy reset | CE |
| 0x24 | COR Phy L0c error triggering Phy Reset | CE |
| 0x25 | COR Phy L0p exit error corrected with Phy reset | CE |
| 0x30 | COR LL Rx detected CRC error - successful LLR without Phy Reinit | CE |
| 0x31 | COR LL Rx detected CRC error - successful LLR with Phy Reinit | CE |
1.3.6 Skylake M2M banks 7/8 (mce-intel-skylake-xeon.c:145-155)
| MSC | Name |
|---|---|
| 16 | MscodDataRdErr |
| 18 | MscodPtlWrErr |
| 19 | MscodFullWrErr |
| 20 | MscodBgfErr |
| 21 | MscodTimeout |
| 22 | MscodParErr |
| 23 | MscodBucket1Err |
1.3.7 Icelake iMC, 6 code pages (mce-intel-i10nm.c:155-246)
Main page 0 (imc_0[]):
| MSC | Name |
|---|---|
| 0x01 | Address parity error |
| 0x02 | Data parity error |
| 0x03 | Data ECC error |
| 0x07 | Transaction ID parity error |
| 0x08 | Corrected patrol scrub error |
| 0x10 | Uncorrected patrol scrub error |
| 0x20 | Corrected spare error |
| 0x40 | Uncorrected spare error |
| 0x80 | Corrected read error |
| 0xA0 | Uncorrected read error |
| 0xC0 | Uncorrected metadata |
Page 2 (imc_2[]): DDR4/HBM command/address parity. Page 8 (imc_8[]): DDR-T scheduler/CMI/TME error set.
1.3.8 Granite Rapids MCCHAN (mce-intel-granite.c:107-157)
7 pages, covering banks 13-24:
| Page | Typical errors |
|---|---|
| 0 | Address Parity, CMI Wr data/BE/MAC parity, Patrol/Spare, Demand/Underfill Read errors, Poison read, Read 2LM MetaData |
| 1 | WDB Read Parity/ECC/BE, DDR Link Fail, Illegal opcode |
| 2 | DDR CAParity or WrCRC |
| 4 | Scheduler address parity |
| 8 | MC Internal Errors, MCTracker Address RF parity |
| 32, 33 | sCH1 copies of pages 0/1 |
1.3.9 Cross-generation: PCU MCA 0x402/0x403/0x406/0x407
Present on all Intel CPUs from Haswell onward:
| MCA | Name |
|---|---|
| 0x402 | PCU Internal Errors |
| 0x403 | Other UBOX / VCU Internal Errors |
| 0x406 | Intel TXT Errors |
| 0x407 | Other UBOX Internal Errors |
1.4 AMD K8 northbridge extended errors (mce-amd-k8.c:71-91)
bank 4 (Northbridge) status[16:19]:
| ExtErr | Name | Severity | System impact |
|---|---|---|---|
| 0 | RAM ECC error | CE/UE | DRAM data error |
| 1 | CRC error | CE | Link CRC; retryable |
| 2 | Sync error | UE | Link protocol error |
| 3 | Master abort | UE | No response on the bus |
| 4 | Target abort | UE | Target device error |
| 5 | GART error | CE | Graphics address remapping table (ignored on K8) |
| 6 | RMW error | CE | Read-modify-write failure |
| 7 | Watchdog error | UE | Link watchdog |
| 8 | RAM Chipkill ECC error | CE | Multi-symbol DRAM ECC |
| 9 | DEV Error | UE | Device error |
| 10 | Link Data Error | CE | HT link data |
| 11 | Link Protocol Error | UE | HT protocol error |
| 12 | NB Array Error | CE | Northbridge array |
| 13 | DRAM Parity Error | UE | DRAM parity |
| 14 | Link Retry | CE | Link retry |
| 15 | Table Walk Data Error | UE | Page-table walk data |
| 16 | L3 Cache Data Error | CE | L3 data |
| 17 | L3 Cache Tag Error | CE | L3 tag |
| 18 | L3 Cache LRU Error | CE | L3 LRU bits |
K8 generic patterns (decode_k8_generic_errcode()):
- (ec & 0xfff0) == 0x0010 → TLB error (TT + Level)
- (ec & 0xff00) == 0x0100 → Memory/Cache (memtrans + tx + level)
- (ec & 0xf800) == 0x0800 → Bus (PP + TO + memtrans + level)
K8 status high bits: highbits[] includes bit 29 UC, bit 25 PCC, bit 14 CE, bit 13 UE, and bit 8 (error found by the scrubber).
1.5 AMD SMCA error families (Zen and later)
SMCA routes 24 bank types to their tables through smca_hwid_mcatypes[] (mce-amd-smca.c:730-804). The subsections below summarize every SMCA bank and its XEC values:
1.5.1 SMCA_LS / LS_V2 — Load Store Unit
| Class | XEC range | Representative errors |
|---|---|---|
| LS v1 | 0…26 | Load queue parity, Store queue parity, Miss address buffer payload parity, L1 TLB parity, DC Tag error 1-6, Internal error 1-2, Sys Read data error, DC data error 1-3 (poison consumption), L2 TLB parity, PDC parity, L2 fill data, SCB cache state/address/data error, WCB SystemReadDataError, Hardware Asserts |
| LS v2 | 0…23 | Same as above + ECC/poison on probe/victim/load/RMW store; EMEM read ECC; TLB1/2/PWC/STQ/LDQ/MAB/SCB entry/WCB/SRB/EMEM data-mask parity; poisoned line in SCB |
1.5.2 SMCA_IF — Instruction Fetch(XEC 0…18)
microtag probe port parity, IC microtag/full tag multi-hit, IC full tag/data array parity, PRQ parity, L0/L1/L2 ITLB parity, BPQ snoop parity thread 0/1, BP L1/L2 BTB multi-hit, L2 Cache Response Poison, System Read Data error, Hardware Assertion, L1/L2-TLB Multi-Hit, BSR Parity, CT MCE。
1.5.3 SMCA_L2_CACHE(XEC 0…5)
L2M Tag Multiple-Way-Hit, L2M Tag/State Array ECC, L2M Data Array ECC, Hardware Assert, SDP Read Response Parity, programmable state machine error。
1.5.4 SMCA_DE — Decoder(XEC 0…9)
μop cache tag/data array parity, IBB Register File, μop queue, instruction dispatch queue, fetch address FIFO, Patch RAM data/sequencer, μop fetch queue, Hardware Assertion MCA。
1.5.5 SMCA_EX — Execution(XEC 0…13)
Watchdog timeout, PRF parity, flag register parity, immediate displacement register parity, AG payload, EX payload, checkpoint queue, retire dispatch/status queue, scheduler queue, branch buffer queue, Hardware Assertion, Spec/Retire Map parity。
1.5.6 SMCA_FP — Floating Point(XEC 0…7)
PRF parity, Freelist parity, schedule queue, NSQ, retire queue, status register file, hardware assertion, physical K mask register file。
1.5.7 SMCA_L3_CACHE(XEC 0…9)
Shadow tag macro ECC/multi-way-hit, L3M tag ECC/multi-way-hit, L3M data ECC, SDP Parity from XI, L3 victim queue Data Fabric error, Hardware Assertion, XI WCB Parity Poison Creation, DSM action。
1.5.8 SMCA_CS / CS_V2 / CS_V2_QUIRK — Coherent Slave
| Variant | XEC | Errors |
|---|---|---|
| CS v1 | 0…8 | Illegal request, Address violation, Security violation, Illegal response, Unexpected response, Request/Probe Parity, Read Response Parity, Atomic request parity, Probe Filter ECC |
| CS v2 | 0…20 | CS v1 + SDP read response no match/Unexpected RETRY, Counter over/underflow, Illegal/Address/Security violation on the no-data channel, Hardware Assert, Shadow Tag Protocol/ECC/Transaction Error |
| CS v2 QUIRK (Genoa erratum 1384) | 0…17 | CS v2 reordered, with the shadow-tag entries removed |
1.5.9 SMCA_PIE(XEC 0…8)
Hardware assert, Register security violation, Link error, Poison data consumption, deferred error detected in DF, Watch Dog Timer, SRAM ECC in CNLI, Register access during DF Cstate, DSM Error。
1.5.10 SMCA_UMC / UMC_QUIRK / UMC_V2 — Unified Memory Controller
| Variant | XEC | Errors |
|---|---|---|
| UMC | 0…16 | DRAM ECC, Data poison on DRAM, SDP parity, Advanced peripheral bus error, Command/address parity, Write data CRC, DCQ SRAM ECC, AES SRAM ECC, ECS Row Error, ECS Error, UMC Throttling, Read CRC, RFM SRAM ECC |
| UMC_QUIRK (Turin X3D) | 0…15 | DRAM On Die ECC, Data poison, SDP parity, Address/Command parity, HBM Write data parity, Consolidated SRAM ECC, Rdb SRAM ECC, Thermal throttling, HBM Read Data Parity, UMC FW Error, SRAM Parity, HBM CRC |
| UMC_V2 (HBM) | 0…11 | DRAM ECC, Data poison, SDP parity, Address/Command parity, Write data parity, DCQ SRAM ECC, Read data parity, Rdb SRAM ECC, RdRsp SRAM ECC, LM32 MP errors |
1.5.11 SMCA_MA_LLC(XEC 0…6)
Counter over/underflow, Write Data Parity, Read Response Parity, Cache Tag ECC Macro 0/1, Cache Data ECC。
1.5.12 SMCA_PB(XEC 0)
Parameter Block RAM ECC。
1.5.13 SMCA_PSP / PSP_V2
| Variant | XEC | Errors |
|---|---|---|
| PSP v1 | 0 | PSP RAM ECC/parity |
| PSP v2 | 0…25 | High/Low SRAM ECC/parity, Instruction Cache Bank 0/1 ECC/parity, Instruction Tag Ram 0/1 parity, Data Cache Bank 0…3 ECC/parity, Data Tag Bank 0…3 parity, Dirty Data Ram parity, TLB Bank 0/1 parity, System Hub Read Buffer ECC/parity, FUSE IP SRAM ECC/parity, PCRU FUSE SRAM ECC/parity, SIB SRAM parity, mpASP SECEMC, mpASP A5 Hang, SIB WDT |
1.5.14 SMCA_SMU / SMU_V2
| Variant | XEC | Errors |
|---|---|---|
| SMU v1 | 0 | SMU RAM ECC/parity |
| SMU v2 | 0…11, 58…61 | High/Low SRAM ECC, Data Cache/Tag Bank A/B, Instruction Cache/Tag Bank A/B, System Hub Read Buffer, PHY RAS ECC + GFX Sub-IP CE/fatal/poison/other |
1.5.15 SMCA_MP5(XEC 0…10)
High/Low SRAM ECC, Data Cache/Tag Bank A/B, Instruction Cache/Tag Bank A/B, Fuse SRAM ECC。
1.5.16 SMCA_MPDMA(XEC 0…50)
Main SRAM [31:0]/[63:32]/[95:64]/[127:96] bank ECC/parity;Data/Instruction Cache/Tag Bank A/B;System Hub Read Buffer ECC;MPDMA TVF DVSEC Memory ECC;TVF MMIO Mailbox0/1;Doorbell Memory;SDP Slave/Master Memory 0…7;SDP Watchdog Timer;PTE Command FIFO;Hub Data FIFO;Internal Data FIFO;Command Memory DMA/Internal。
1.5.17 SMCA_NBIO(XEC 0…5)
ECC/Parity, PCIE error, External SDP ErrEvent, SDP Egress Poison, Internal Poison, Internal system fatal。
1.5.18 SMCA_PCIE / PCIE_V2
| Variant | XEC | Errors |
|---|---|---|
| PCIE | 0…4 | CCIX PER Message logging, CCIX Read/Write Response Non-Data Error, CCIX Read Response Data Error, CCIX Non-okay write response with data error |
| PCIE v2 | 0 | SDP Data Parity Error |
1.5.19 SMCA_XGMI_PCS(XEC 0…28)
Data Loss, Training, Flow Control Acknowledge, Rx Fifo Underflow/Overflow, CRC, BER Exceeded, Tx Vcid Data, Replay Buffer Parity, Data Parity, Replay Fifo Overflow/Underflow, Elastic Fifo Overflow, Deskew, Flow Control CRC, Data Startup Limit, FC Init Timeout, Recovery Timeout, Ready Serial Timeout/Attempt, Recovery Attempt/Relock Attempt, Replay Attempt, Sync Header, Tx/Rx Replay Timeout, LinkSub Tx/Rx Timeout, Rx CMD Pocket。
1.5.20 SMCA_XGMI_PHY / WAFL_PHY / GMI_PHY(XEC 0…3)
RAM ECC, ARC instruction buffer parity, ARC data buffer parity, PHY APB error。
1.5.21 SMCA_NBIF / SHUB(XEC 0…3)
Timeout from GMI, SRAM ECC, NTB Error Event, SDP Parity。
1.5.22 SMCA_SATA(XEC 0…7)
Parity errors on ports 0…7.
1.5.23 SMCA_USB (XEC 0…5)
S0 RAM0/1/2, PHY RAM0/1 parity/ECC, AXI Slave Response errors.
1.5.24 SMCA_USR_DP(XEC 0…30)
Mst CMD/Rx FIFO/Deskew/Detect Timeout/FlowControl/DataValid FIFO、Mac LinkState、Deskew、Init Timeout/Attempt、Recovery Timeout/Attempt、Eye Training Timeout、Data Startup Limit、LS0 Exit、PLL powerState Update Timeout、Rx FIFO、Lcu、Conv CECC/UECC、Rx DataLoss、Replay CECC/UECC、CRC、BER Exceeded、FC Init Timeout/Attempt、Replay Timeout/Attempt、Replay Underflow/Overflow。
1.5.25 SMCA_USR_CP(XEC 0…12)
Packet Type, Rx FIFO, Deskew, Rx Detect Timeout, Data Parity, Data Loss, Lcu, HB1/HB2 Handshake Timeout, Clk Sleep/Wake Rsp Timeout, Reset Attack, Remote Link Fatal。
1.5.26 SMCA_GMI_PCS(XEC 0…31)
Data Loss, Training, Replay Parity, Rx Fifo Underflow/Overflow, CRC, BER Exceeded, Tx Fifo Underflow, Replay Buffer Parity, Tx Overflow, Replay Fifo Overflow/Underflow, Elastic Fifo Overflow, Deskew, Offline, Data Startup Limit, FC Init Timeout, Recovery Timeout, Ready Serial Timeout/Attempt, Recovery Attempt, Recovery Relock Attempt, Deskew Abort, Rx Buffer, Rx LFDS Fifo Overflow/Underflow, LinkSub Tx/Rx Timeout, Rx CMD Packet, LFDS Training/FC Init Timeout, Data Loss。
1.5.27 AMD generic error classification (decode_amd_errcode(), mce-amd.c:68-123)
| Pattern | Meaning | Classification |
|---|---|---|
| UC + PCC | Processor context corrupt | System Fatal error |
| UC + RIPV | Restartable | Uncorrected, software restartable |
| UC only | Containable | Uncorrected, software containable |
| DEFERRED | Deferred by the platform | Deferred error, no action required |
| Other | Corrected | Corrected error, no action required |
| POISON | Poisoned data was read | Poison consumed |
| TCC | Task context corrupt | Task_context_corrupt |
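The first five rows of this table form a decision chain, sketched below in Python. The evaluation order (UC first, then PCC, RIPV, DEFERRED) is an assumption read off the table, and the POISON/TCC annotations are left out; `amd_severity` and the constant names are illustrative, with bit positions taken from the status-bit tables in this document and RIPV assumed to be bit 0 of MCG_STATUS.

```python
MCI_STATUS_UC = 1 << 61        # uncorrected
MCI_STATUS_PCC = 1 << 57       # processor context corrupt
MCI_STATUS_DEFERRED = 1 << 44  # AMD deferred error
MCG_STATUS_RIPV = 1 << 0       # restart IP valid (assumed bit position)

def amd_severity(status: int, mcgstatus: int) -> str:
    """Map status flags to the classification strings in the table."""
    if status & MCI_STATUS_UC:
        if status & MCI_STATUS_PCC:
            return "System Fatal error"
        if mcgstatus & MCG_STATUS_RIPV:
            return "Uncorrected, software restartable"
        return "Uncorrected, software containable"
    if status & MCI_STATUS_DEFERRED:
        return "Deferred error, no action required"
    return "Corrected error, no action required"
```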
Generic pattern macros (mce-amd.c):
- INT_ERROR(x) = ((x) & 0xF4FF) == 0x0400 → "Internal 'reserved/hardware assert/reserved/reserved'"
- TLB_ERROR(x) = ((x) & 0xFFF0) == 0x0010 → TLB tx/level
- MEM_ERROR(x) = ((x) & 0xFF00) == 0x0100 → Memory mem-tx/tx/level
- BUS_ERROR(x) = ((x) & 0xF800) == 0x0800 → Bus PP/TO/mem-tx/level
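The four mask tests translate directly into Python. The masks and match values below are the ones quoted above; the function name and the order in which the tests are tried are assumptions made for this sketch.

```python
def classify_amd_errcode(ec: int) -> str:
    """Apply the INT/TLB/MEM/BUS mask tests to a 16-bit error code."""
    if (ec & 0xF4FF) == 0x0400:
        return "internal"
    if (ec & 0xFFF0) == 0x0010:
        return "tlb"
    if (ec & 0xFF00) == 0x0100:
        return "memory"
    if (ec & 0xF800) == 0x0800:
        return "bus"
    return "unknown"
```

Note that the INT_ERROR mask ignores bits 8, 9 and 11, so a code such as 0x0C00 matches both the internal and the bus pattern; which one wins depends on the test order, which is why the order here is flagged as an assumption.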
UMC location decoding (decode_smca_error()): for Family 0x19, Model 0x90…0x9f the decoder reports memory_die_id; otherwise channel = IPID >> 20 and csrow = synd & 0x7.
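A minimal sketch of the non-HBM branch, using the two expressions quoted above. Masking IPID to its low 32 bits before shifting is an assumption added here (so that the upper hardware-ID bits do not leak into the channel number); `umc_location` is a hypothetical name.

```python
def umc_location(ipid: int, synd: int) -> tuple:
    """Return (channel, csrow) for a UMC error on the non-HBM path."""
    instance_id = ipid & 0xFFFFFFFF   # assumption: only the low 32 bits are used
    channel = instance_id >> 20
    csrow = synd & 0x7
    return channel, csrow
```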
1.6 Zhaoxin KH-50000 error codes (mce-zhaoxin-kh50000.c)
| Bank type | Sub-error source | Code | Name |
|---|---|---|---|
| CPU bank | status[25:29] | 0, 1 | Unknown |
| CPU bank | | 2 | Machine hung error (fatal) |
| CPU bank | | 3 | Undefined ucode address error |
| Cache (PL2 / LLC) | status[24:25] or status[25:26] | 0 | Unknown |
| Cache | | 1 | ECC single bit error for data part in the same line |
| Cache | | 2 | ECC single bit error for different line |
| Cache | | 3 | ECC multi bit error for data part |
| PCIe | status[16:23] | 0 | Fatal |
| PCIe | | 1 | Non-fatal |
| PCIe | | 2 | Correctable |
| IOD ZDI / ZPI | XEC | 0…23 | Receiver overflow (TL), FC protocol, Surprise down, DLL protocol, Replay timer timeout, REPLAY_NUM Rollover, Bad DLLP, Bad TLP, Receiver error (PHY), Phy training, Link-width down-mode (X32X24/X16X12/X8/X4/X2), Link-speed down-mode (GEN4/GEN3/GEN2) |
| CCD ZDI | XEC | 0…17 | Receive overflow, PHY training, FC protocol, Surprise down, DLLM protocol, DLLM replay timeout, DLLM replay number rollover, Bad DLLP, Bad TLP, Gen2/3/4 unreliable, X2/X4/X8/X16X12/X32X24 unreliable |
| SVID | status[24:31] | 0 | No error |
| SVID | | 1 | SVID Resend fail |
| SVID | | 2 | VRM Over current |
| SVID | | 3 | VRM Over temp |
| SVID | | 4 | VRM Parity |
| DRAM | mca_err_code | 0x1 | DVAD Error |
| DRAM | | 0x5 | Parity Error |
| DRAM | mem_err_code | 0 | Generic undefined request |
| DRAM | | 1 | Memory read |
| DRAM | | 2 | Memory write |
| DRAM | | 3 | Address/Command |
| DRAM | | 4 | Memory scrubbing |
| DRAM | | 5 | Data poison enable (DRAM, master normal read) |
| DRAM | | 6 | Data poison enable (DRAM, patrol read) |
| DRAM | | 7 | Key hit error |
| DRAM | mem_specific | 0 | Unknown |
| DRAM | | 1 | Single bit ECC |
| DRAM | | 2 | Multiple bit ECC |
| DRAM | | 3 | Command parity |
| DRAM | | 4 | CRC |
| DRAM | | 5 | Parity retry failed |
| DRAM | | 6 | CRC retry failed |
| DRAM | | 7…15 | DVAD decode (CPUIF CHA0/CHB0/CHA1/CHB1, MCUTRF, GMINT…) |
| HIF (CXL) | status[16:23] | 0 | Unknown |
| HIF | | 1 | HIF dvad error |
| HIF | | 2 | SNT multi bit ecc |
| HIF | | 3 | SNT single bit ecc |
| HIF | | 4 | CXL decpoison uc |
| HIF | | 5 | CXL decpoison ce |
| HIF | | 6 | CXL parity |
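1.7 System impact classification of MCEs
| Class | Condition | Actual system behavior |
|---|---|---|
| Cache (CE) | RR=10, UC=0 | Corrected transparently; invisible to users; green tracking; repeated hits on the same line → yellow warning |
| Cache (UC) | UC=1 | #MC → the kernel kills the process or panics |
| TLB (UC + PCC) | 0x001x | Context corrupted → panic |
| Bus / interconnect (UC) | 0x08xx..0x0Fxx | May corrupt the state of other cores; almost always fatal |
| Internal (UC) | 0x0400 / 0x04xx..0x07xx | Unclassified internal silicon error; AMD SMCA "Hardware Assert" appears across multiple banks |
| Memory (CE) | 0x008x..0x00Fx / 0x01xx (AMD MEM_ERROR) | DRAM single-bit correction; transparent |
| Memory (UE) | UC=1 | Kernel MCE → SIGBUS / kill; may trigger mirror failover |
| Deferred (AMD) | bit 44 | Deferred by the platform; surfaces only when the data is consumed |
| Poison Consumed (AMD) | bit 43 | CPU read poisoned data → the kernel poisons the page |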
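2. EDAC/MC (Memory Controller) errors
2.1 Error type enumeration (enum hw_event_mc_err_type, ras-events.h:78-84)
| Enum | Integer | Text | Log level | System impact | rasdaemon behavior |
|---|---|---|---|---|---|
| HW_EVENT_ERR_CORRECTED | 0 | Corrected | LOG_ERR | DRAM single-bit correction, transparent; a CE storm is an early warning of UC errors | Insert into SQLite + rate statistics + page PFA (optional) + trigger MC_CE_TRIGGER |
| HW_EVENT_ERR_UNCORRECTED | 1 | Uncorrected | LOG_CRIT | Data corruption → MCE / SIGBUS | Insert into SQLite + trigger MC_UE_TRIGGER; no automatic retirement |
| HW_EVENT_ERR_DEFERRED | 2 | Deferred | LOG_CRIT | Data corruption uncertain (the data may not have been consumed) | Insert into SQLite; no script triggered |
| HW_EVENT_ERR_FATAL | 3 | Fatal | LOG_EMERG | System integrity compromised | Insert into SQLite; no retirement |
| HW_EVENT_ERR_INFO | 4 | Info | LOG_DEBUG | No actual error (init/scrub) | Insert into SQLite; no other action |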
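2.2 Location granularity
ras-mc-handler.c:240-264 parses the (top_layer, middle_layer, lower_layer) triple, which covers:
- chip-select / row / bank / DIMM / channel
- the exact mapping depends on the EDAC driver (i10nm, amd64, skx, sb-edac, etc.)
2.3 Page / row retirement mechanism (ras-page-isolation.c)
| Policy | Default | Trigger condition | sysfs write |
|---|---|---|---|
| Page CE PFA | Soft offline | CE count of a 4 KiB page ≥ PAGE_CE_THRESHOLD (default 50) within the PAGE_CE_REFRESH_CYCLE (24h) window | /sys/devices/system/memory/soft_offline_page |
| Row CE PFA | Off | CE count of a whole row ≥ ROW_CE_THRESHOLD (default 50) | Same as above |
| Mode | soft | `PAGE_CE_ACTION=off | account |
Key fact: rasdaemon does not call hwpoison directly; it only writes to sysfs, and the kernel's mm/memory-failure.c performs the actual page poisoning.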
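The page CE policy can be pictured with a simplified counter. This sketch assumes a plain fixed window that resets once the refresh cycle has elapsed, which may differ from how ras-page-isolation.c actually ages its counters; `PageCeTracker` is a hypothetical class, and it only reports when the threshold is crossed instead of writing to sysfs.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as in the table above

class PageCeTracker:
    """Count corrected errors per page inside a refresh window."""

    def __init__(self, threshold=50, cycle_s=24 * 3600):
        self.threshold = threshold
        self.cycle_s = cycle_s
        self.pages = {}  # page address -> (window_start, count)

    def record(self, addr, now):
        """Record one CE; return True when the page reaches the threshold."""
        page = addr & ~(PAGE_SIZE - 1)
        start, count = self.pages.get(page, (now, 0))
        if now - start > self.cycle_s:   # window expired: start over
            start, count = now, 0
        count += 1
        self.pages[page] = (start, count)
        return count >= self.threshold
```

When `record()` returns True, a daemon following the table would write the page's physical address to /sys/devices/system/memory/soft_offline_page and leave the actual migration to the kernel.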
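3. PCIe AER error codes
ras-aer-handler.c:27-60 defines aer_cor_errors[] (aer_uncor_errors[] comes from the same source and uses the same bit-numbering scheme). bitfield_msg() translates the status register into text.
3.1 Correctable Errors (CORR_ERR_STATUS)
| Bit | Name | System impact | AER handling |
|---|---|---|---|
| 0 | Receiver Error | PHY recovery error, link retrain; no data loss | Automatic |
| 6 | Bad TLP | TLP CRC/ECRC error; dropped and replayed; throughput drops if frequent | Automatic replay |
| 7 | Bad DLLP | DLLP CRC error; retried | Automatic retry |
| 8 | REPLAY_NUM Rollover | Repeated TLP replays; poor signal integrity | Automatic |
| 12 | Replay Timer Timeout | TLP retransmission timed out; link retrain; brief stall | Automatic retrain |
| 13 | Advisory Non-Fatal | Non-fatal with vendor-defined semantics | Automatic |
| 14 | Corrected Internal Error | Corrected by device-internal ECC | Automatic |
| 15 | Header Log Overflow | Diagnostic information lost | Automatic |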
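Turning a status register into names is a plain bit scan. The sketch below uses the correctable-error bit assignments from the table above; `decode_bits` is a hypothetical stand-in for what bitfield_msg() does, and the "BIT<n>" fallback and comma-separated output format are assumptions.

```python
AER_COR_ERRORS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

def decode_bits(status: int, table: dict) -> str:
    """Name every set bit of a 32-bit status register."""
    names = [table.get(bit, f"BIT{bit}")
             for bit in range(32) if status & (1 << bit)]
    return ", ".join(names)
```

The same function works for the uncorrectable table by passing a different dictionary.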
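3.2 Uncorrectable Non-Fatal (UNCORR_ERR_STATUS)
| Bit | Name | System impact |
|---|---|---|
| 4 | Data Link Protocol | Abnormal L0→Recovery/Detect link transition; transactions retried; may hang |
| 5 | Surprise Link Down | Device hot-removed or physically disconnected; all in-flight transactions aborted; downstream devices disappear |
| 12 | Poisoned TLP | TLP carries the EP bit (data already corrupted); for DMA writes the requester is notified of the failure; read-completion data is discarded; may lead to EIO / page poisoning |
| 13 | Flow Control Protocol | Receiver flow-control violation; retrain required; may stall |
| 14 | Completion Timeout | Requester timed out waiting for a completion; UR returned; device may hang |
| 15 | Completer Abort | Target cannot complete the request; CA completion; caller sees EIO |
| 16 | Unexpected Completion | Completion does not match any outstanding request; switch / endpoint bug |
| 17 | Receiver Overflow | Receiver buffer overflow; FCP / DLLP error; may be fatal |
| 18 | Malformed TLP | Malformed TLP structure; transaction fails |
| 19 | ECRC | End-to-end CRC failure; in-flight data corrupted; strong indicator of a PHY problem |
| 20 | Unsupported Request | Device does not support the request type; UR completion |
| 21 | ACS Violation | Access Control Services blocked a P2P transaction; security feature triggered |
| 22 | Uncorrected Internal | Uncorrectable error inside the device; the device may go silent |
| 23 | MC Blocked TLP | Multicast TLP blocked by MC rules; functional no-op |
| 24 | AtomicOp Egress Blocked | Atomic operation blocked at egress; request fails |
| 25 | TLP Prefix Blocked | Vendor prefix dropped; request fails |
| 26 | Poisoned TLP Egress Blocked | Switch refuses to forward a poisoned TLP; requester receives UR/CA |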
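3.3 Uncorrectable Fatal
rasdaemon keeps no separate fatal table; it reuses the same aer_uncor_errors[] bit names as for non-fatal errors. Whether an error is fatal is decided on the kernel AER side, based on the device's Uncorrectable Error Severity register.
Typical fatal classes (PCIe specification): Training Error, Link Below Speed, Data Link Protocol Error, Surprise Down, Receiver Overflow, Uncorrected Internal.
3.4 rasdaemon AER behavior
- logs to syslog (severity mapping: CE→LOG_ERR, UE Non-Fatal→LOG_CRIT, UE Fatal→LOG_EMERG)
- decodes status bits, plus the TLP header when available
- looks up vendor/device names through libpci
- writes to SQLite
- reports to ABRT
- forwards to OpenBMC Unified SEL (IPMI OEM SEL)
- Ampere BMC OEM SEL
- triggers the AER_CE_TRIGGER / AER_UE_TRIGGER user scripts
Note: rasdaemon does not trigger AER recovery or a slot reset; that is the job of the kernel AER driver (drivers/pci/pcie/aer.c in current kernels, drivers/pci/pcie/aer/aerdrv.c in older ones).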
4. ARM error codes
rasdaemon consumes only the UEFI 2.9 § N.2.4.4 Processor Error Section (PEI); it does not handle the SMMU / GIC / CCI / CCN-specific APEI-GHES sections.
4.1 Error type bits (ras-arm-handler.c:36-40)
| Name | Bit | Class |
|---|---|---|
| ARM_CACHE_ERROR | BIT 1 | Cache error |
| ARM_TLB_ERROR | BIT 2 | TLB error |
| ARM_BUS_ERROR | BIT 3 | Bus error |
| ARM_VENDOR_ERROR | BIT 4 | Vendor-defined (bypasses standard decoding) |
4.2 Syndrome fields (PEI error_info, 64-bit)
| Field | Bits | Values |
|---|---|---|
| Transaction type | 16-17 | Instruction / Data Access / Generic |
| Operation type | 18-21 | Cache = 11 kinds / TLB = 9 kinds / Bus = 7 kinds |
| Level | 22-24 | 0-7 (cache level / TLB level / bus affinity level) |
| Proc context corrupt | 25 | bool |
| Corrected | 26 | bool |
| Precise PC | 27 | bool |
| Restartable PC | 28 | bool |
| Participation type | 29-30 | Local-originated/Responded/Observed/Generic |
| Time-out | 31 | bool |
| Address space | 32-33 | External / Internal / Unknown / Device Memory Access |
| Mem attributes | 34-42 | 9-bit raw value |
| Access mode | 43 | Normal / Secure |
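Extracting these fields is a matter of shifting and masking. The sketch below takes the bit ranges from the table as given, without modelling which fields apply to which error type; the `FIELDS` map and `decode_error_info` are illustrative names, not rasdaemon symbols.

```python
# name -> (lowest bit, width in bits), taken from the table above
FIELDS = {
    "transaction_type": (16, 2),
    "operation_type": (18, 4),
    "level": (22, 3),
    "proc_context_corrupt": (25, 1),
    "corrected": (26, 1),
    "precise_pc": (27, 1),
    "restartable_pc": (28, 1),
    "participation_type": (29, 2),
    "timeout": (31, 1),
    "address_space": (32, 2),
    "mem_attributes": (34, 9),
    "access_mode": (43, 1),
}

def decode_error_info(error_info: int) -> dict:
    """Split a 64-bit error_info value into its raw field values."""
    return {name: (error_info >> low) & ((1 << width) - 1)
            for name, (low, width) in FIELDS.items()}
```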
4.3 Flags
| Flag | Bit | Meaning |
|---|---|---|
| First error | 0 | First of several errors |
| Last error | 1 | Last of several errors |
| Propagated | 2 | Error propagated from its source |
| Overflow | 3 | Overflow |
4.4 Severity (GHES_SEV_*, ras-events.h:94-99)
| Value | Name | Text | System impact |
|---|---|---|---|
| 0 | GHES_SEV_NO | Informational | No impact |
| 1 | GHES_SEV_CORRECTED | Corrected | Corrected; CPU error count++ (HAVE_CPU_FAULT_ISOLATION) |
| 2 | GHES_SEV_RECOVERABLE | Recoverable | Recovered; same as above |
| 3 | GHES_SEV_PANIC | Fatal | Fatal; logged only |
4.5 Vendor-specific decoding
Enabled only on the Ampere path: decode_amp_payload0_err_regs() (non-standard-ampere.c, guarded by HAVE_AMP_NS_DECODE). Other vendor blobs go to display_raw_data() and are printed as raw hex.
4.6 Behavior and limitations
- writes to SQLite and ABRT
- if HAVE_CPU_FAULT_ISOLATION is set and sev ∈ {CORRECTED, RECOVERABLE}: calls count_errors() and feeds the count to ras_record_cpu_error()
- does not itself panic, kill processes, isolate pages, or offline CPUs; all of that is done by the kernel before rasdaemon sees the event
5. CXL error codes
5.1 Event severity (ras-cxl-handler.c:492-498, CXL 3.0 §8.2.9.2.2 Table 8-49)
| Name | Value |
|---|---|
| CXL_EVENT_TYPE_INFO | 0x00 |
| CXL_EVENT_TYPE_WARN | 0x01 |
| CXL_EVENT_TYPE_FAIL | 0x02 |
| CXL_EVENT_TYPE_FATAL | 0x03 |
5.2 AER Uncorrectable (ras-cxl-handler.c:274-317)
| Bit | Name | Component | System impact |
|---|---|---|---|
| 0 | CXL_AER_UE_CACHE_DATA_PARITY | CXL.cache link | Link data parity error; may cause poison |
| 1 | CXL_AER_UE_CACHE_ADDR_PARITY | Link | Wrong address |
| 2 | CXL_AER_UE_CACHE_BE_PARITY | Link | Partial write corrupted |
| 3 | CXL_AER_UE_CACHE_DATA_ECC | Device cache | Uncorrectable ECC; cacheline poison propagated to the host |
| 4 | CXL_AER_UE_MEM_DATA_PARITY | CXL.mem path | Data parity error |
| 5 | CXL_AER_UE_MEM_ADDR_PARITY | CXL.mem path | Misdirected access; cross-host contamination |
| 6 | CXL_AER_UE_MEM_BE_PARITY | CXL.mem path | Partial write corrupted |
| 7 | CXL_AER_UE_MEM_DATA_ECC | CXL.mem | Uncorrectable ECC; permanent data loss; host receives poison |
| 8 | CXL_AER_UE_REINIT_THRESH | Link | Link re-initialization threshold reached; device briefly unavailable |
| 9 | CXL_AER_UE_RSVD_ENCODE | Link | Protocol-level error |
| 10 | CXL_AER_UE_POISON | Device peer | Poison received from the peer; may cause a host MCE on consumption |
| 11 | CXL_AER_UE_RECV_OVERFLOW | Link receiver | Buffer overflow; flits dropped |
| 14 | CXL_AER_UE_INTERNAL_ERR | Vendor-defined | Severity is vendor-defined |
| 15 | CXL_AER_UE_IDE_TX_ERR | IDE transmitter | Link integrity/confidentiality compromised |
| 16 | CXL_AER_UE_IDE_RX_ERR | IDE receiver | Authentication/decryption failure |
5.3 AER Correctable (ras-cxl-handler.c:290-330)
| Bit | Name | Component | System impact |
|---|---|---|---|
| 0 | CXL_AER_CE_CACHE_DATA_ECC | Device cache | Cache ECC corrected; monitor the rate |
| 1 | CXL_AER_CE_MEM_DATA_ECC | CXL.mem | Memory ECC corrected; trend warning |
| 2 | CXL_AER_CE_CRC_THRESH | Link | CRC threshold reached; signal-integrity alert |
| 3 | CXL_AER_CE_RETRY_THRESH | Link | Retry threshold reached; latency/performance impact |
| 4 | CXL_AER_CE_CACHE_POISON | Device cache | Cache poison received |
| 5 | CXL_AER_CE_MEM_POISON | CXL.mem | Memory poison received |
| 6 | CXL_AER_CE_PHYS_LAYER_ERR | PHY | PHY error at the peer |
5.4 General Media Event Record (GMER, CXL 3.1 §8.2.9.2.1.1 Table 8-45)
memory_event_type:
| Value | Name |
|---|---|
| 0 | ECC Error |
| 1 | Invalid Address |
| 2 | Data Path Error |
| 3 | TE State Violation |
| 4 | Scrub Media ECC Error |
| 5 | Adv Prog CME Counter Expiration |
| 6 | CKID Violation |
memory_event_sub_type (0-5): Not Reported / Internal Datapath / Media Link Cmd/CTL/Dat Training / Media Link CRC
transaction_type (0-8): Unknown / Host Read / Host Write / Host Scan Media / Host Inject Poison / Internal Media Scrub / Internal Media Management / Internal Media Error Check Scrub / Media Initialization
Descriptor flags: UNCORRECTABLE (bit0), THRESHOLD (bit1), POISON_LIST_OVERFLOW (bit2)
DPA flags: VOLATILE (bit0), NOT_REPAIRABLE (bit1)
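A record's enumerated fields and flag words can be rendered with small lookup tables. The values below are copied from the GMER lists above; the function names and the "Reserved" fallback are assumptions made for this sketch.

```python
GMER_EVENT_TYPES = [
    "ECC Error", "Invalid Address", "Data Path Error", "TE State Violation",
    "Scrub Media ECC Error", "Adv Prog CME Counter Expiration",
    "CKID Violation",
]
DESCRIPTOR_FLAGS = {0: "UNCORRECTABLE", 1: "THRESHOLD", 2: "POISON_LIST_OVERFLOW"}
DPA_FLAGS = {0: "VOLATILE", 1: "NOT_REPAIRABLE"}

def event_type_name(value: int) -> str:
    """Look up memory_event_type; out-of-range values are shown as Reserved."""
    if 0 <= value < len(GMER_EVENT_TYPES):
        return GMER_EVENT_TYPES[value]
    return "Reserved"

def flag_names(value: int, table: dict) -> list:
    """List the names of all set flag bits."""
    return [name for bit, name in table.items() if value & (1 << bit)]
```

A descriptor value of 0b011, for instance, marks an uncorrectable event that also crossed a threshold.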
5.5 DRAM Event Record (DER, CXL 3.1 §8.2.9.2.1.2 Table 8-46)
memory_event_type: 0=Media ECC / 1=Scrub Media ECC / 2=Invalid Address / 3=Data Path Error / 4=TE State Violation / 5=Adv Prog CME Counter Expiration / 6=CKID Violation
A DER carries the full geometry: channel / sub_channel / rank / bank_group / bank / row / column / nibble_mask / correction_mask.
The DER threshold event is the only CXL path in rasdaemon that triggers an automatic page offline (ras-cxl-handler.c:1244-1249 → ras_hw_threshold_pageoffline(hpa)); it requires HAVE_MEMORY_CE_PFA.
5.6 Memory Module Event Record (MMER, CXL 3.1 §8.2.9.2.1.3 Table 8-47)
event_type (0-8): Health Status / Media Status / Life Used / Temperature / Data Path / LSA / Unrecoverable Internal Sideband Bus / Memory Media FRU / Power Management Fault
event_sub_type (0-3): Not Reported / Invalid Config Data / Unsupported Config Data / Unsupported Memory Media FRU
health_status flags: MAINTENANCE_NEEDED / PERFORMANCE_DEGRADED / REPLACEMENT_NEEDED / MEM_CAPACITY_DEGRADED
media_status (0-7): Normal / Not Ready / Write Persistency Lost / All Data Lost / Write persistency will be lost on power loss / Imminent / All data loss imminent / All Data Loss Imminent
5.7 Memory Sparing Event Record (MSER, CXL 3.2 §8.2.10.2.1.4 Table 8-60)
flags: QUERY_RESOURCES (BIT0) / HARD_SPARING (BIT1) / DEVICE_INITIATED (BIT2)
Note: the MSER handler does not persist to SQLite (there is no #ifdef HAVE_SQLITE3 block); this is a known omission.
5.8 Common event header / Poison List
- Common Event Record Flags (per event): PERMANENT(2) / MAINT_NEEDED(3) / PERF_DEGRADED(4) / HW_REPLACE(5) / MAINT_OP_SUB_CLASS_VALID(6) / LD_ID_VALID(7) / HEAD_ID_VALID(8)
- Poison List Event (cxl_poison): Source ∈ {Unknown(0), External(1), Internal(2), Injected(3), Vendor(7)}; flags MORE / OVERFLOW / SCANNING
- Event Record Overflow (cxl_overflow): log_type ∈ {Info, Warn, Failure, Fatal}; N events were lost, leaving a blind spot
- Generic Event Record: 80 bytes of raw data; hdr_uuid identifies the decoder
5.9 CXL discovery / topology
- does not walk /sys/bus/cxl/
- keeps no in-memory component cache
- relies entirely on the memdev / host / serial / region / region_uuid / comp_id fields that the kernel fills into the trace payload
6. extlog (Extended Log) error types
ras-extlog-handler.c err_type():
| etype | Name | Class | System impact |
|---|---|---|---|
| 0 | unknown | Unclassified | - |
| 1 | no error | Info | None |
| 2 | single-bit ECC | CE | Corrected |
| 3 | multi-bit ECC | UE | Data corruption |
| 4 | single-symbol chipkill ECC | CE | Whole symbol corrected |
| 5 | multi-symbol chipkill ECC | UE | Multiple symbols corrupted; page offline / DIMM replacement |
| 6 | master abort | UE | Bus master abort |
| 7 | target abort | UE | Bus target abort |
| 8 | parity error | UE | Parity error |
| 9 | watchdog timeout | UE | Memory subsystem watchdog timeout |
| 10 | invalid address | UE | Access to an invalid address |
| 11 | mirror Broken | UE | Mirror pair broken → failover to the backup |
| 12 | memory sparing | INFO | Spare rank activated |
| 13 | scrub corrected error | CE | Corrected by the patrol scrubber |
| 14 | scrub uncorrected error | UE | UE found by the patrol scrubber |
| 15 | physical memory map-out event | INFO | Physical page mapped out |
err_severity(): recoverable (sev=0, LOG_CRIT) / fatal (sev=1, LOG_EMERG) / corrected (sev=2, LOG_ERR) / informational (sev=3, LOG_INFO)
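The severity mapping is small enough to express as a table. This sketch uses Python's standard `syslog` module (Unix only) for the level constants; the fallback for unknown values is an assumption, and the function is only an illustration of the mapping listed above, not the C implementation.

```python
import syslog

# sev value -> (text, syslog level), as listed above
EXTLOG_SEVERITY = {
    0: ("recoverable", syslog.LOG_CRIT),
    1: ("fatal", syslog.LOG_EMERG),
    2: ("corrected", syslog.LOG_ERR),
    3: ("informational", syslog.LOG_INFO),
}

def err_severity(sev: int) -> tuple:
    """Return (text, syslog level); unknown values fall back to LOG_DEBUG."""
    return EXTLOG_SEVERITY.get(sev, ("unknown", syslog.LOG_DEBUG))
```

Note that the numeric order of sev does not follow severity: 0 is recoverable and 2 is corrected, so comparisons such as `sev >= 1` would misclassify events.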
7. Non-Standard (vendor-defined) decoding
7.1 Generic GHES severity mapping
ras-non-standard-handler.c:
- GHES_SEV_NO → Informational
- GHES_SEV_CORRECTED → Corrected
- GHES_SEV_RECOVERABLE → Recoverable
- GHES_SEV_PANIC → Fatal
7.2 HiSilicon HIP08
UUID 1f8161e1-... (Type1) / 45534ea6-... (Type2) / b2889fc9-... (PCIe Local)
OEM Type-1 module_id:
| id | Name | Sub-modules |
| 0 | MN (Miscellaneous Node) | — |
| 1 | PLL | TB_PLL0-3 / TA_PLL0-3 / NIMBUS_PLL0-4 |
| 2 | SLLC | TB_SLLC0-2 / TA_SLLC0-2 / NIMBUS_SLLC0-1 |
| 3 | AA | — |
| 4 | SIOE | TB_SIOE0-3 / TA_SIOE0-3 / NIMBUS_SIOE0-1 |
| 5 | POE | TB_POE / TA_POE |
| 8 | DISP | HAC / PCIE / IO_MGMT / NETWORK |
| 9 | LPC | — |
| 13 | GIC | — |
| 14 | RDE | — |
| 15 | SAS | SAS0/1 |
| 16 | SATA | — |
| 17 | USB | — |
OEM Type-2 module_id:
| id | Name |
|---|---|
| 0 | SMMU (HAC/PCIE/MGMT/NIC) |
| 1 | HHA (Hydra Home Agent) — TB/TA_HHA0-1 |
| 2 | PA (Proxy Agent) |
| 3 | HLLC — HLLC0-2 |
| 4 | DDRC (DDR Controller) — TB/TA_DDRC0-3 |
| 5 | L3T (L3 Tag) — TB/TA_PARTITION0-7 |
| 6 | L3D (L3 Data) — TB/TA_BANK0-3 |
PCIe Local sub_module_id:
| id | Name |
|---|---|
| 0 | AP (Application Layer) |
| 1 | TL (Transaction Layer) |
| 2 | MAC |
| 3 | DL (Data Link Layer) |
| 4 | SDI |
7.3 HiSilicon Common (Kunpeng916/920/930)
UUID c8b328a8-... with roughly 50 module names (MN/PLL/SLLC/AA/SIOE/POE/CPA/DISP/GIC/ITS/AVSBUS/CS/PPU/SMMU/PA/HLLC/DDRC/L3TAG/L3DATA/PCS/HHA/PCIe Local/SAS/SATA/NIC/RoCE/USB/ZIP/HPRE/SEC/RDE/MEE/L4D/Tsensor/ROH/BTC/HILINK/STARS/SDMA/UC/HBMC/PMC/SCHE/ASMB_DFS/ASMB_NTU/UB/UMMU/PCU/UCMI/DJTAGM/CFGBUS/MPU/CRG)
Severity: NFE=0 (recoverable) / FE=1 (fatal) / CE=2 (corrected) / NONE=3
7.4 Ampere
UUID e8ed898d-...
| type | Name | Sub-errors |
|---|---|---|
| 0 | CPM | Snoop-Logic, ARMv8 Core 0/1 |
| 1 | MCU (Memory Controller Unit) | ERR0-6, Link Error |
| 2 | MESH | Cross Point, Home Node IO/Memory, CCIX Node |
| 3 | 2P Link Altra | — |
| 4 | 2P Link Altra Max | ERR0-3 |
| 5 | GIC | ERR0-12, ITS 0-7 |
| 6 | SMMU | TBU0-9, TCU |
| 7 | PCIe AER | — |
| 8 | PCIe RASDP | — |
| 9 | OCM (On-Chip Memory) | ERR0-2 |
| 10 | SMPRO | ERR0/ERR1/MPA_ERR |
| 11 | PMPRO | ERR0/ERR1/MPA_ERR |
| 12 | ATF FW | EL3, SPM, Secure Partition |
| 13 | SMPRO FW | — |
| 14 | PMPRO FW | — |
| 63 | BERT | Boot Error Record Table |
Payload types: 0=ARMv8 RAS (APEI/BMC) / 1=PCIe AER / 2=PCIe RASDP / 3=Firmware-Specific (ATF/SMpro/PMpro/BERT)
7.5 NVIDIA
UUID 6d5244f2-... (added in a recent commit)
nvidia_ns_decode() decodes these fields: signature[16], error_type, error_instance, severity, socket, number_regs, instance_base, regs[] (addr/value pairs).
7.6 Jaguar Micro (Corsica1.0)
5 UUIDs + 15 subsystems:
| subsystem_id | Name | Modules |
|---|---|---|
| 0 | AP/CSUB | CORE |
| 1 | CMN | MXP, HNI, HNF, SBSX, CCG, HND |
| 2 | DDRH | DDRCtrl, DDRPHY, SRAM |
| 3 | DDRV | DDRCtrl, DDRPHY, SRAM |
| 4 | GIC | GICIP, GICSRAM |
| 5 | IOSUB | SMMU(TBU/TCU), NIC450, OTHER(RAM) |
| 6 | SCP | SRAM, WDT, PLL |
| 7 | MCP | SRAM, WDT |
| 8 | IMU0 | SRAM, WDT |
| 9 | DPE | EPG, PIPE, EMEP, IMEP, EPAE, IPAE, ETH, TPG, MIG, HIG, DPETOP, SMMU |
| 10 | RPE | TOP, TXP_RXP, SMMU |
| 11 | PSUB | PCIE0(RAS0/RAS1), UP_MIX, PCIE1, PTOP, N2IF, VPE0/1_RAS, X2RC/X16RC_SMMU, SDMA_SMMU |
| 12 | HAC | SRAM, SMMU |
| 13 | TCM | SRAM, SMMU, IP |
| 14 | IMU1 | SRAM, WDT |
Severity: 0=recoverable (NFE) / 1=fatal (FE) / 2=corrected (CE) / 3=none
7.7 Yitian (Alibaba T-Head)
UUID a6980811-...
YITIAN_RAS_TYPE_DDR=0x50: DDR ECC register dump (ECCCFG0/1, ECCSTAT, ECCERRCNT, ECCCADDR0/1, ECCCSYN0-2, ECCUADDR0/1, ECCUSYN0-2, ECCBITMASK0-2, ADVECCSTAT, ECCAPSTAT, ECCCDATA0/1, ECCUDATA0/1, ECCSYMBOL, ECCERRCNTCTL/STAT, ECCERRCNT0/1, RESERVED0-2)
8. memory-failure error codes
ras-memory-failure-handler.c. Page types (from the kernel's enum mf_action_page_type):
| Value | Name | System impact |
|---|---|---|
| 0 | MF_MSG_KERNEL | Poison hit a reserved kernel page; usually a panic |
| 1 | MF_MSG_KERNEL_HIGH_ORDER | High-order kernel allocation; usually fatal |
| 2 | MF_MSG_SLAB | Kernel object slab; usually fatal |
| 3 | MF_MSG_DIFFERENT_COMPOUND | Compound page changed under the lock; retry / abort |
| 4 | MF_MSG_HUGE | In-use hugepage; migrate / kill the consumer |
| 5 | MF_MSG_FREE_HUGE | Free hugepage; offlined |
| 6 | MF_MSG_UNMAP_FAILED | Unmap failed; page stays poisoned |
| 7 | MF_MSG_DIRTY_SWAPCACHE | Dirty swap cache; possible data loss; task killed |
| 8 | MF_MSG_CLEAN_SWAPCACHE | Clean swap cache; dropped and reloaded |
| 9 | MF_MSG_DIRTY_MLOCKED_LRU | Dirty mlocked page; task killed |
| 10 | MF_MSG_CLEAN_MLOCKED_LRU | Clean mlocked page; dropped and reloaded |
| 11 | MF_MSG_DIRTY_UNEVICTABLE_LRU | Dirty unevictable page; task killed |
| 12 | MF_MSG_CLEAN_UNEVICTABLE_LRU | Clean unevictable page; dropped |
| 13 | MF_MSG_DIRTY_LRU | Dirty LRU page; task killed; possible data loss |
| 14 | MF_MSG_CLEAN_LRU | Clean LRU page; dropped and reloaded |
| 15 | MF_MSG_TRUNCATED_LRU | Already-truncated LRU page; recovered |
| 16 | MF_MSG_BUDDY | Free buddy page; removed from the free list |
| 17 | MF_MSG_DAX | DAX (pmem / CXL persistent); application receives SIGBUS |
| 18 | MF_MSG_UNSPLIT_THP | THP split failed; task killed |
| 19 | MF_MSG_UNKNOWN | Unknown; logged only |
action_result:
| Value | Name | Meaning |
|---|---|---|
| 0 | MF_IGNORED | Could not be handled; ignored |
| 1 | MF_FAILED | Handling failed; a panic may be required |
| 2 | MF_DELAYED | Handling deferred |
| 3 | MF_RECOVERED | Recovered successfully |
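A small triage helper shows how the page type and action_result tables combine. This is a sketch under stated assumptions: the numeric order is taken from the two tables in this section and is assumed to match the running kernel, and the function name summarize_memory_failure and the "needs attention" policy (anything other than MF_DELAYED or MF_RECOVERED) are hypothetical choices made for illustration, not rasdaemon behavior.

```python
# Names indexed by the numeric values listed in the tables above.
MF_PAGE_TYPES = [
    "MF_MSG_KERNEL", "MF_MSG_KERNEL_HIGH_ORDER", "MF_MSG_SLAB",
    "MF_MSG_DIFFERENT_COMPOUND", "MF_MSG_HUGE", "MF_MSG_FREE_HUGE",
    "MF_MSG_UNMAP_FAILED", "MF_MSG_DIRTY_SWAPCACHE", "MF_MSG_CLEAN_SWAPCACHE",
    "MF_MSG_DIRTY_MLOCKED_LRU", "MF_MSG_CLEAN_MLOCKED_LRU",
    "MF_MSG_DIRTY_UNEVICTABLE_LRU", "MF_MSG_CLEAN_UNEVICTABLE_LRU",
    "MF_MSG_DIRTY_LRU", "MF_MSG_CLEAN_LRU", "MF_MSG_TRUNCATED_LRU",
    "MF_MSG_BUDDY", "MF_MSG_DAX", "MF_MSG_UNSPLIT_THP", "MF_MSG_UNKNOWN",
]
MF_RESULTS = ["MF_IGNORED", "MF_FAILED", "MF_DELAYED", "MF_RECOVERED"]


def summarize_memory_failure(page_type, result):
    """Translate one memory-failure event into names plus a triage flag."""
    if 0 <= page_type < len(MF_PAGE_TYPES):
        type_name = MF_PAGE_TYPES[page_type]
    else:
        type_name = "MF_MSG_UNKNOWN"  # assumption: treat out-of-range as unknown
    if 0 <= result < len(MF_RESULTS):
        result_name = MF_RESULTS[result]
    else:
        result_name = "unknown"
    # Hypothetical policy: flag everything the kernel did not recover or defer.
    needs_attention = result_name not in ("MF_DELAYED", "MF_RECOVERED")
    return {
        "page_type": type_name,
        "result": result_name,
        "needs_attention": needs_attention,
    }
```

Under this policy, a dirty LRU page that was recovered (13, 3) is not flagged, while a poisoned kernel page whose handling failed (0, 1) is.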
9. devlink health events
ras-devlink-handler.c:
| Event | Fields | Typical system impact |
|---|---|---|
| net:net_dev_xmit_timeout | driver, name, queue | NIC TX queue hang; triggers a NIC reset |
| devlink:devlink_health_report | bus_name, dev_name, driver_name, reporter_name, msg | The driver reports a health/RAS event through a devlink reporter; reporter_name identifies which reporter fired (e.g. mlx5 tx/fw/hw_err) |
There are no fixed error codes; the reporter content is defined by the driver.
10. diskerror error codes
ras-diskerror-handler.c consumes block:block_rq_error:
| errno | Name | System impact |
|---|---|---|
| -EOPNOTSUPP | operation not supported | Block device does not support the operation |
| -ETIMEDOUT | timeout | I/O timeout; retry / replace the disk |
| -ENOSPC | critical space allocation | Thin-provisioned space exhausted |
| -ENOLINK | recoverable transport | SAS/FC link error, recoverable |
| -EREMOTEIO | critical target | Critical SCSI target error; failover |
| -EBADE | critical nexus | Critical I_T nexus error |
| -ENODATA | critical medium | Disk medium error (bad sector); remap; replace the disk |
| -EILSEQ | protection | T10 PI / DIF protection error |
| -ENOMEM | kernel resource | Kernel allocator failure |
| -EBUSY | device resource | Device resources exhausted |
| -EAGAIN | nonblocking retry | Non-blocking retry |
| -EREMCHG | dm internal retry | device-mapper internal retry |
| -EIO | I/O error | Generic I/O failure |
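The errno table can be turned into a lookup keyed by symbolic name, which avoids hard-coding platform-specific numbers. The sketch below is an illustration only: the name strings are copied from the table above, while describe_blk_error and the "unknown block error" fallback are hypothetical. It resolves each symbol through Python's standard errno module and skips any symbol the platform does not define.

```python
import errno

# Symbolic errno name -> block-layer error name, copied from the table above.
BLK_ERROR_NAMES = {
    "EOPNOTSUPP": "operation not supported",
    "ETIMEDOUT": "timeout",
    "ENOSPC": "critical space allocation",
    "ENOLINK": "recoverable transport",
    "EREMOTEIO": "critical target",
    "EBADE": "critical nexus",
    "ENODATA": "critical medium",
    "EILSEQ": "protection",
    "ENOMEM": "kernel resource",
    "EBUSY": "device resource",
    "EAGAIN": "nonblocking retry",
    "EREMCHG": "dm internal retry",
    "EIO": "I/O error",
}


def describe_blk_error(err):
    """Map a (usually negative) errno from block_rq_error to its name."""
    code = abs(err)
    for symbol, name in BLK_ERROR_NAMES.items():
        # getattr guards against symbols missing on non-Linux platforms.
        if getattr(errno, symbol, None) == code:
            return name
    return "unknown block error"  # hypothetical fallback
```

A call such as describe_blk_error(-errno.ENODATA) then yields the "critical medium" name from the table, which is the case that most directly points at a failing disk.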
11. signal error codes (ras-signal-handler.c)
SIGBUS codes:
| code | Name | Meaning |
|---|---|---|
| 1 | BUS_ADRALN | Invalid address alignment |
| 2 | BUS_ADRERR | Nonexistent physical address |
| 3 | BUS_OBJERR | Object-specific hardware error |
| 4 | BUS_MCEERR_AR | Hardware memory error consumed (action required); process killed + page offlined |
| 5 | BUS_MCEERR_AO | Hardware memory error detected but not yet consumed; recovery is optional |
signal:signal_generate results:
| Value | Name | Meaning |
|---|---|---|
| 0 | TRACE_SIGNAL_DELIVERED | Delivered |
| 1 | TRACE_SIGNAL_IGNORED | Ignored |
| 2 | TRACE_SIGNAL_ALREADY_PENDING | Already pending; no-op |
| 3 | TRACE_SIGNAL_OVERFLOW_FAIL | Queue full |
| 4 | TRACE_SIGNAL_LOSE_INFO | siginfo lost |
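To separate hardware memory errors from ordinary SIGBUS causes, the two tables can be combined in one decoder. This is a minimal sketch with hypothetical names (decode_sigbus and the returned keys are not part of rasdaemon); the code and result values are copied from the tables above.

```python
# si_code values for SIGBUS and signal_generate results, from the tables above.
SIGBUS_CODES = {
    1: "BUS_ADRALN",
    2: "BUS_ADRERR",
    3: "BUS_OBJERR",
    4: "BUS_MCEERR_AR",
    5: "BUS_MCEERR_AO",
}
SIGNAL_RESULTS = {
    0: "TRACE_SIGNAL_DELIVERED",
    1: "TRACE_SIGNAL_IGNORED",
    2: "TRACE_SIGNAL_ALREADY_PENDING",
    3: "TRACE_SIGNAL_OVERFLOW_FAIL",
    4: "TRACE_SIGNAL_LOSE_INFO",
}


def decode_sigbus(code, result):
    """Decode a SIGBUS si_code and the signal_generate result."""
    code_name = SIGBUS_CODES.get(code, "unknown")
    return {
        "code": code_name,
        "result": SIGNAL_RESULTS.get(result, "unknown"),
        # Codes 4 and 5 are the machine-check (hwpoison) variants.
        "hw_memory_error": code_name in ("BUS_MCEERR_AR", "BUS_MCEERR_AO"),
        "delivered": result == 0,
    }
```

Only codes 4 and 5 set hw_memory_error, matching the "signal SIGBUS 4/5" row used elsewhere in this manual.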
12. reri (RISC-V RAS Error Report Register Interface)
ras-reri-handler.h, RERI_EC_* error codes:
| Code | Name | Class | System impact |
|---|---|---|---|
| 0 | RERI_EC_NONE | - | None |
| 1 | RERI_EC_OUE | Unknown | Unspecified error |
| 2 | RERI_EC_CDA | Cache | Corrupted data access |
| 3 | RERI_EC_CBA | Cache | Cache block data error |
| 4 | RERI_EC_CSD | Cache | Detected by cache scrubbing |
| 5 | RERI_EC_CAS | Cache | Cache address / state error |
| 6 | RERI_EC_CUE | Cache | Unspecified cache error |
| 7 | RERI_EC_SDC | Microarchitecture | Snoop / directory address or control state error |
| 8 | RERI_EC_SUE | Unknown | Unspecified snoop / directory error |
| 9 | RERI_EC_TPD | TLB | TLB / page-table cache data error |
| 10 | RERI_EC_TPA | TLB | TLB / page-table cache address or control state error |
| 11 | RERI_EC_TPU | TLB | Unknown TLB / page-table cache error |
| 12 | RERI_EC_HSE | Microarchitecture | Hart state error |
| 13 | RERI_EC_ICS | Unknown | Interrupt controller state error |
| 14 | RERI_EC_ITD | Microarchitecture | Interconnect data error |
| 15 | RERI_EC_ITO | Microarchitecture | Other interconnect error |
| 16 | RERI_EC_IWE | Microarchitecture | Internal watchdog error |
| 17 | RERI_EC_IDE | Microarchitecture | Internal datapath / memory / execution unit error |
| 18 | RERI_EC_SBE | Bus | System memory command / address bus error |
| 19 | RERI_EC_SMU | Microarchitecture | Unspecified system memory error |
| 20 | RERI_EC_SMD | Microarchitecture | System memory data error |
| 21 | RERI_EC_SMS | Microarchitecture | Detected by system memory scrubbing |
| 22 | RERI_EC_PIO | Microarchitecture | Protocol error: illegal I/O |
| 23 | RERI_EC_PUS | Microarchitecture | Protocol error: unexpected state |
| 24 | RERI_EC_PTO | Microarchitecture | Protocol error: timeout |
| 25 | RERI_EC_SIC | Microarchitecture | System internal controller error |
| 26 | RERI_EC_DPU | Unknown | Deferred-error passthrough not supported |
| 27 | RERI_EC_PCX | Unknown | Error detected by PCI/CXL |
Transaction types (TT): 0=Unspecified / 1=Custom / 4=Explicit Read / 5=Explicit Write / 6=Implicit Read / 7=Implicit Write
Address info types (AIT): 0=None / 1=SPA (Supervisor Physical) / 2=GPA / 3=VA
Source types: 0=CPU / 1=IOMMU / 2=Unknown
Severity derivation: UEC→FATAL, UED→RECOVERABLE, CE→CORRECTED, else INFORMATIONAL
Behavior: a CPU-sourced FATAL/RECOVERABLE error calls ras_record_cpu_error(hart_id) (requires HAVE_CPU_FAULT_ISOLATION); errors at RECOVERABLE or above are also reported to ABRT
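The severity derivation and the follow-up behavior described above can be written as two small functions. This is a sketch of the rule as stated in this section, not a transcription of ras-reri-handler.c: the precedence (UEC checked first, then UED, then CE) follows the order in which the rule is listed and is an assumption, and the function names, the cpu_isolation and abrt_enabled flags, and the "abrt_report" label are hypothetical.

```python
SOURCE_CPU = 0  # source type 0 = CPU, per the list above


def reri_severity(uec, ued, ce):
    """Derive severity from the UEC / UED / CE status flags."""
    # Assumption: flags are checked in the order the rule lists them.
    if uec:
        return "FATAL"
    if ued:
        return "RECOVERABLE"
    if ce:
        return "CORRECTED"
    return "INFORMATIONAL"


def reri_actions(source_type, severity, cpu_isolation=True, abrt_enabled=True):
    """List the follow-up actions described for one RERI event."""
    actions = []
    serious = severity in ("FATAL", "RECOVERABLE")
    # CPU error accounting applies only to CPU-sourced errors and only
    # when built with HAVE_CPU_FAULT_ISOLATION (modelled by cpu_isolation).
    if serious and source_type == SOURCE_CPU and cpu_isolation:
        actions.append("ras_record_cpu_error")
    # ABRT reporting is assumed to depend on --enable-abrt-report.
    if serious and abrt_enabled:
        actions.append("abrt_report")
    return actions
```

With these definitions, an IOMMU-sourced fatal error produces only the ABRT action, while a corrected CPU error produces none.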
13. erst (APEI ERST): MCE replay
ras-erst.c:
- Consumes only the mce-erst* files under /sys/fs/pstore/erst/
- Decodes them through the existing MCE handlers (Intel parse_intel_event, AMD K8 parse_amd_k8_event, AMD SMCA parse_amd_smca_event)
- Emits a synthetic mce_erst_record event
- Deletes the files when ERST_DELETE=1
- Does not support other APEI ERST record types such as CPER generic / AER
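To check what a replay would pick up, the file selection step can be approximated in a few lines. This sketch only mirrors the mce-erst* filter described above; it does not decode or delete anything. The function names and the example file names used in the comment are hypothetical.

```python
import fnmatch
import os

ERST_DIR = "/sys/fs/pstore/erst"  # directory named in the list above


def select_mce_erst(names):
    """Keep only names matching mce-erst*, sorted for stable output."""
    # e.g. "mce-erst-1" is kept, "dmesg-erst-1" is dropped (names are made up).
    return sorted(n for n in names if fnmatch.fnmatchcase(n, "mce-erst*"))


def list_mce_erst_files(erst_dir=ERST_DIR):
    """Return full paths of MCE ERST records, or [] if the dir is absent."""
    if not os.path.isdir(erst_dir):
        return []
    return [os.path.join(erst_dir, n) for n in select_mce_erst(os.listdir(erst_dir))]
```

Running list_mce_erst_files() before starting rasdaemon with ERST_DELETE=1 gives a simple way to see which records are present, since they are removed once replayed in that mode.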
14. Global syslog severity mapping
| Event class | Log level |
|---|---|
| MCE Uncorrected / Deferred | LOG_CRIT |
| MCE Fatal | LOG_CRIT |
| AER Corrected | LOG_ERR |
| AER Uncorrected Non-Fatal | LOG_CRIT |
| AER Uncorrected Fatal | LOG_EMERG |
| MC Corrected | LOG_ERR |
| MC Uncorrected / Deferred | LOG_CRIT |
| MC Fatal | LOG_EMERG |
| MC Info | LOG_DEBUG |
| extlog recoverable | LOG_CRIT |
| extlog fatal | LOG_EMERG |
| extlog corrected | LOG_ERR |
| extlog informational | LOG_INFO |
| CXL Poison | LOG_ERR |
| CXL AER UE | LOG_CRIT |
| memory-failure | LOGLEVEL_ALERT |
| diskerror | LOG_ERR |
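One practical use of this table is deciding which event classes should raise an alert. The sketch below is an illustration under assumptions: the class-to-level pairs are copied from the table above, the numeric priorities are the standard syslog ones (lower is more severe), LOGLEVEL_ALERT is treated as equivalent to LOG_ALERT (both have priority 1), and the function name events_at_or_above is hypothetical.

```python
# Standard syslog priorities: lower number = more severe.
SYSLOG_PRIORITY = {
    "LOG_EMERG": 0, "LOG_ALERT": 1, "LOG_CRIT": 2, "LOG_ERR": 3,
    "LOG_WARNING": 4, "LOG_NOTICE": 5, "LOG_INFO": 6, "LOG_DEBUG": 7,
}

# Event class -> log level, copied from the table above.
EVENT_LOG_LEVEL = {
    "MCE Uncorrected / Deferred": "LOG_CRIT",
    "MCE Fatal": "LOG_CRIT",
    "AER Corrected": "LOG_ERR",
    "AER Uncorrected Non-Fatal": "LOG_CRIT",
    "AER Uncorrected Fatal": "LOG_EMERG",
    "MC Corrected": "LOG_ERR",
    "MC Uncorrected / Deferred": "LOG_CRIT",
    "MC Fatal": "LOG_EMERG",
    "MC Info": "LOG_DEBUG",
    "extlog recoverable": "LOG_CRIT",
    "extlog fatal": "LOG_EMERG",
    "extlog corrected": "LOG_ERR",
    "extlog informational": "LOG_INFO",
    "CXL Poison": "LOG_ERR",
    "CXL AER UE": "LOG_CRIT",
    "memory-failure": "LOG_ALERT",  # listed as LOGLEVEL_ALERT in the table
    "diskerror": "LOG_ERR",
}


def events_at_or_above(threshold):
    """Return event classes logged at `threshold` severity or worse."""
    limit = SYSLOG_PRIORITY[threshold]
    return sorted(
        name for name, level in EVENT_LOG_LEVEL.items()
        if SYSLOG_PRIORITY[level] <= limit
    )
```

Calling events_at_or_above("LOG_CRIT") selects the uncorrected and fatal classes plus memory-failure, and leaves out corrected and informational events.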
15. Key summary: "what it is" vs. "what it does"
| Class | rasdaemon records | rasdaemon automatic action | Actual kernel action |
|---|---|---|---|
| MCE CE | ✓ | None | Corrected automatically |
| MCE UC | ✓ | None | #MC handler (kill process / panic) |
| MC CE | ✓ | Page/row PFA → sysfs write | soft/hard_offline_page |
| MC UE/Fatal | ✓ | Runs the MC_UE_TRIGGER script | MCE / SIGBUS |
| AER CE | ✓ | Runs the AER_CE_TRIGGER script | AER automatic retrain |
| AER UE | ✓ | Runs the AER_UE_TRIGGER script | AER reset / hot-plug handling |
| CXL Poison | ✓ | None | Kernel error handling |
| CXL DER CE-threshold | ✓ | ras_hw_threshold_pageoffline() | Soft offline |
| ARM | ✓ | CPU error counting (optional) | GHES handling (already done) |
| extlog | ✓ | None | Already done |
| memory-failure | ✓ | Runs the MEM_FAIL_TRIGGER script | hwpoison completed |
| devlink | ✓ | None | Handled inside the driver |
| diskerror | ✓ | None | Block-device retry / handled by upper layers |
| signal SIGBUS 4/5 | ✓ | None | Kill task / deliver signal |
| reri | ✓ | CPU error counting (optional) | Kernel handling (already done) |
| erst | ✓ | Deletes the files when ERST_DELETE=1 | pstore replay at boot |
16. Key file index
| File | Lines | Purpose |
|---|---|---|
| ras-mce-handler.c | 636+ | MCE dispatcher, report_mce_event, ras_offline_mce_event |
| mce-intel.c | 332+ | Generic Intel MCA decoding + AR status + memory controller |
| mce-intel-{nehalem,sb,ivb,haswell,broadwell-de,broadwell-epex,dunnington,knl,skylake-xeon,i10nm,granite,tulsa,p4-p6}.c | 16 files | Platform-specific PCU/QPI/UPI/M2M/iMC error codes |
| mce-amd.c | 124+ | AMD generic + decode_amd_errcode |
| mce-amd-k8.c | 252+ | K8 northbridge extended errors |
| mce-amd-smca.c | 998+ | Full SMCA bank table (LS/IF/L2/DE/EX/FP/L3/CS/PIE/UMC/MA_LLC/PB/PSP/SMU/MP5/MPDMA/NBIO/PCIE/XGMI_PCS/XGMI_PHY/NBIF/SATA/USB/USR_DP/USR_CP/GMI_PCS) |
| mce-zhaoxin-kh50000.c | 400+ | All Zhaoxin KH-50000 error codes |
| ras-mc-handler.c | 348+ | 5 EDAC error types + PFA trigger |
| ras-page-isolation.c | 850+ | Page/row PFA + sysfs writes |
| ras-aer-handler.c | 349+ | PCIe AER decoding, 24+ bits |
| ras-arm-handler.c | 600+ | ARM PEI decoding |
| ras-cxl-handler.c | 1674+ | 9 CXL events + 30+ AER sub-codes + 5 record classes |
| ras-extlog-handler.c | - | 16 extlog err_type values + 4 severities |
| ras-non-standard-handler.c | - | CPER section dispatch |
| non-standard-{hisi_hip08,hisilicon,ampere,nvidia,jaguarmicro,yitian}.c | - | 6 vendor decoders |
| ras-memory-failure-handler.c | 229+ | 20 page types + 4 results |
| ras-devlink-handler.c | - | 2 devlink events |
| ras-diskerror-handler.c | - | 13 errno values |
| ras-signal-handler.c | - | 5 SIGBUS codes + 5 signal results |
| ras-reri-handler.c | - | 27 RERI_EC_* codes |
| ras-erst.c | - | MCE ERST replay |
| ras-events.c | 1249+ | Registration and dispatch of all events |
| ras-events.h | - | All enums (severity, page_type, event type) |
| ras-record.c | - | All SQLite schemas |
17. Debugging and quick-lookup tips
| What you want to know | Where to look |
|---|---|
| MCE error code → string | mce-error.c tool (util/) |
| CXL AER detailed bits | util/ras-mc-ctl.in:1205-1357 |
| SQLite schema | ras-record.c |
| Severity mapping | loglevel_str[] arrays in each *-handler.c |
| Trigger scripts | trigger.c + *_trigger_setup in ras-*-handler.c |
| Page retirement | ras-page-isolation.c + sysfs soft_offline_page |
| ERST replay | /sys/fs/pstore/erst/ + ras-erst.c |
| ABRT reports | ras-report.c |
| OpenBMC SEL | unified-sel.c |