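linux – How do I find the faulty memory module from an MCE message?
I am trying to make sense of an MCE message so I can work out which memory module on a server is bad. The message below appeared in /var/log/kern.log on one of our servers, which froze twice today.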
    Apr 13 22:39:22 mbox kernel: [36247975.116860] sbridge: HANDLING MCE MEMORY ERROR
    Apr 13 22:39:22 mbox kernel: [36247975.116867] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
    Apr 13 22:39:22 mbox kernel: [36247975.116869] TSC 0 ADDR 4a0d75900 MISC 21405cdc86 PROCESSOR 0:206d7 TIME 1428957562 SOCKET 0 APIC 0
    Apr 13 22:39:22 mbox kernel: [36247975.951013] EDAC MC0: 1 CE memory read error

I suspect a bad memory module. The server has 2x Xeon E5-2650 with 8x 8 GB memory modules (each CPU has 8 memory slots).

Here is the memory module listing from lshw:

    *-memory:0
         description: System Memory
         physical id: 2d
         slot: System board or motherboard
       *-bank:0
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-197.A
            vendor: Kingston
            physical id: 0
            serial: B83AE5C2
            slot: P1_DIMMA1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:1
            description: DIMM Synchronous [empty]
            product: Dimm1_PartNum
            vendor: Dimm1_Manufacturer
            physical id: 1
            serial: Dimm1_SerNum
            slot: P1_DIMMA2
            width: 64 bits
       *-bank:2
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-048.A
            vendor: Kingston
            physical id: 2
            serial: EC309238
            slot: P1_DIMMB1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:3
            description: DIMM Synchronous [empty]
            product: Dimm4_PartNum
            vendor: Dimm4_Manufacturer
            physical id: 3
            serial: Dimm4_SerNum
            slot: P1_DIMMB2
            width: 64 bits
       *-bank:4
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-048.A
            vendor: Kingston
            physical id: 4
            serial: E9305438
            slot: P1_DIMMC1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:5
            description: DIMM Synchronous [empty]
            product: Dimm7_PartNum
            vendor: Dimm7_Manufacturer
            physical id: 5
            serial: Dimm7_SerNum
            slot: P1_DIMMC2
            width: 64 bits
       *-bank:6
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-048.A
            vendor: Kingston
            physical id: 6
            serial: E7305738
            slot: P1_DIMMD1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:7
            description: DIMM Synchronous [empty]
            product: Dimm10_PartNum
            vendor: Dimm10_Manufacturer
            physical id: 7
            serial: Dimm10_SerNum
            slot: P1_DIMMD2
            width: 64 bits
    *-memory:1
         description: System Memory
         physical id: 3f
         slot: System board or motherboard
       *-bank:0
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-197.A
            vendor: Kingston
            physical id: 0
            serial: B63A08C3
            slot: P2_DIMME1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:1
            description: DIMM Synchronous [empty]
            product: Dimm1_PartNum
            vendor: Dimm1_Manufacturer
            physical id: 1
            serial: Dimm1_SerNum
            slot: P2_DIMME2
            width: 64 bits
       *-bank:2
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-048.A
            vendor: Kingston
            physical id: 2
            serial: EA309638
            slot: P2_DIMMF1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:3
            description: DIMM Synchronous [empty]
            product: Dimm4_PartNum
            vendor: Dimm4_Manufacturer
            physical id: 3
            serial: Dimm4_SerNum
            slot: P2_DIMMF2
            width: 64 bits
       *-bank:4
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-048.A
            vendor: Kingston
            physical id: 4
            serial: E7305938
            slot: P2_DIMMG1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:5
            description: DIMM Synchronous [empty]
            product: Dimm7_PartNum
            vendor: Dimm7_Manufacturer
            physical id: 5
            serial: Dimm7_SerNum
            slot: P2_DIMMG2
            width: 64 bits
       *-bank:6
            description: DIMM DDR3 1333 MHz (0,8 ns)
            product: 9965516-048.A
            vendor: Kingston
            physical id: 6
            serial: E7305B38
            slot: P2_DIMMH1
            size: 8GiB
            width: 64 bits
            clock: 1333MHz (0.8ns)
       *-bank:7
            description: DIMM Synchronous [empty]
            product: Dimm10_PartNum
            vendor: Dimm10_Manufacturer
            physical id: 7
            serial: Dimm10_SerNum
            slot: P2_DIMMH2
            width: 64 bits
    *-memory:2 UNCLAIMED
         physical id: 7
    *-memory:3 UNCLAIMED
         physical id: 9

As you can see, there is no memory module in bank #5. So my questions are: do you agree that this message is about a memory failure? If so, how can I find out which module to replace?

Solution
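These errors come from EDAC (Error Detection and Correction), specifically the edac_mc device class.

The events you received are CE events (correctable errors). They indicate that a DIMM is starting to fail.

EDAC did not report anything specific about the memory row or channel involved, so it is hard to tell which module to replace until it actually fails.

But have a look at /sys/devices/system/edac/mc/mc*. It may tell you more about which row or DIMM is the likely culprit. For example:

    ls -s /sys/devices/system/edac/mc/mc0

Then read the ce_count file in that directory (for example with cat).

As a side note: the system can keep running, but less reliably. Preventive maintenance and proactive replacement of DIMMs that show CEs reduce the likelihood of the dreaded UE (uncorrectable error) events and system "panics".

More information on EDAC:
https://www.kernel.org/doc/Documentation/edac.txt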
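To make the ce_count check less manual, here is a minimal Python sketch that walks the EDAC sysfs tree and prints every corrected-error counter it finds. The function names (read_text, collect_ce_counts) are made up for this illustration. It assumes the usual layout: mc*/ce_count for the controller total, dimm*/dimm_ce_count and dimm*/dimm_label on newer kernels, and csrow*/ce_count on older ones. Which of these exist depends on the kernel version and the EDAC driver, so missing files are simply skipped.

```python
from pathlib import Path

# Default location of the EDAC memory-controller tree in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def collect_ce_counts(root=EDAC_ROOT):
    """Collect corrected-error counters under every mc*/ directory.

    Returns a list of (location, label, count) tuples. Files that do not
    exist on this kernel/driver are skipped rather than treated as errors.
    """
    results = []
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        total = read_text(mc / "ce_count")
        if total is not None:
            results.append((mc.name, "(controller total)", int(total)))
        # Assumed newer layout: one dimmN/ directory per module.
        for dimm in sorted(mc.glob("dimm[0-9]*")):
            count = read_text(dimm / "dimm_ce_count")
            if count is not None:
                label = read_text(dimm / "dimm_label") or "?"
                results.append((f"{mc.name}/{dimm.name}", label, int(count)))
        # Assumed older layout: one csrowN/ directory per chip-select row.
        for csrow in sorted(mc.glob("csrow[0-9]*")):
            count = read_text(csrow / "ce_count")
            if count is not None:
                results.append((f"{mc.name}/{csrow.name}", "(csrow)", int(count)))
    return results


if __name__ == "__main__":
    for location, label, count in collect_ce_counts():
        flag = "  <-- corrected errors seen" if count else ""
        print(f"{location:<16} {label:<32} {count}{flag}")
```

A non-zero counter on a single dimm or csrow entry is the one to look at first. How an EDAC label or csrow maps to a silkscreen slot name such as P1_DIMMA1 is board-specific, so the motherboard manual may still be needed.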
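The raw status word from the log can also be decoded by hand, which helps answer whether this really is a memory error. Note that "Bank 5" in the Machine Check Exception line refers to a CPU machine-check register bank, not to the *-bank:5 slot in the lshw output. The sketch below is a rough illustration that decodes only the architectural fields of IA32_MCi_STATUS as described in the Intel SDM; decode_status is a hypothetical helper, model-specific bits are ignored, and the corrected-error count field is only meaningful on CPUs that implement it.

```python
# Architectural flag bits of IA32_MCi_STATUS (layout assumed from the Intel SDM).
FLAG_BITS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}

# MMM field of the compound "memory controller error" code 000F 0000 1MMM CCCC.
MEMORY_OPS = {
    0: "generic/undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status):
    """Decode the architectural fields of a 64-bit MCi_STATUS value."""
    code = status & 0xFFFF  # bits 15:0, MCA error code
    info = {
        "flags": [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1],
        "uncorrected": bool((status >> 61) & 1),
        # Bits 52:38; only meaningful if the CPU reports corrected-error counts.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
        "memory_op": None,
        "channel": None,
    }
    # Match 000F 0000 1MMM CCCC, ignoring the F, MMM and CCCC bits.
    if (code & 0xEF80) == 0x0080:
        info["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return info


if __name__ == "__main__":
    for key, value in decode_status(0x8C00004000010090).items():
        print(f"{key}: {value}")
```

For 8c00004000010090 this reports VAL, MISCV and ADDRV set with UC clear (a corrected error), a memory read error on channel 0, and a corrected count of 1, which agrees with the "EDAC MC0: 1 CE memory read error" line. How channel 0 maps to a physical slot is board-specific.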