linux – hard resetting link: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action
The situation is as follows:
Production Linux server running Debian 7
Manufacturer: Supermicro
SATA controller: Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 04)
2x SSD, 2x HDD

Every drive is capable of SATA Rev 3 (6.0 Gb/s):

    hdparm -I /dev/sd[a-d] | egrep "Model|speed|Transport"

    Model Number: TOSHIBA THNSNH128GBST
    Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    * Gen1 signaling speed (1.5Gb/s)
    * Gen2 signaling speed (3.0Gb/s)
    * Gen3 signaling speed (6.0Gb/s)
    * SMART Command Transport (SCT) feature set
    Model Number: TOSHIBA THNSNH128GBST
    Transport: Serial, SATA Rev 3.0
    * Gen1 signaling speed (1.5Gb/s)
    * Gen2 signaling speed (3.0Gb/s)
    * Gen3 signaling speed (6.0Gb/s)
    * SMART Command Transport (SCT) feature set
    Model Number: ST2000VX000-1CU164
    Transport: Serial, SATA Rev 3.0
    * Gen1 signaling speed (1.5Gb/s)
    * Gen2 signaling speed (3.0Gb/s)
    * Gen3 signaling speed (6.0Gb/s)
    * SMART Command Transport (SCT) feature set

The kernel messages suggest (at least to me) a problem on all four drives, which leads me to believe that the SATA controller may be at fault.

    ata1: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata1: irq_stat 0x00400040, connection status changed
    ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata1: hard resetting link
    ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata2: irq_stat 0x00400040, connection status changed
    ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata2: hard resetting link
    ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata4: irq_stat 0x00400040, connection status changed
    ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata4: hard resetting link
    ata3: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    ata3: irq_stat 0x00400040, connection status changed
    ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
    ata3: hard resetting link
    ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
    ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata2.00: configured for UDMA/33
    ata2: EH complete
    ata1.00: configured for UDMA/33
    ata1: EH complete
    ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
    ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
    ata3.00: configured for UDMA/33
    ata3: EH complete
    ata4.00: configured for UDMA/33
    ata4: EH complete

What I have figured out (or believe I have figured out)

The SECURITY FREEZE LOCK and DEVICE CONFIGURATION OVERLAY commands are not relevant to this problem.

While reading about 20 bug reports and a lot of documentation, I found that some of the linked pages suggest disabling NCQ, which I did: first for one device and, after waiting a day and seeing the error recur, for all four devices.

    echo "1" >/sys/block/sdc/device/queue_depth

This made no noticeable difference.

https://ata.wiki.kernel.org/index.php/Libata_error_messages
https://wiki.archlinux.org/index.php/Solid_State_Drives#Resolving_NCQ_errors

Others point to the SATA cables, or even to an incompatibility between the board and the drives. However, since I seem to have either a problem on one drive that propagates to all four, or a problem on all four devices directly, I cannot narrow it down any further.

Because this is a production server, taking it down for maintenance (i.e. BIOS / kernel parameter changes) is possible, but I would like to avoid that as far as I can.
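As a cross-check on the kernel messages, the raw SErr value can be decoded by hand. The sketch below assumes the standard SATA SError bit layout and the short flag names that libata prints (recalled from the libata error handler, so treat the exact spellings as an assumption); `decode_serr` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch: decode a SATA SError register value into libata-style flag names.
# Assumption: bit positions follow the SATA SError register layout
# (bits 0-11 error field, bits 16-26 diagnostics field).
decode_serr() {
    serr_val=$(( $1 ))
    serr_out=""
    for serr_entry in \
        0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto 11:HostInt \
        16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B 20:Dispar 21:BadCRC \
        22:Handshk 23:LinkSeq 24:TrStaTrns 25:UnrecFIS 26:DevExch
    do
        serr_bit=${serr_entry%%:*}     # part before the colon: bit number
        serr_name=${serr_entry#*:}     # part after the colon: flag name
        if [ $(( (serr_val >> serr_bit) & 1 )) -eq 1 ]; then
            serr_out="${serr_out:+$serr_out }$serr_name"
        fi
    done
    printf '%s\n' "$serr_out"
}

# The value from the log above
decode_serr 0x4090800
```

For 0x4090800 this yields the same four flags the kernel prints in the `SError: { ... }` line. PHYRdyChg and DevExch together describe the PHY dropping and a device (re)appearing, which is a link-level event rather than a failed command.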
According to the hosting provider, this may be related to power management:

https://bugzilla.kernel.org/show_bug.cgi?id=74961

    echo "medium_power" >/sys/class/scsi_host/host0/link_power_management_policy

Before the change this was set to max_performance. This did not help either.

The SMART values of the HDDs / SSDs are OK, nothing too conspicuous.

Note that the UDMA value now appears to be only 33. At boot, these were the SATA link speeds:

    [ 3.161850] ata6: SATA link down (SStatus 0 SControl 300)
    [ 3.161867] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [ 3.161882] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [ 3.161894] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [ 3.161907] ata5: SATA link down (SStatus 0 SControl 300)

This may only happen under high load on the HDDs. I have not tested that, because it would noticeably affect server performance. There is no load on the SSDs: they are mounted but not used by any process.

As far as I know, the RAM is ECC.

    dmidecode -t 17

    # dmidecode 2.11
    SMBIOS 2.7 present.
    Handle 0x0023, DMI type 17, 34 bytes
    Memory Device
        Array Handle: 0x0022
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1600 MHz
        Manufacturer: Samsung
        Serial Number: 373A6427
        Asset Tag: 9876543210
        Part Number: M391B1G73QH0-CK0
        Rank: 2
        Configured Clock Speed: 1600 MHz

If I can provide any additional information, please let me know, as I am out of ideas about what to do next.

Solution
What the server is experiencing is basically a SATA renegotiation at a lower link speed, following a problem communicating with the drives.
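The renegotiation can be read straight from the SStatus and SControl values in the logs. The sketch below assumes those values are printed in hex without a prefix and follow the standard register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11); `decode_link_reg` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: split an SStatus/SControl value (hex, as printed by libata,
# without the 0x prefix) into its DET, SPD and IPM fields.
decode_link_reg() {
    reg_val=$(( 0x$1 ))
    reg_det=$(( reg_val & 0xf ))            # device detection
    reg_spd=$(( (reg_val >> 4) & 0xf ))     # speed (SStatus) or speed limit (SControl)
    reg_ipm=$(( (reg_val >> 8) & 0xf ))     # interface power management
    case $reg_spd in
        0) reg_speed="none" ;;
        1) reg_speed="1.5 Gbps" ;;
        2) reg_speed="3.0 Gbps" ;;
        3) reg_speed="6.0 Gbps" ;;
        *) reg_speed="unknown" ;;
    esac
    printf 'DET=%d SPD=%d (%s) IPM=%d\n' "$reg_det" "$reg_spd" "$reg_speed" "$reg_ipm"
}

decode_link_reg 133   # SStatus at boot
decode_link_reg 113   # SStatus after the hard reset
decode_link_reg 310   # SControl after the hard reset
```

In SStatus the SPD field is the negotiated speed, so 133 versus 113 is the drop from Gen3 to Gen1. In SControl the same field is a speed limit, where 0 means no restriction: going from 300 at boot to 310 after the reset suggests the kernel itself capped the link at 1.5 Gbps during error handling.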
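These factors may be at play here (ordered by probability):

- Very high-latency I/O operations (e.g. caused by the SSD controller's garbage collection) leading to SATA command timeouts. Do your drives support the SATA TRIM command? If so, try running fstrim /. Does it change anything?
- Bad motherboard/memory: is your memory ECC-protected? If not, run an extended (2-hour) memtest86 session if you can.
- Hardware/software driver incompatibility.
- Bad SATA controller: although unlikely, you cannot rule it out completely.
- Bad SATA cables/drives: since all four drives are giving you problems, this is very unlikely.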
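To answer the TRIM question before running fstrim, the `hdparm -I` output can be checked for the TRIM capability line. This is a minimal sketch: it assumes hdparm reports the feature with the text "Data Set Management TRIM supported", and `supports_trim` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read `hdparm -I` output on stdin and report whether the drive
# advertises TRIM. Assumption: hdparm labels the capability
# "Data Set Management TRIM supported".
supports_trim() {
    if grep -q 'Data Set Management TRIM supported'; then
        echo yes
    else
        echo no
    fi
}

# Usage (needs root, so left commented out):
#   for dev in /dev/sd[a-d]; do
#       printf '%s: ' "$dev"
#       hdparm -I "$dev" | supports_trim
#   done
```

If a drive reports yes and the filesystem is mounted, `fstrim -v /` additionally prints how much space was trimmed, which makes it easy to see whether the command did anything.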