Personal home page for:
Patrick Mackinlay

Disk failures on patrick.ritherdon

Summarises the disk failures (CRC failure on read) on this machine, starting on 5th of August 2018.

Date         Disk serial  Software port  Hardware port  Cable
16 Sep 2019  Y7948UZAS    ada1
13 Aug 2019  Y781ZU9AS    ada5           2nd upper      blue cable (b1)
13 Jan 2019  Y781ZU9AS    ada2           1st lower      blue cable (b1)
23 Nov 2018  Y7948UZAS    ada3           1st upper
9 Oct 2018   Y781ZU9AS    ada2           1st lower      blue cable (b1)

Full event descriptions are listed below in reverse chronological order.

16 Sep 2019

Disk ada1 (Y7948UZAS) failed and was removed from the data zpool and the swap gmirror. The disk was re-added with the commands:

gmirror forget swapm0 ada1p1
gmirror insert swapm0 ada1p1
geli attach /dev/ada1p2
zpool online data ada1p2.eli
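
The same sequence can be wrapped in a small function so that it is easy to repeat for whichever disk drops out next. This is only a sketch: the name readd_disk and the DRY_RUN variable are made up for illustration, and it assumes the same partition layout as above (p1 for swap, p2 for the geli-encrypted zpool member). It passes only the mirror name to gmirror forget, as in the 13 Jan 2019 entry. By default it just prints the commands.

```shell
#!/bin/sh
# Sketch only: wraps the re-add sequence in a function.
# readd_disk and DRY_RUN are hypothetical names, not part of the original notes.
readd_disk() {
    disk=$1          # device name, e.g. ada1
    mirror=$2        # gmirror name, e.g. swapm0
    pool=${3:-data}  # zpool name, defaults to "data"

    # Default to a dry run that only prints the commands;
    # set DRY_RUN=0 to actually execute them (as root).
    run=echo
    [ "${DRY_RUN:-1}" = "0" ] && run=

    # 1. drop the stale mirror component, 2. re-insert the swap partition,
    # 3. re-attach the encrypted partition, 4. bring the vdev back online.
    $run gmirror forget "$mirror" &&
        $run gmirror insert "$mirror" "${disk}p1" &&
        $run geli attach "/dev/${disk}p2" &&
        $run zpool online "$pool" "${disk}p2.eli"
}

# Dry run for the 16 Sep 2019 case: prints the four commands.
readd_disk ada1 swapm0
```

With DRY_RUN=0 the geli attach step may still prompt for a passphrase or need key options, depending on how the provider was set up.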
 

The dmesg output for the error was much the same as usual and is listed below:

[2763238] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 f8 ac 0b 40 5a 00 00 00 00 00
[2763238] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
[2763238] (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
[2768398] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 30 20 db a7 40 c4 00 00 00 00 00
[2768398] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
[2768398] (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
...
[2856519] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 f0 49 9c 40 5a 00 00 00 00 00
[2856519] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
[2856519] (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
...
[2865815] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 03 00 40 02 00 00 00 00 00
[2865815] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
[2865815] (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
[2865815] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 85 e0 40 e8 00 00 00 00 00
[2865815] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
[2865815] (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
[2865815] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 87 e0 40 e8 00 00 00 00 00
[2865815] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
[2865815] (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
[2865815] ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1:  s/n Y7948UZAS detached
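
Since the log is long and repetitive, a per-disk count of the CRC lines is a quicker way to see which device is affected. A minimal sketch, assuming the dmesg line format shown above; the function name crc_errors_per_disk is hypothetical.

```shell
#!/bin/sh
# Sketch: count "Uncorrectable parity/CRC error" lines per device.
# Reads dmesg-style text on stdin, prints one "device count" pair per line.
crc_errors_per_disk() {
    # Splitting on "(" and ":" makes field 2 the device name in
    # lines such as "[123] (ada1:ahcich1:0:0:0): CAM status: ...".
    awk -F'[(:]' '
        /CAM status: Uncorrectable parity\/CRC error/ { n[$2]++ }
        END { for (d in n) print d, n[d] }
    ' | sort
}

# Usage: dmesg | crc_errors_per_disk
```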
 
14 Aug 2019

The machine was rebooted; note that the firefox process 87034 could not be killed! The blue cable b1, which was attached to ada5 (2nd upper hardware port, Y781ZU9AS), was replaced with the orange cable o3 from one of the SSD disks (2nd lower hardware port). The cables b1 and o3 were labelled with stickers; b2 is the remaining cable, and the other cables haven't been assigned names yet. The hardware ports were not changed, hence the disk configuration is the same.

cat /var/run/dmesg.boot | grep ada | grep Serial
 ada0: Serial Number Z4ZAFGR3
 ada1: Serial Number Y7948UZAS
 ada2: Serial Number S2R6NX0HC55312R
 ada3: Serial Number Z4ZAGMSH
 ada4: Serial Number 978KYZWAS
 ada5: Serial Number Y781ZU9AS
 ada6: Serial Number 978KYZ3AS
 ada7: Serial Number S2R6NX0HC55257B
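
Because the adaN numbers move around when drives change ports, it can help to look a disk up by serial number instead. A small sketch, assuming the "adaN: Serial Number ..." lines shown above; the function name disk_for_serial is hypothetical.

```shell
#!/bin/sh
# Sketch: print the device name that currently carries a given serial number.
# Reads dmesg.boot-style text on stdin; the serial is the first argument.
disk_for_serial() {
    awk -v sn="$1" '
        $2 == "Serial" && $3 == "Number" && $4 == sn {
            sub(/:$/, "", $1)   # strip the trailing colon from "ada5:"
            print $1
        }
    '
}

# Usage: disk_for_serial Y781ZU9AS < /var/run/dmesg.boot
```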
 
13 Aug 2019

Disk ada5 (Y781ZU9AS) failed twice with CRC errors and was eventually removed from the swap mirror and from the zpool. The machine took a few minutes to recover, and it did not recover completely (ps aux blocks and never returns?).

Below is a summary of the dmesg errors; note that /var/run/dmesg.boot was created on 19 Jun at 22:17. It looks like firefox (pid 87034) had a swap-in failure, but everything other than ps works fine (including lsof -p 87034).

[4736162] (ada5:ahcich5:0:0:0): Retrying command, 3 more tries remain
[4736162] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 10 36 7d 40 ab 00 00 00 00 00
[4736162] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736162] (ada5:ahcich5:0:0:0): Retrying command, 3 more tries remain
...
[4736184] (ada5:ahcich5:0:0:0): Error 5, Retries exhausted
[4736184] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 f0 1d 94 40 6b 00 00 00 00 00
[4736184] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736184] (ada5:ahcich5:0:0:0): Error 5, Retries exhausted
[4736184] GEOM_ELI: g_eli_read_done() failed (error=5) ada5p2.eli[READ(offset=906913091584, length=8192)]
...
[4736184] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 10 4d 7d 40 ab 00 00 01 00 00
[4736184] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736184] (ada5:ahcich5:0:0:0): Error 5, Retries exhausted
[4736184] GEOM_ELI: g_eli_read_done() failed (error=5) ada5p2.eli[READ(offset=1455902543872, length=983040)]
...
[4736246] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 20 73 7d 40 ab 00 00 01 00 00
[4736246] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736246] (ada5:ahcich5:0:0:0): Retrying command, 3 more tries remain
[4736277] ahcich5: Timeout on slot 19 port 0
[4736277] ahcich5: is 00000000 cs 00700000 ss 00780000 rs 00780000 tfd 50 serr 00080800 cmd 0000d317
[4736277] (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 40 e8 69 40 d7 00 00 00 00 00
[4736277] (ada5:ahcich5:0:0:0): CAM status: Command timeout
...
[4736279] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 60 8b 7d 40 ab 00 00 00 00 00
[4736279] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736279] (ada5:ahcich5:0:0:0): Retrying command, 3 more tries remain
[4736313] ahcich5: Timeout on slot 31 port 0
[4736313] ahcich5: is 00000000 cs 00000007 ss 80000007 rs 80000007 tfd 50 serr 00080000 cmd 0000df17
[4736313] (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 f8 6c 99 40 84 00 00 00 00 00
[4736313] (ada5:ahcich5:0:0:0): CAM status: Command timeout
...
[4736314] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 c0 a8 b2 7d 40 ab 00 00 00 00 00
[4736314] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736314] (ada5:ahcich5:0:0:0): Error 5, Retries exhausted
[4736314] GEOM_ELI: g_eli_read_done() failed (error=5) ada5p2.eli[READ(offset=1455915728896, length=1015808)]
[4736315] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 e0 48 00 00 40 02 00 00 00 00 00
[4736315] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
...
[4736413] (ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 a8 33 75 40 ab 00 00 00 00 00
[4736413] (ada5:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
[4736413] (ada5:ahcich5:0:0:0): Retrying command, 1 more tries remain
[4736445] ahcich5: Timeout on slot 21 port 0
[4736445] ahcich5: is 00000000 cs 07c00000 ss 07e00000 rs 07e00000 tfd 50 serr 00080000 cmd 0000d517
[4736445] (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 90 4c 26 40 b9 00 00 00 00 00
[4736445] (ada5:ahcich5:0:0:0): CAM status: Command timeout
[4736445] (ada5:ahcich5:0:0:0): Retrying command, 3 more tries remain
[4736462] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736476] ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[4736482] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736502] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736506] ahcich5: Timeout on slot 27 port 0
[4736506] ahcich5: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 80 serr 00080000 cmd 0000db17
[4736506] (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
[4736506] (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
[4736506] (aprobe0:ahcich5:0:0:0): Retrying command, 0 more tries remain
[4736522] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736537] ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[4736542] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736562] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736567] ahcich5: Timeout on slot 28 port 0
[4736567] ahcich5: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00080000 cmd 0000dc17
[4736567] (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
[4736567] (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
[4736567] (aprobe0:ahcich5:0:0:0): Error 5, Retries exhausted
[4736582] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736599] ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[4736602] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736603] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 11090150, size: 12288
[4736603] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9337145, size: 8192
[4736622] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736623] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 11090150, size: 12288
[4736623] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9337145, size: 8192
[4736629] ahcich5: Timeout on slot 29 port 0
[4736629] ahcich5: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00080000 cmd 0000dd17
[4736629] (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
[4736629] (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
[4736629] (aprobe0:ahcich5:0:0:0): Error 5, Retry was blocked
[4736629] ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
ada5:  s/n Y781ZU9AS detached
[4736629] GEOM_ELIGEOM_MIRROR: g_eli_read_done() failed (error=6):  ada5p2.eli[READ(offset=856656502784, length=12288)]
[4736629] GEOM_ELI: g_eli_write_done() failed (error=6) ada5p2.eli[WRITE(offset=1525293899776, length=28672)]
[4736629] GEOM_ELI: g_eli_read_done() failed (error=6) ada5p2.eli[READ(offset=1525513781248, length=4096)]
[4736629] GEOM_ELI: g_eli_write_done() failed (error=6) ada5p2.eli[WRITE(offset=1526799040512, length=20480)]
[4736629] GEOM_ELI: g_eli_write_done() failed (error=6) ada5p2.eli[WRITE(offset=1526799060992, length=16384)]
[4736629] Request failed (error=6). ada5p1[READ(offset=3885199360, length=8192)]GEOM_ELI
[4736629] GEOM_MIRROR: Device swapm2: provider ada5p1 disconnected.: g_eli_read_done() failed (error=6) ada5p2.eli[READ(offset=270336, length=8192)]
[4736629] GEOM_ELI: g_eli_read_done() failed (error=6) ada5p2.eli[READ(offset=1983218458624, length=8192)]
[4736629] GEOM_ELI: g_eli_read_done() failed (error=6) ada5p2.eli[READ(offset=1983218720768, length=8192)]
[4736629] 
[4736629] GEOM_ELI: g_eli_read_done() failed (error=6) mirror/swapm2.eli[READ(offset=11065507840, length=12288)]
[4736629] swap_pager: I/O error - pagein failed; blkno 11090150,size 12288, error 6
[4736629] vm_fault: pager read error, pid 87034 (firefox)
[4736629] GEOM_ELI: g_eli_write_done() failed (error=6) ada5p2.eli[WRITE(offset=856489246720, length=32768)]
[4736629] GEOM_ELI: g_eli_write_done() failed (error=6) ada5p2.eli[WRITE(offset=1525293928448, length=28672)]
[4736642] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736660] ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[4736662] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736682] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 9602933, size: 28672
[4736783] ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[4736784] GEOM_ELI: Device ada5p2.eli destroyed.
[4736784] GEOM_ELI: Detached ada5p2.eli on last close.
[4736784] (ada5:ahcich5:0:0:0): Periph destroyed
 

Disk details follow. Note that CRC errors now seem to be recorded on the disk as part of SMART (new version of FreeBSD?).

 cat /var/run/dmesg.boot | grep ada | grep Serial
  ada0: Serial Number Z4ZAFGR3
  ada1: Serial Number Y7948UZAS
  ada2: Serial Number S2R6NX0HC55312R
  ada3: Serial Number Z4ZAGMSH
  ada4: Serial Number 978KYZWAS
  ada5: Serial Number Y781ZU9AS
  ada6: Serial Number 978KYZ3AS
  ada7: Serial Number S2R6NX0HC55257B
 smartctl -a /dev/ada5 | grep Error | grep occurred
  Error 3 occurred at disk power-on lifetime: 10788 hours (449 days + 12 hours)
  Error 2 occurred at disk power-on lifetime: 10788 hours (449 days + 12 hours)
  Error 1 occurred at disk power-on lifetime: 5703 hours (237 days + 15 hours)
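
Besides the SMART error log, many drives keep a running interface CRC counter as SMART attribute 199 (UDMA_CRC_Error_Count). Whether these particular drives report it has not been checked here, so treat this as an assumption. The sketch below pulls the raw value out of the attribute table printed by smartctl -A; the function name udma_crc_count is hypothetical.

```shell
#!/bin/sh
# Sketch: extract the raw value of SMART attribute 199 from
# `smartctl -A` output read on stdin. Prints nothing if the
# drive does not report that attribute.
udma_crc_count() {
    # Attribute table columns: ID# NAME FLAG VALUE WORST THRESH
    # TYPE UPDATED WHEN_FAILED RAW_VALUE (raw value is field 10).
    awk '$1 == 199 { print $10 }'
}

# Usage: smartctl -A /dev/ada5 | udma_crc_count
```

A count that keeps rising after a cable swap would point away from the cable and towards the port or the drive.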
 
14 Jan 2019
Swapped ada2 (blue) with ada5 (red) cable
 cat /var/run/dmesg.boot | grep ada | grep Serial
  ada0: Serial Number Z4ZAFGR3
  ada1: Serial Number Y7948UZAS
  ada2: Serial Number S2R6NX0HC55312R
  ada3: Serial Number Z4ZAGMSH
  ada4: Serial Number 978KYZWAS
  ada5: Serial Number Y781ZU9AS
  ada6: Serial Number 978KYZ3AS
  ada7: Serial Number S2R6NX0HC55257B
 
13 Jan 2019
"TOSHIBA HDWD120, S/N:Y781ZU9AS, WWN:5-000039-fd1c0e80d, FW:MX4OACF0, 2.00 TB" Failed, many CRC errors and then it was essentially removed as a device.
It eventually re-appeared as a device.
zpool data was degraded (/dev/ada2p2.eli appeared as REMOVED) and gmirror mirror/swapm1 was degraded, but the computer survived.
 To fix mirror/swapm1
  gmirror forget swapm1
  gmirror insert swapm1 /dev/ada2p1
 To fix zfs data
  geli attach /dev/ada2p2
  zpool online data /dev/ada2p2.eli
 cat /var/run/dmesg.boot | grep ada | grep Serial
  ada0: Serial Number Z4ZAFGR3
  ada1: Serial Number Y7948UZAS
  ada2: Serial Number Y781ZU9AS
  ada3: Serial Number Z4ZAGMSH
  ada4: Serial Number 978KYZWAS
  ada5: Serial Number S2R6NX0HC55312R
  ada6: Serial Number 978KYZ3AS
  ada7: Serial Number S2R6NX0HC55257B
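
After a fix like the one above it is worth confirming that the pool is no longer degraded. zpool status -x prints a single "healthy" line when nothing is wrong, so its output can be tested directly. A minimal sketch; the function name pools_healthy is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only if `zpool status -x` output (read on stdin)
# is one of the two "healthy" one-liners.
pools_healthy() {
    grep -Eq "^(all pools are healthy|pool '[^']*' is healthy)$"
}

# Usage:
#   zpool status -x data | pools_healthy || echo "data needs attention"
```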
 
23 Nov 2018
Machine crashed; after a reset none of the non-SSD disks were found.
ada3 had CRC failures in the log, this device is the TOSHIBA HDWD120, S/N:Y7948UZAS, WWN:5000039fd1c1f2a9, FW:MX4OACF0.
Removed graphics card.
ada3 is the first upper SATA port when looking at the motherboard SATA ports.
I unplugged drive TOSHIBA HDWD120, S/N:Y7948UZAS, WWN:5000039fd1c1f2a9, FW:MX4OACF0 and plugged it into the ASMedia card (ada1).
I unplugged drive ST2000DM006-2DM1640, S/N:Z4ZAGMSHS, WWN:5000c500b005b253, FW:CC26 from ada1 and plugged it into ada3.
Note that ST2000DM006-2DM1640, S/N:Z4ZAGMSHS, WWN:5000c500b005b253, FW:CC26 had the better blue cable, so the blue cables are now plugged into ada2 and ada3.
9 Oct 2018
ada2 "TOSHIBA HDWD120, S/N:Y781ZU9AS, WWN:5-000039-fd1c0e80d, FW:MX4OACF0, 2.00 TB" failed; all OK after powering off and on.
ada2 is the first lower SATA port when looking at the motherboard SATA ports.
ada2 has failed in the past, even when it was a different hard disk; the SATA cable was replaced with a blue one.
On this occasion I changed the power cables: I swapped the ones used for the SSDs with those of the two HDDs, including ada2.
5 Aug 2018
Changed the swap devices to mirrors to prevent the OS from crashing when a disk fails (a disk failed yesterday, probably because of the heat).
Sep 16 2019