A while ago we noticed log messages on one of the servers indicating that a hard drive was going bad. The logs showed errors reading certain sectors of one of the drives in a software RAID-1 volume. Here’s how we diagnosed the faulty drive and replaced it.

Background

The server has several Linux software RAID (mdraid) volumes, and the drives are connected to the server’s onboard Intel AHCI SATA controller. The Intel chipset and the Linux AHCI driver both provide hot-swap capability, and both are required for hot-swapping to work. The drives themselves are mounted in hot-swap carriers that connect to a backplane in the server, which completes the setup and allows us to remove and add drives when needed. Naturally, one has to be careful not to remove drives that are in use. With the chipset, driver and physical parts in place, you also need software to manage the hard drives: taking them offline before they are removed, and setting up new drives to replace the old ones in the RAID array(s). In Linux, this is handled by the mdraid management tool “mdadm”.

Diagnosis

Judging from the logged errors, the faulty drive was /dev/sde, so I examined it using the smartctl utility from the smartmontools package and started an offline test of the drive:

    smartctl --test=offline /dev/sde

When you launch that test, smartctl prints an estimate of when the test will be done. The offline test runs a short scan of the drive and updates the SMART values, which can then be read out using:

    smartctl -A /dev/sde

Here’s what was printed out:

# smartctl -A /dev/sde
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       27
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6688
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   115   103   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   192   192   000    Old_age   Always       -       1458
198 Offline_Uncorrectable   0x0030   200   192   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   182   000    Old_age   Offline      -       0

Looking at this, attributes 197 and 198 caught my eye. There are a lot of current pending sectors (1458) and a non-zero number (5) for attribute 198, Offline Uncorrectable. That attribute denotes the number of defective sectors that were found during the offline scan. The pending sector count doesn’t mean that there really are 1458 broken sectors; it counts “suspected” sectors that for some reason could not be read. However, the pending sector count, together with the five uncorrectable sectors and the read errors in the kernel log, suggested that this drive needed to be replaced as soon as possible.
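As a rough way to spot the same warning signs without reading the whole table, the attributes discussed above can be filtered out of the smartctl output. This is only a sketch: the function name flag_smart_attrs is made up for illustration, and it assumes the ten-column attribute layout shown in the listing above, with the raw value in the last column.

```shell
#!/bin/sh
# Print the ID, name and raw value of SMART attributes 5, 197 and 198
# when their raw value is non-zero. Reads "smartctl -A" style output on stdin.
# Assumes the ten-column table layout shown above (raw value in column 10).
flag_smart_attrs() {
    awk '$1 ~ /^(5|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Usage on a real system (as root):
#   smartctl -A /dev/sde | flag_smart_attrs
```

An empty result only means that these three counters are zero. It does not prove the drive is healthy, so the kernel log is still worth checking.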

The drive was part of a two-disk RAID-1 volume, so the next step was to remove it from the RAID volume and from the server itself, return it under RMA, and put a new disk to use in its place. Here is how we did it. Some details first:

  • The RAID-1 volume was /dev/md4
  • It consists of two drives, /dev/sde and /dev/sdf
  • Each drive has one large partition, marked as type fd, aka “Linux RAID autodetect”
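To double-check which array a given partition belongs to, the member lists in /proc/mdstat can be searched. The following is a minimal sketch, assuming the usual mdstat layout where each array line looks like “md4 : active raid1 sde1[2] sdf1[1]”. The function name array_for_member is hypothetical.

```shell
#!/bin/sh
# Print the name of every md array that lists the given partition as a member.
# Reads /proc/mdstat style text on stdin; $1 is the partition name without /dev/.
array_for_member() {
    awk -v part="$1" '
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\[[0-9]+\]/) {      # member fields look like sde1[2]
                    name = $i
                    sub(/\[.*/, "", name)     # drop the [N] suffix and any (F) flag
                    if (name == part) print $1
                }
            }
        }'
}

# Usage on a real system:
#   array_for_member sde1 < /proc/mdstat
```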

How we did it

  1. Make sure you identify exactly which drive is broken and which RAID volume it belongs to. Our server has six drives in it and the broken one was the fifth drive, hence /dev/sde. To identify the correct drive, check the kernel logs, examine the mdraid status using “cat /proc/mdstat” and make good use of the mdadm RAID management utility. In our case, it was the md4 RAID volume and the /dev/sde drive.
  2. Buy a replacement drive of at least the same size and arrange with your vendor to get the faulty drive returned and exchanged.
  3. Save the partition table from the faulty drive, assuming it’s still readable. This can be done with the sfdisk partition tool:
    sfdisk -d /dev/sde > sde-partitions.dat
  4. Use mdadm to mark the faulty drive (actually its partition) as failed in the affected RAID volume:
    mdadm /dev/md4 --fail /dev/sde1
  5. Next, remove the faulty partition from the RAID volume:
    mdadm /dev/md4 --remove /dev/sde1
  6. Bring the new disk to the server and swap out the faulty drive. Make sure you remove the correct drive. In our case, it was the fifth drive, so we removed drive carrier #5. Have a monitor connected to the server when you do this: you should see a kernel message on the console showing that the OS has detected a drive being removed. When you plug in the new drive, another message indicates that a new drive has been added to the machine. On our server, the new drive was detected as “sde”, just like the old drive.
  7. Log on to the server again and partition the new drive. Make use of the old partition table dump created above:
    sfdisk /dev/sde < sde-partitions.dat
  8. Add the new disk to the RAID volume, in our case md4:
    mdadm /dev/md4 --add /dev/sde1
  9. Assuming everything went well, you should be able to see the RAID volume being rebuilt:
    # cat /proc/mdstat
    Personalities : [raid1]
    md4 : active raid1 sde1[2] sdf1[1]
    976759936 blocks [2/1] [_U]
    [==>..................]  recovery = 12.7% (124386752/976759936) finish=133.3min speed=106564K/sec
  10. Done!
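The command sequence from the steps above can be collected into a small dry-run helper that only prints the commands, so they can be reviewed before anything is run by hand. This is a sketch, not a tested replacement procedure: the function names print_replace_plan and rebuild_progress are made up for illustration, and the device names in the usage lines are the ones from our server.

```shell
#!/bin/sh
# Print, without running them, the commands for replacing one member of an
# md array. Arguments: array (e.g. /dev/md4), disk (e.g. /dev/sde),
# partition (e.g. /dev/sde1).
print_replace_plan() {
    array=$1
    disk=$2
    part=$3
    dump="$(basename "$disk")-partitions.dat"
    echo "sfdisk -d $disk > $dump"
    echo "mdadm $array --fail $part"
    echo "mdadm $array --remove $part"
    echo "# physically swap the drive now"
    echo "sfdisk $disk < $dump"
    echo "mdadm $array --add $part"
}

# Print the recovery percentage from /proc/mdstat style text on stdin,
# or nothing if no "recovery =" line is present.
rebuild_progress() {
    sed -n 's/.*recovery = *\([0-9.]*%\).*/\1/p'
}

# Usage:
#   print_replace_plan /dev/md4 /dev/sde /dev/sde1
#   rebuild_progress < /proc/mdstat
```

Printing the plan instead of executing it is deliberate: the physical swap in the middle cannot be scripted, and a wrong device name in any of these commands can take down a healthy array.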