Replacing A Failed USB Disk In A Raspberry Pi-Based RAID Mirror

My previous post went into how to create a simple but functional NAS with a Raspberry Pi 4B and two USB-attached SATA disks. In the two weeks or so that it’s been running, the NAS I built has performed very well and has been reliable (hopefully I won’t regret typing that).

But what to do WHEN a disk fails? Disks fail – even that fancy new enterprise-grade SSD that cost an arm and a leg will fail at some point. The good news is that if you’re using mdadm to provide some kind of redundancy with your disks, things should still be working if a disk fails. The bad news is that unless you’ve got a RAIDset that can specifically tolerate more than one failure (like RAID 6), you need to replace that failed disk ASAP.

I’m confident that I’ll be able to recover from losing a disk in my shiny new NAS, but I’m not one to tempt fate, so I built another RAIDset with a spare Pi and two 64GB SanDisk USB sticks to play around with instead. They’re slower than the SATA disks, so things like the speed at which the RAIDset syncs back up will be different from my previous post.

So here’s the setup – it’s a Raspberry Pi 4B (2GB) with two 64GB USB flash drives in a RAID 1 (mirror) configuration.

Here it is, working properly, with the output of cat /proc/mdstat:
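It should look something like this (the block count below is representative for 64GB sticks, and your device names may differ):

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[1] sda[0]
      62454784 blocks super 1.2 [2/2] [UU]

unused devices: <none>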

and checking to see if it’s mounted using df:
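Assuming the array is mounted at /mnt/raid (your mount point and usage numbers will differ):

df -h /mnt/raid
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0         58G  1.1G   54G   2% /mnt/raid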

To simulate a disk failure, I removed one of the USB sticks while everything was running. Here’s the output of dmesg showing the disconnection and that mdadm is soldiering on with only one disk:
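It looks roughly like this (the timestamps and port numbers here are illustrative):

dmesg
[ 1522.312345] usb 2-2: USB disconnect, device number 3
[ 1522.345678] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[ 1525.101234] md/raid1:md0: Disk failure on sdb, disabling device.
[ 1525.101240] md/raid1:md0: Operation continuing on 1 devices.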

Looking at the list of USB-connected devices only shows one SanDisk device:
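lsusb is the quick way to check – with one stick unplugged, you’ll see something like this (the device IDs are representative):

lsusb
Bus 002 Device 002: ID 0781:5581 SanDisk Corp. Ultra
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub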

And now the output of cat /proc/mdstat is showing a failed disk (note the “U_”):
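Something like this – only one member left, and the mirror running on one of its two devices:

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda[0]
      62454784 blocks super 1.2 [2/1] [U_]

unused devices: <none>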

The good news is that yes, /dev/md0 is still mounted and usable, even though it’s in a degraded state.

I reformatted the USB stick on my Windows PC so the data that was on it was lost, then reconnected it to the Pi:
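Running lsusb again (same representative IDs as before):

lsusb
Bus 002 Device 004: ID 0781:5581 SanDisk Corp. Ultra
Bus 002 Device 002: ID 0781:5581 SanDisk Corp. Ultra
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub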

There are two SanDisk devices again.

And here’s the output of dmesg again – you can see the time difference between the failure and when the “new” disk was connected:
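Roughly like this – the timestamps, device paths, and capacity line are placeholders:

dmesg
[ 2218.123456] usb 2-2: new SuperSpeed USB device number 4 using xhci_hcd
[ 2218.156789] usb-storage 2-2:1.0: USB Mass Storage device detected
[ 2219.201234] sd 1:0:0:0: [sdb] 121307136 512-byte logical blocks: (62.1 GB/57.8 GiB)
[ 2219.251234] sd 1:0:0:0: [sdb] Attached SCSI removable disk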

Note that the messages for both the failure and the newly connected USB stick show the device as sdb. It could just as easily have been sda, so make sure you check to see which one failed – and, more importantly, which one didn’t!
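If you want something more definitive than dmesg, mdadm --detail lists exactly which device is still an active member of the array (output trimmed, and illustrative as before):

sudo mdadm --detail /dev/md0
/dev/md0:
        Raid Level : raid1
      Raid Devices : 2
     Total Devices : 1
             State : clean, degraded

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       -       0        0        1      removed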

So now there are two disks connected again, but only one of them has the RAIDset data on it. In this case, sda is the one with the data that needs to be mirrored over. Again, it could’ve been sdb. For one last check, get the output of cat /proc/mdstat again:
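It should still show only the surviving member:

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda[0]
      62454784 blocks super 1.2 [2/1] [U_]

unused devices: <none>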

Notice it says sda – that means that sda has the data we want to mirror over to the other disk, which, as the previous output of dmesg showed, is sdb.

If you are replacing a failed RAID member, the replacement must be the same size or larger than the failed member. That goes for any RAID level and for both whole-disk and partition mirroring. Keep in mind that not all disks with the same stated capacity actually have the same number of usable bytes, so do a bit of research before going out and spending your money on a new disk that won’t fit your current array!
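If you want to compare exact capacities rather than trusting the label, blockdev reports the size in bytes – the replacement’s number needs to be at least as big as the old member’s. The figures below are made up, but they show how two “64GB” sticks can differ:

sudo blockdev --getsize64 /dev/sda
62109253632
sudo blockdev --getsize64 /dev/sdb
62075699200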

Now that the disk is reconnected and showing up, copy the partition layout from the existing RAIDset disk to the new disk with the following command:

sudo sfdisk -d /dev/sdX | sudo sfdisk /dev/sdY

In this case, the existing disk is /dev/sda and the new disk is /dev/sdb:
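So the command becomes:

sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb

If sfdisk complains that the source disk doesn’t contain a recognized partition table, you’re mirroring whole disks and there’s simply no layout to copy – which ties into the next point.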

This step isn’t needed if you’re mirroring disks (as opposed to mirroring partitions), but it’s a good idea to do it anyway – if there’s an error here, you certainly don’t want to go any further until you’ve fixed the problem.

If sfdisk worked and didn’t give you any errors, then you’re ready to add the new disk to the RAIDset with the following command:

sudo mdadm --manage /dev/md0 --add /dev/sdY

Where sdY is the new disk – in my case, sdb:
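For my setup, that’s:

sudo mdadm --manage /dev/md0 --add /dev/sdb

If it works, mdadm prints a one-line confirmation along the lines of “mdadm: added /dev/sdb”.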

If you didn’t get any errors, run cat /proc/mdstat again and you’ll see your RAIDset is rebuilding:
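Something like this – the progress, finish estimate, and speed are illustrative (and on USB sticks, slow):

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[2] sda[0]
      62454784 blocks super 1.2 [2/1] [U_]
      [====>................]  recovery = 23.4% (14617600/62454784) finish=48.9min speed=16302K/sec

unused devices: <none>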

Notice how it now shows that there are two active devices in md0, sdb[2] and sda[0]? That’s a good sign. Keep checking every once in a while to make sure the recovery is progressing.

Once it’s done, the RAIDset should be showing as all “U” again:
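Like this:

cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb[2] sda[0]
      62454784 blocks super 1.2 [2/2] [UU]

unused devices: <none>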

If you see that, everything’s rebuilt and your RAIDset is ready to handle another disk failure.

Hopefully you never need to use this information, but if you do, I hope it helps!
