Assume the Opposite: RAID redo

I spent some time this weekend rebuilding the family file server, which is a Linux box. The last time I rebuilt it was several years ago, and at that time I figured I should set it up with (software) RAID 5 to avoid the hassle of having the recover from backup if a disk failed. This worked great. A few years ago a disk did fail, I bought a new one, plugged it in, it rebuilt and everything was good.

Similarly, a few years back at work I configured our new VMware ESXi box with an eight disk RAID 5 array (hardware RAID). Last year a disk failed, and on that machine I didn't even have to power it down. I yanked out the old disk, hot-plugged the new, and the machine didn't miss a beat.

So, RAID 5 is wonderful, right? Well, the time between the disk failure and disk replacement was somewhat stressful. In both cases, the disk couldn't be replaced immediately. The disk in my home server failed the night before I left on a trip, so I couldn't replace it for two weeks. And the new disk for the work machine had to be ordered and took some time to arrive. There was this gap where there was no redundancy. In both cases there were backups, but restoring from backup takes a lot more time than just plugging in a disk, and I realized that I really, really didn't want to waste my time setting up machines when simply providing a little more redundancy would have removed the need. “You can ask me for anything you like, except time.”

So the home server needed a bit of maintenance (for example, the root volume was low on space) so I figured while I was doing that I would reorganize the server and take some extra time to fix the redundancy problem, moving to RAID 6 on the four disks. RAID 6 would allow two disks to fail without loss of data. I'd lose some space but the extra redundancy would be worth it. Why RAID 6 over RAID 10? Well, RAID 6 provides better error checking at the expense of some speed.

This is what I did to prepare:

Took an LVM snapshot of the root partition and copied that snapshot as an image to an external drive. Why an image? Sometimes the permissions and ownership of files are important, I like to preserve that metadata for the root partition.
Copied the truly critical data on the root partition to another machine for extra redundancy. The existing backup process copies the data offsite, which is good for safety but not so good for quick recovery, so I wanted to make sure I didn't have to use the offsite backup.
Copied the contents of the other partitions to the external drive. The other partitions don't contain anything particularly critical so I didn't feel the need for redundancy there.
Zero'd out all the drives with dd if=/dev/zero of=/dev/sdX. Some sites suggested this was important, that the Linux software RAID drivers expected the disks to be zeroed. It seems unlikely but it didn't cost me anything to do it. There was an interesting result here, though: the first two drives ran at 9.1Mb/s, while the second two ran at 7.7Mb/s. If I recall correctly there are three identical drives and the one I replaced which is a different brand, so it isn't a drive issue but rather a controller issue: the secondary controller is slower.

Now that the machine was a blank slate, I set it up from scratch:

Start the Debian 6.0.3 installer from a USB key.
In the installer, partition each of the four disks with two partitions: one small 500M partition and one big partition with the rest of the space (~500G).
Set up RAID 1 across the small partitions (four-way mirroring).
Set up RAID 6 across the large partitions.
Format the RAID 1 volume as ext3, mounted as /boot.
Create an LVM volume group called “main” and added the RAID 6 volume to it.
Create a 5G logical volume for /.
Create a 5G logical volume for /home.
Create a 10G logical volume for swap.
Create a 20G logical volume for /tmp.
Create a 200G logical volume for /important.
Create a 200G logical volume for /ephemeral.
Tell the installer that this machine should be a DNS, file, and ssh server and let the installer run to completion.
Copy the important files to /important and the ephemeral files to /ephemeral.
Configure Samba and NFS.

So why this particular structure? Well, Linux can't boot from a software RAID 6 partition, so I needed to put /boot on something that Linux could boot from, therefore the RAID 1 partition. The separate logical volumes are primarily about different backup policies. The 5G size for / and /home is to limit growth (these volumes will be backed up as filesystem images) and 5G fits on a DVD for backup, in case I want to do that at some point. Swap of course needs to be inside the RAID array if you don't want the machine to crash when a disk fails: yes Linux knows how to efficiently stripe swap across multiple disks but a disk failure will cause corruption or a crash. The 20G volume for /tmp is so that there's lots of temp space and it's on a separate volume so backup processes can ignore it. The /important volume contains user files that are the important data and can be backed up on a file-by-file basis (as opposed to / which is backed up as an filesystem image). The /ephemeral volume contains files that don't need to be backed up. All filesystems have the noatime mount flag set, and they're all ext4 except for /boot which is ext3.

If you're counting you'll note that there is still a lot of empty space in that LVM volume group. There are several reasons for this:

Some empty space is required if I want to make an LVM snapshot, so I never want to use up all the space.
I frequently make additional temporary volumes for a variety of purposes.
If I need to expand any particular logical volume, there is room to do so.

Assume the Opposite

Monday, January 30, 2012

RAID redo

No comments:

Post a Comment