
Tyblog | When Disks Die: A ZFS Recovery Post-Mortem

I read a lot of tech success stories, but most of them revolve around building out or creating cool stuff. Last week, I had a catastrophic disk failure, and all I wanted was to find some recorded notes about disk recovery in Linux with ZFS. This is a record of my experience to illustrate the strength and maturity of ZFS on Linux and potentially help anyone in a similar situation in the future.


This is a companion discussion topic for the original entry at https://blog.tjll.net/when-disks-die-zfs/

In the future, give Ceph a try. I switched to it from a FreeNAS setup with RAIDZ2 and it’s been pretty great, and it’s especially good at handling hardware failure. A Ceph cluster is basically a bunch of JBOD drives across one or more machines, and Ceph strives to keep a certain number of copies of any data on that cluster. In the event of a drive failure Ceph says “oh dear, not enough copies of files X, Y and Z, I better make some more!” and makes further copies of any data that was on the dead drive. Your storage stays accessible and usable, just with reduced total capacity. If drives keep failing that’s fine, so long as you have enough space for the number of copies of your data that you specify (3 by default) and at least that many separate places to put them.
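To make the "make some more copies" idea concrete, here is a toy sketch in Python. It is not how Ceph actually places data (Ceph uses CRUSH and placement groups); the function name `recover`, the drive names, and the least-loaded-drive rule are all assumptions made up for illustration.

```python
def recover(placement, drives, failed, copies=3):
    """Toy model: after `failed` dies, re-copy under-replicated objects.

    placement maps an object name to the set of drives holding a copy.
    Returns a new mapping; the input is left untouched.
    """
    survivors = [d for d in drives if d != failed]
    # Drop the dead drive from every object's holder set.
    healed = {obj: set(holders) - {failed} for obj, holders in placement.items()}

    def load(drive):
        # Number of objects currently stored on this drive.
        return sum(drive in holders for holders in healed.values())

    for obj in sorted(healed):
        holders = healed[obj]
        while len(holders) < copies:
            candidates = [d for d in survivors if d not in holders]
            if not candidates:
                break  # fewer surviving drives than requested copies
            # Hypothetical policy: pick the least-loaded drive, ties by name.
            holders.add(min(candidates, key=lambda d: (load(d), d)))
    return healed


drives = ["a", "b", "c", "d"]
placement = {"X": {"a", "b", "c"}, "Y": {"b", "c", "d"}}
print(recover(placement, drives, failed="a"))
```

With four drives and three copies, losing one drive still leaves a home for every copy. Lose a second drive and the loop gives up early, which is the "at least as many places to put data" condition.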

The major downsides are that it’s not easy to run and that it’s hungry for drives: with the default of three copies, only about 33% of your raw space is usable.
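The 33% figure is just one divided by the number of copies. A minimal sketch of the arithmetic follows, with an erasure-coded layout (k data chunks plus m coding chunks) included for comparison; the function names and the 12 TB example are assumptions for illustration, not anything from Ceph itself.

```python
from fractions import Fraction


def replicated_efficiency(copies):
    """Usable fraction of raw space when every object is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return Fraction(1, copies)


def erasure_efficiency(data_chunks, coding_chunks):
    """Usable fraction of raw space for a k+m erasure-coded layout."""
    return Fraction(data_chunks, data_chunks + coding_chunks)


def usable_capacity(raw_tb, efficiency):
    """Usable terabytes given raw terabytes and an efficiency fraction."""
    return float(raw_tb * efficiency)


# Hypothetical example: 12 TB of raw disk.
print(usable_capacity(12, replicated_efficiency(3)))  # three full copies
print(usable_capacity(12, erasure_efficiency(2, 1)))  # 2 data + 1 coding chunk
```

Erasure coding gets more usable space out of the same drives, at the cost of more work on reads, writes, and recovery.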

I’m actually experimenting with distributed filesystems on the ODROID HC2, since I reached the same conclusion you did about availability. Right now I’ve got a working 3-node cluster on GlusterFS, but I haven’t tried Ceph yet. I shied away from Ceph initially because it’s somewhat more complex than GlusterFS. How has management for Ceph been for you?

I’ve also been trying to figure out the best way to configure an initial volume/dataset for easy expansion later, but both Ceph and Gluster look like they need a new volume to be created in order to change parameters or add disks. For example, my Gluster 2-1 disperse volume can’t add a disk dynamically; I’d have to create a new volume to add another storage brick. Ceph erasure pools also seem to need recreating to change parameters. What type of pool are you using?

Ooh, nice. I tried to get Ceph up on my Orange Pi Plus 2E, but that failed in the end: 32-bit ARM packages aren’t available for recent releases and I couldn’t get it to compile either. 64-bit ARM does have packages, so I’ll pick up something A53-based eventually and give it another try.

Management for my home deployment has been mostly hands-off, but I was previously operations staff at a datacentre where we ran an 80-node, ~500-spindle cluster, so I know that it ties into Graphite nicely, among other things. There are Puppet and Ansible modules if config management is your thing. For home I’ve got a private GitHub repo with the keys and config stored in it.

I don’t know erasure pools well; they’re comparatively new and, as far as I know, less flexible. I run replicated pools at home, so adding a disk is easy enough (ceph-deploy!) and then I generally don’t have to do anything unless my placement groups (PGs) per disk are a bit low, which shows up as non-optimal placement and uneven disk space utilisation. In that case I bump up the PG count on the pool. If I ever need to shrink, you can’t do that to an existing pool, but you can make a new one with a lower PG count and then copy your data across to it.
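For picking a PG count, a rule of thumb that has long been cited in the Ceph documentation is roughly 100 PGs per disk (OSD), divided by the number of copies and rounded up to a power of two. A minimal sketch of that arithmetic is below; the function name and the default of 100 are assumptions, and newer Ceph releases ship a PG autoscaler that may make hand calculation unnecessary.

```python
def suggested_pg_count(osd_count, replicas=3, target_pgs_per_osd=100):
    """Rule-of-thumb total PG count: (OSDs * target) / replicas,
    rounded up to the next power of two."""
    if osd_count < 1 or replicas < 1:
        raise ValueError("osd_count and replicas must be at least 1")
    raw = osd_count * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


# Hypothetical home cluster sizes.
for osds in (3, 6, 12):
    print(osds, suggested_pg_count(osds))
```

Treat the result as a starting point across all pools on the cluster rather than a per-pool number.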

Oh yeah, and you can make this a bit easier by using CephFS on top. Support for it has been in the Linux kernel for a while now, so on my other VMs (Plex, torrents, etc.) there’s one line in fstab and a key file. That’s it, connected to storage.