Experimenting with ZFS Failures

While waiting for all the drives to arrive, I built a 3-disk RAIDZ1 configuration to run tests on. Each of the 3 drives has a capacity of 3TB. RAIDZ1 means one disk’s worth of capacity is used for parity (the parity is spread across all three disks rather than living on a single one), so instead of 9TB of storage space there’s only 6TB. For that capacity loss we gain resilience to any one of the drives failing. If a drive were to fail, we could simply replace it and ZFS would carry on as if nothing had happened! Let’s try some experiments and see how that works.
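The capacity arithmetic above can be written as a tiny helper. This is only a rough sketch: the function name is my own, and it ignores ZFS metadata overhead and the TB versus TiB difference, so real pools report somewhat less.

```python
def raidz_usable_tb(disk_count, disk_tb, parity=1):
    """Approximate usable capacity of a single RAID-Z vdev.

    Assumption: every disk is the same size and ZFS overhead is ignored.
    """
    if parity not in (1, 2, 3):
        raise ValueError("RAID-Z parity level must be 1, 2, or 3")
    if disk_count <= parity:
        raise ValueError("a RAID-Z vdev needs more disks than its parity level")
    # One disk's worth of capacity is given up per parity level.
    return (disk_count - parity) * disk_tb


if __name__ == "__main__":
    # The test pool from this post: three 3TB drives in RAIDZ1.
    print(raidz_usable_tb(3, 3.0, parity=1))  # prints 6.0
```

The same helper works for RAIDZ2 and RAIDZ3 by changing `parity`.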

Working Configuration

These screenshots from NAS4Free show the three disks, configured into one Virtual Device (vdev), inside one pool; the pool has one dataset.

Three disks:

 

Inside one pool:

 

The pool has one “virtual device”, a RAID-Z1.

 

The pool contains one dataset, which is backed by all three disks.

To begin with, the pool is ‘ONLINE’ and all three disks are working fine.

Unplugging cables

Let’s simulate a drive going bad; what happens if we unplug the SATA cable from one of the drives?

 

The pool is now DEGRADED and one of the drives is marked UNAVAIL. Uh oh! ZFS tells us to use zpool online to bring the drive back.
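If you would rather have a script notice a DEGRADED pool than spot it in the WebGUI, here is a minimal sketch. It assumes the script runs on the NAS itself with `zpool` on the PATH, and it uses `zpool list -H -o name,health`, which prints one pool per line without headers. The function names and the pool names in the demo are made up.

```python
import subprocess


def parse_pool_health(listing):
    """Turn `zpool list -H -o name,health` output into {pool: health}."""
    health = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or unexpected lines
            name, state = fields
            health[name] = state
    return health


def pools_needing_attention(listing):
    """Names of pools whose health is anything other than ONLINE."""
    return [name for name, state in parse_pool_health(listing).items()
            if state != "ONLINE"]


def read_pool_listing():
    """Run zpool on the NAS itself; requires the zpool binary on PATH."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "tank" and "backup" are made-up pool names for the demonstration.
    sample = "tank\tDEGRADED\nbackup\tONLINE\n"
    print(pools_needing_attention(sample))  # ['tank']
```

On a real box you would pass `read_pool_listing()` to `pools_needing_attention` instead of the sample string.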

Even in this degraded state, with only 2 out of 3 drives running, I’m able to access my data. In fact our home folder (~) is located on this dataset and is operating perfectly. (If we were to lose a second drive, our data would be inaccessible.)
Let’s re-attach the 3rd drive’s SATA cable and tell ZFS to online the drive.
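For reference, the command ZFS is asking for has the form `zpool online <pool> <device>`. Here is a hedged sketch of wrapping it in a script; the pool name `tank` and device name `ada2` are placeholders, not the names on my box, and the wrapper defaults to a dry run so nothing is executed by accident.

```python
import subprocess


def zpool_online_command(pool, device):
    """Build the argument list for `zpool online <pool> <device>`."""
    if not pool or not device:
        raise ValueError("pool and device must both be non-empty")
    return ["zpool", "online", pool, device]


def bring_online(pool, device, dry_run=True):
    """Return the command line; only execute it when dry_run is False."""
    command = zpool_online_command(pool, device)
    if not dry_run:
        # Needs root privileges and the zpool binary on PATH.
        subprocess.run(command, check=True)
    return " ".join(command)


if __name__ == "__main__":
    # Placeholder names; substitute the ones `zpool status` reports.
    print(bring_online("tank", "ada2"))  # zpool online tank ada2
```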

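It all works great. I wondered why it says it “resilvered 68K” when there are several MB of data in the pool. The likely reason is that ZFS only resilvers what changed while the drive was missing, so 68K is probably just the small amount of data and metadata written during that time.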

Moving cables

How about we unplug a SATA cable from the motherboard and connect it to a different SATA port?

ZFS didn’t blink an eye; we didn’t even have to online the drive.

Let’s pretend the entire motherboard needed replacing, and we forgot which drives were plugged in where. I shut down, unplugged all the SATA cables, and reconnected them to different ports. Power on and what happens?

Switching connections around made no difference at all! If that had been a real motherboard replacement, I would not have had to worry about where the drives were connected.

New boot drive

Things get more interesting here. I’ve been booting this box off a 16GB USB flash drive. What if this drive went bad?

The boot drive holds NAS4Free’s copy of the ZFS configuration; losing it means a fresh NAS4Free installation has to rediscover the pool from the data disks themselves. NAS4Free does recommend backing up your configuration, but let’s say you forgot to…

I installed NAS4Free on a new 8GB flash drive. I told it to configure the network card (option 2 on the main menu) and then visited the displayed IP address from my Mac’s browser.

Uh oh. Nothing here in ZFS-land! No pool, no disks!

Here’s the configuration screen, with no pool but a useful button labeled “Import on-disk ZFS config”. Later we can see what ZFS commands the button runs.

After clicking that – look what happened!

The ZFS pool, vdev, and dataset are back! While the Pools and Datasets web pages still show nothing, we can fix that, too – read on.
At this point I enabled SSH access so I could have a poke around and try to access some data. I enabled root SSH access, and was able to navigate to the ZFS dataset directory!
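With SSH access, the same recovery can also be done by hand. I have not reproduced here exactly what the “Import on-disk ZFS config” button runs, so treat this as a sketch of the usual manual route: `zpool import` with no arguments scans for importable pools, `zpool import <pool>` imports one of them, `-a` imports everything found, and `-f` may be needed if the pool was never exported from the old installation. The helper name and the pool name `tank` are made up.

```python
def import_commands(pool=None, force=False):
    """Return two argument lists: scan for pools, then import.

    pool=None means "import every pool found" (zpool import -a).
    """
    scan = ["zpool", "import"]
    do_import = ["zpool", "import"]
    if force:
        # -f overrides the "pool may be in use by another system" check.
        do_import.append("-f")
    do_import.append(pool if pool else "-a")
    return [scan, do_import]


if __name__ == "__main__":
    # "tank" is a placeholder pool name.
    for argv in import_commands("tank", force=True):
        print(" ".join(argv))
```

Run the scan first and read its output before importing with `-f`, since forcing an import of a pool that really is in use elsewhere can damage it.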

zpool status shows the pool as ONLINE.

It’s educational to run zpool history, which shows all the commands that were used to create the pool, and also the command that was executed when we imported it after booting from the fresh USB drive:
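The pool history (read with `zpool history`) is plain text: a “History for” header followed by one timestamped command per line. Here is a small sketch for picking the import events out of it. The sample text in the demo is hand-written for illustration and is not the output from my box; the pool, dataset, and device names in it are invented.

```python
def parse_zpool_history(text):
    """Return (timestamp, command) pairs from `zpool history` output."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("History for"):
            continue  # skip blank lines and the per-pool header
        timestamp, _, command = line.partition(" ")
        entries.append((timestamp, command))
    return entries


def import_events(text):
    """Only the entries that record a pool import."""
    return [(stamp, command) for stamp, command in parse_zpool_history(text)
            if command.startswith("zpool import")]


if __name__ == "__main__":
    # Hand-written sample, NOT real output from this NAS.
    sample = (
        "History for 'tank':\n"
        "2013-01-05.10:15:02 zpool create tank raidz1 ada0 ada1 ada2\n"
        "2013-01-05.10:16:40 zfs create tank/data\n"
        "2013-01-12.09:03:27 zpool import -f tank\n"
    )
    for stamp, command in import_events(sample):
        print(stamp, command)
```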

Though NAS4Free’s WebGUI showed no pools or datasets, I was able to fix that using the ZFS -> Configuration page, which has a “Synchronize” button. After using that, the rest of the WebGUI shows the pool and dataset correctly. This also fixed the Disks -> Management page, which had been showing no disks. As far as I can tell, that puts everything ZFS-related back the way it was. You may have had Samba shares and other settings too, so remember to back up your NAS4Free configuration!

Summary

I simulated the loss of a hard drive, a motherboard, and the boot USB drive. Every one of these simulated failures turned out to be recoverable! No data was lost at any step, which is great news for anyone with data they want to keep safe.

Note that even ZFS is not a substitute for backups, preferably off site, e.g. at a family member’s house or in a bank vault. If an errant script or an accidental manual deletion removes a file, ZFS will faithfully apply that deletion across the whole array. ZFS snapshots could help here, but even so, your box is still vulnerable to flood, lightning strike, power surge, or brownout, any of which could damage one or all of the hardware components.
So far though I’m very pleased that ZFS, FreeBSD, and NAS4Free have lived up to their claims and provided a safe haven for my data!

I’ll be adding three more drives and setting up RAIDZ2. This will allow data access even with two drives gone. It is often argued that RAIDZ1 is not as safe an option as you might think: once one drive goes bad, the odds of a second following it are said to shoot up, and a second failure before the replacement disk has finished resilvering would take the whole pool with it.

References:

http://www.nas4free.org

General overview and advice for ZFS, FreeNAS, and configuration: https://forums.freenas.org/index.php?threads/slideshow-explaining-vdev-zpool-zil-and-l2arc-for-noobs.7775/