Replacing a drive

I’ll replace one of the 3TB drives with a 4TB drive. This will allow for a size upgrade eventually, once all the drives are replaced with 4TB drives. It also means I’m rotating out older drives for new ones. I added a sticker to the drive to show the date, so future me can see which drives are oldest.
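
Here is a rough sketch of why the size upgrade only arrives "eventually". It assumes the usual RAIDZ rule that every member of a vdev counts as the smallest drive in it, and it assumes the six-drive RAIDZ2 pool shown in this post started out with all 3TB drives. The function name is made up for illustration, and the numbers ignore filesystem overhead.

```python
def raidz_usable_tb(drive_sizes_tb, parity):
    """Rough usable capacity: every member counts as the smallest drive."""
    data_drives = len(drive_sizes_tb) - parity
    return data_drives * min(drive_sizes_tb)


# Assumed starting point: six 3TB drives in one RAIDZ2 vdev.
drives = [3, 3, 3, 3, 3, 3]
capacities = []
for slot in range(len(drives)):
    drives[slot] = 4  # swap one 3TB drive for a 4TB drive
    capacities.append(raidz_usable_tb(drives, parity=2))

print(capacities)  # [12, 12, 12, 12, 12, 16]
```

The capacity only jumps once the last 3TB drive is gone, which is why the first few swaps are really about rotating out old hardware.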

With the old drive removed, zpool status shows the pool as “degraded” with a drive missing.


NAME                     STATE     READ WRITE CKSUM
tank                     DEGRADED     0     0     0
  raidz2-0               DEGRADED     0     0     0
    ada0                 ONLINE       0     0     0
    ada1                 ONLINE       0     0     0
    ada2                 ONLINE       0     0     0
    ada3                 ONLINE       0     0     0
    ada5                 ONLINE       0     0     0
    9875896178717210589  UNAVAIL      0     0     0  was /dev/ada6
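
If you wanted a script to spot this condition, a minimal sketch could scan the device table for anything that isn't ONLINE. This assumes the table is laid out exactly as pasted above (pool at column 0, vdev indented two spaces, drives indented four); the function name is hypothetical and this is not how NAS4Free itself does it.

```python
STATUS = """\
NAME                     STATE     READ WRITE CKSUM
tank                     DEGRADED     0     0     0
  raidz2-0               DEGRADED     0     0     0
    ada0                 ONLINE       0     0     0
    ada1                 ONLINE       0     0     0
    ada2                 ONLINE       0     0     0
    ada3                 ONLINE       0     0     0
    ada5                 ONLINE       0     0     0
    9875896178717210589  UNAVAIL      0     0     0  was /dev/ada6
"""


def unhealthy_devices(status_text):
    """Return (name, state, note) for each drive row that is not ONLINE."""
    problems = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        indent = len(line) - len(line.lstrip())
        if indent < 4:
            continue  # pool and vdev rows, not individual drives
        name, state = fields[0], fields[1]
        if state != "ONLINE":
            # Anything after the three error counters is a trailing note.
            problems.append((name, state, " ".join(fields[5:])))
    return problems


print(unhealthy_devices(STATUS))
```

For the output above this reports the one missing drive, along with the "was /dev/ada6" note that tells you which device it used to be.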

Plugging in the new drive makes no change here. Off to NAS4Free’s “Disks -> Management” screen. It shows a warning saying the physical devices have changed, and says to import disks with the “clear configuration” option enabled. Do that, then Apply Changes. The disk is now listed normally, but with the Filesystem marked “unknown or unformatted”.

Now to the “ZFS -> Tools” screen. Select “replace a device”. Select the pool, tap next. Select “ada6” and tap next. It ran “zpool replace 'tank' '/dev/ada6'”, and now the status shows it resilvering the new drive.


  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jan 19 16:29:06 2018
        144G scanned out of 5.07T at 538M/s, 2h40m to go
        24.0G resilvered, 2.77% done
config:

NAME                       STATE     READ WRITE CKSUM
tank                       DEGRADED     0     0     0
  raidz2-0                 DEGRADED     0     0     0
    ada0                   ONLINE       0     0     0
    ada1                   ONLINE       0     0     0
    ada2                   ONLINE       0     0     0
    ada3                   ONLINE       0     0     0
    ada5                   ONLINE       0     0     0
    replacing-5            UNAVAIL      0     0     0
      9875896178717210589  UNAVAIL      0     0     0  was /dev/ada6/old
      ada6                 ONLINE       0     0     0  (resilvering)
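
As a sanity check on the "2h40m to go" estimate, here is a small sketch that redoes the arithmetic from the scan line. It assumes the K/M/G/T suffixes are binary units (powers of 1024) and that the rate stays constant; the helper names are made up for this example.

```python
import re

# Size suffixes expressed in MiB (binary units assumed).
UNIT_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def to_mib(text):
    """Convert a size like '5.07T' or '144G' to MiB."""
    return float(text[:-1]) * UNIT_MIB[text[-1]]


def resilver_progress(scan_line):
    """Return (seconds_remaining, percent_scanned) from a scan line."""
    match = re.search(
        r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s",
        scan_line,
    )
    if match is None:
        raise ValueError("unrecognised scan line")
    scanned, total, rate = (to_mib(group) for group in match.groups())
    return (total - scanned) / rate, 100 * scanned / total


line = "144G scanned out of 5.07T at 538M/s, 2h40m to go"
eta_seconds, percent = resilver_progress(line)
hours, remainder = divmod(int(eta_seconds), 3600)
print(f"{hours}h{remainder // 60}m to go, {percent:.2f}% scanned")
```

Both figures agree with what zpool status reported: about 2h40m remaining and 2.77% done.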

Experimenting with ZFS Failures

While waiting for all the drives to arrive, I built a 3-disk RAIDZ1 configuration to perform tests on. Each of the 3 drives has a capacity of 3TB. RAIDZ1 means one of the disks is used for redundancy; instead of 9TB of storage space, there’s only 6TB. For that capacity loss we gain resilience to any one of the drives failing. If a drive were to fail, we could simply replace it and ZFS would continue like nothing had happened! Let’s try some experiments and see how that works.
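
The capacity trade-off is simple arithmetic. As a rough sketch (ignoring filesystem overhead, and with a function name invented for this example), usable space is the number of drives minus the parity level, times the drive size:

```python
def raidz_capacity_tb(disk_count, disk_tb, parity):
    """Return (raw_tb, usable_tb) for equal-sized drives in one RAIDZ vdev."""
    if parity >= disk_count:
        raise ValueError("need more drives than parity level")
    raw = disk_count * disk_tb
    usable = (disk_count - parity) * disk_tb
    return raw, usable


print(raidz_capacity_tb(3, 3, parity=1))  # (9, 6): this test setup
print(raidz_capacity_tb(6, 3, parity=2))  # (18, 12): six drives as RAIDZ2
```

The second line shows what the same sum gives for six 3TB drives in RAIDZ2: two drives' worth of space buys resilience to any two failures.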

Working Configuration

These shots from NAS4Free show the three disks, configured into one Virtual Device (vdev), inside one pool; the pool has one dataset.

Three disks:

 

Inside one pool:

 

The pool has one “virtual device”, a RAID-Z1.

 

The disks are bound into one dataset in the pool.

To begin with, the pool is ‘ONLINE’ and all three disks are working fine.

Unplugging cables

Let’s simulate a drive going bad; what happens if we unplug the SATA cable from one of the drives?

 

The pool is now DEGRADED and one of the drives is marked UNAVAIL. Uh oh! ZFS tells us to use zpool online to bring the drive back.

Even in this degraded state, with only 2 of the 3 drives running, I’m able to access my data. In fact, our home folder (~) is located on this dataset and is operating perfectly. (If we were to lose a second drive, our data would be inaccessible.)
Let’s re-attach the 3rd drive’s SATA cable and tell ZFS to online the drive.

It all works great. I wondered why it says it “resilvered 68K” when there are several MB of data in the pool. ZFS only needs to resilver what changed while the drive was offline, so the 68K is perhaps just some metadata.

Moving cables

How about we unplug a SATA cable from the motherboard and connect it to a different SATA port?

ZFS didn’t blink an eye; we didn’t even have to online the drive.

Let’s pretend the entire motherboard needed replacing, and we forgot which drives were plugged in where. I shut down, unplugged all the SATA connectors, and reconnected them to different ports. Power on and what happens?

Switching connections around made no difference at all! If that had been a real motherboard replacement, I would not have had to worry about where the drives were connected.

New boot drive

Things get more interesting here. I’ve been booting this box off a 16GB USB flash drive. What if this drive went bad?

The boot drive contains the ZFS configuration; losing that means the fresh NAS4Free installation will need to discover what state ZFS and the drives were in. NAS4Free does recommend backing up your configuration, but let’s say you forgot to…

I installed NAS4Free on a new 8GB flash drive. I told it to configure the network card (option 2 on the main menu) and then visited the displayed IP address from my Mac’s browser.

Uh oh. Nothing here in ZFS-land! No pool, no disks!

Here’s the configuration screen, with no pool but a useful button labeled “Import on-disk ZFS config”. Later we can see what ZFS commands the button runs.

After clicking that – look what happened!

The ZFS pool, vdev, and dataset are back! While the Pools and Datasets web pages still show nothing, we can fix that, too – read on.
At this point I enabled root SSH access so I could have a poke around and try to access some data, and I was able to navigate to the ZFS dataset directory!

zpool status shows the pool as ONLINE.

It’s educational to run zpool history, which shows all the commands that were used to create the pool and also what command was executed when we imported after booting from the fresh USB drive:

Though NAS4Free’s WebGUI showed no pools or datasets, I was able to fix that using the ZFS -> Configuration page, which has a “Synchronize” button. After using that, the rest of the WebGUI shows the pool and dataset correctly. This also fixed the Disks -> Management page, which had been showing no disks. As far as I can tell, that puts everything back the way it was, at least as far as ZFS goes. You may have had Samba shares and other settings too, so remember to back up your NAS4Free configuration!

Summary

I simulated the loss of a hard drive, the loss of a motherboard, and the loss of the boot USB drive. All of these simulated failures turned out to be recoverable! No data was lost at any step, which is great news for anyone with data they want to keep safe.

Note that even ZFS is not a substitute for backups, preferably off site, e.g. at a family member’s house or in a bank vault. If an errant script or an accidental manual deletion removes a file, ZFS will faithfully replicate that deletion across its RAID. ZFS snapshots could help here, but even so, your box is still vulnerable to a flood, lightning strike, power surge, or brownout, any of which could damage one or all of the hardware components.
So far though I’m very pleased that ZFS, FreeBSD, and NAS4Free have lived up to their claims and provided a safe haven for my data!

I’ll be adding three more drives and setting up RAIDZ2. This will allow data access even with two drives gone. Research has shown that RAIDZ1 is not as safe an option as you might think – once one drive goes bad the odds of a second following it shoot up, and may not give you time to resilver a replacement disk.

References:

http://www.nas4free.org

General overview and advice for ZFS, FreeNAS, and configuration: https://forums.freenas.org/index.php?threads/slideshow-explaining-vdev-zpool-zil-and-l2arc-for-noobs.7775/