Monday, January 25, 2010

Some better practices for ZFS on FreeBSD

Rather than working on the clojure web framework, I've been dealing with broken hardware, including some system reinstalls. So let's talk about that.

ZFS has been available in FreeBSD for a while now, and as of the recently released 8.0 it is considered production quality. There are a number of write-ups on the web about how to set up various configurations of FreeBSD on ZFS: with a UFS boot, on a GPT mac drive, with no UFS at all, etc. Most seem to have one thing in common - they just duplicate a standard FreeBSD UFS file system configuration, without seeming to consider how ZFS has changed the game. That's not really the fault of the authors; I did much the same when I set up my first ZFS system a few years ago. But having those few years' experience - and seeing how the OpenSolaris folks set things up - indicates that there are better ways. I want to talk about that in hopes of getting others to spend more time thinking about this.

First, a note on terminology. Those of you familiar with FreeBSD on x86 can skip this. Unix has had "partitions" since before there was a DOS, and FreeBSD continues to call them that. What the DOS folks - and most everyone else - call partitions, FreeBSD calls "slices". A FreeBSD installation typically has one slice for FreeBSD on the disk, with multiple partitions - one per file system - in that slice. Slices are numbered starting at 1. Partitions are lettered, usually a-h. A typical FreeBSD partition name is ad0s1a, meaning drive number 0 on the ATA controller, slice 1, partition a.

Now a quick overview of how to set up FreeBSD with a ZFS root file system. Details are easy to find in google if you need them:

  1. Partition the drive, providing a swap and data partition. If you're using GPT for partitioning, you'll need a boot partition as well. Note that on OpenSolaris, giving ZFS a partition rather than the whole disk is a bad idea, as it disables write caching on the drive, because OpenSolaris has file systems that can't handle drive write caching. On FreeBSD, all the file systems handle drive write caching properly, so this isn't a problem.

  2. Create a zfs pool on that partition.

  3. Install the system onto an fs in that pool. Most people seem to like copying the files from a running system. I used the method documented in /usr/src/UPDATING to install to a fresh partition. For that to work cleanly, you'll want NO_FSCHG defined in /etc/make.conf, or -DNO_FSCHG on the command line, as FreeBSD's zfs doesn't do file attributes (flags). You'll also need to make sure that /boot/loader was built with LOADER_ZFS_SUPPORT defined.

  4. Install a boot loader. Just install the appropriate ones for your partitioning scheme.

  5. Config for zfs. You may want to set the bootfs property on your pool to the root file system to tell the boot loader where to find /boot/loader. You'll want to set zfs_load="YES" and vfs.root.mountfrom="zfs:data/root/fs" in /boot/loader.conf to tell the loader where the root file system is. Set zfs_enable="yes" in /etc/rc.conf so the system knows to turn on zfs. Finally, to prevent zfs from trying to mount your root file system a second time, set the mountpoint property to "legacy" on that file system.

  6. Last step: export and import the resulting pool, then copy /boot/zfs/zpool.cache to the same location on your new system.

Again, this is a quick overview. Google for details if you need them.
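
To make step 5 a bit more concrete, here is a minimal sketch of the two config file changes. The helper name write_zfs_boot_config is made up for illustration, and the root file system name data/root/fs is just the one used in step 5; it writes into a scratch directory rather than a live system, so treat it as a starting point, not a tested installer.

```shell
#!/bin/sh
# Sketch of the file edits in step 5.
# write_zfs_boot_config is a hypothetical helper, not a FreeBSD tool.
# Usage: write_zfs_boot_config <root-fs> <destdir>
write_zfs_boot_config() {
    rootfs=$1
    destdir=$2
    mkdir -p "$destdir/boot" "$destdir/etc"
    # Tell the loader to load the zfs module and where the root fs lives.
    {
        echo 'zfs_load="YES"'
        echo "vfs.root.mountfrom=\"zfs:$rootfs\""
    } >> "$destdir/boot/loader.conf"
    # Tell the rc system to turn on zfs at boot.
    echo 'zfs_enable="YES"' >> "$destdir/etc/rc.conf"
}

# Example: stage the files in a scratch directory and look at the result.
scratch=$(mktemp -d)
write_zfs_boot_config data/root/fs "$scratch"
cat "$scratch/boot/loader.conf" "$scratch/etc/rc.conf"
```

The two pool-side settings from the same step are not files, so they are not covered here; with the names above they would be "zpool set bootfs=data/root/fs data" and "zfs set mountpoint=legacy data/root/fs".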

Now to the point - how to set up your filesystems under ZFS, considering how ZFS has changed the game.

For instance, it's much more robust than the UFS file systems, so there's little point in creating partitions to protect things from corruption - though the UFS file systems have been solid enough for that for a while. Likewise, ZFS file systems aren't locked to a pool of blocks, so there's not much point creating file systems to allocate disk space to different purposes - though you can put limits on a specific file system if you want to. Those are the classic reasons to set up new file systems.

With ZFS, file systems are cheap and easy to create. The reason for creating one is that you want to have different properties on it than on its parent, or that you might want to have a group of file systems inherit some set of properties. You might also want to use the spiffy ZFS snapshot/clone tools on some file system without using them on all of them.

So, the first thing to notice is that it's easy to boot from a different root file system. Booting from a UFS file system needs the root on partition a - unless that's been changed and I didn't notice - meaning you need to create a new slice for it, and possibly put it on a new disk. With ZFS, you can boot from a new root file system by changing two settings - bootfs on the boot pool, and vfs.root.mountfrom in /boot/loader.conf (I'm sure one of those will vanish at some point) - and rebooting. So you could, in theory, have a couple of different versions of FreeBSD installed on the same pool, and boot between them.

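In fact, that looks like it's worth automating, as it's trivial to do, and will cut down the number of places you can typo the pool name. So here's a script to do it:
FS=${1:-$FS}    # root file system to boot from, e.g. data/root/8.0
POOL=$(echo $FS | sed 's;/.*;;')
# bootfs is a pool property, so it is set with zpool rather than zfs
$DEBUG zpool set bootfs=$FS $POOL
$DEBUG sed -i .old "/vfs.root.mountfrom/s;=.*;=\"zfs:$FS\";" /boot/loader.conf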
This is simple enough I didn't add options; to debug it, just run it as "DEBUG=echo newrootfs". Better yet, grab the tool cryx mentioned in the comments from
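
If you'd rather see what that sed edit does before letting it loose on /boot/loader.conf, something like the following sketch runs the same substitution against a scratch copy. The function names set_mountfrom and pool_of are mine, purely for illustration, and the sample loader.conf contents are an assumption about what yours looks like.

```shell
#!/bin/sh
# Sketch: the loader.conf edit, written to stdout instead of in place.
# set_mountfrom and pool_of are hypothetical helper names.
# Usage: set_mountfrom <root-fs> <loader.conf>
set_mountfrom() {
    # Replace everything after "=" on the vfs.root.mountfrom line only.
    sed "/^vfs\.root\.mountfrom=/s;=.*;=\"zfs:$1\";" "$2"
}

# The pool name is the part of the file system name before the first "/".
pool_of() {
    echo "${1%%/*}"
}

# Example against a scratch file, so the real /boot/loader.conf is untouched.
conf=$(mktemp)
printf '%s\n' 'zfs_load="YES"' 'vfs.root.mountfrom="zfs:data/root/8.0"' > "$conf"
set_mountfrom data/root/8-STABLE "$conf"
pool_of data/root/8-STABLE
```

Writing to stdout sidesteps the difference between how BSD and GNU sed spell the -i option, which matters if you try this out on something other than FreeBSD.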

You can do this with your root file system as the top file system in the pool, but that's going to get confusing eventually. Best to sort things out into their own group. So my suggestion is that the root file system be something like "data/root/8.0". An 8-STABLE root might be "data/root/8-STABLE".

You can even avoid having to do fresh installs on each new system - unless you want to - by leveraging the power of zfs. To wit:
zfs snapshot data/root/8.0@NOW
zfs clone data/root/8.0@NOW data/root/8-STABLE
mount -t zfs data/root/8-STABLE /altroot
cd /altroot/usr/src
make update
# proceed with build and install to /altroot, rather than modifying your running system.
You can now boot that, and try a new version of FreeBSD - without having to change your old file system. If it doesn't work, just reset the bootfs and vfs.root.mountfrom values, delete the file system, and try again later. Or ignore it until you feel like updating it and having another go.
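
If you do this more than once, the snapshot/clone/mount steps are easy to wrap up. What follows is only a sketch: clone_root is a name I made up, the snapshot naming scheme is an arbitrary choice, and it uses the same DEBUG=echo trick as the script earlier so you can see the commands without running them.

```shell
#!/bin/sh
# Sketch: clone an existing root file system and mount the clone.
# clone_root is a hypothetical helper; the snapshot name is an arbitrary
# convention (pre-<last component of the new fs name>).
# Usage: clone_root <old-root-fs> <new-root-fs> [mountpoint]
clone_root() {
    old=$1
    new=$2
    mnt=${3:-/altroot}
    snap="$old@pre-${new##*/}"
    # Each step only runs if the previous one succeeded.
    ${DEBUG:-} zfs snapshot "$snap" &&
    ${DEBUG:-} zfs clone "$snap" "$new" &&
    ${DEBUG:-} mount -t zfs "$new" "$mnt"
}

# Dry run: print the commands instead of executing them.
DEBUG=echo clone_root data/root/8.0 data/root/8-STABLE
```

Run without DEBUG set, it executes the commands for real, so check the dry-run output first.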

So, what things would we want to share between two different versions of FreeBSD this way? Not /usr - the userland is pretty tightly tied to the kernel in FreeBSD. /usr/local? Certainly - packages and anything you build will work on updates to the current release, and on the next one (or more) with the appropriate compatibility options. For that matter, /usr/ports probably wants to be its own file system, since the ports project explicitly supports multiple versions. /etc? Maybe. The system knobs can probably be shared, but having some apply and some not on each system will be confusing. On the other hand, the ports/package system writes to /etc/make.conf as well. If you're not running a mirrored root, you might consider making /etc a file system just to set "copies=2" to improve reliability. /home? Certainly it should be shared. /var? Most likely, as ports put info in there as well as the normal spool things. In fact, enough different things go on there that you may want it to be a file system so you can create subfilesystems with the appropriate properties. If you're exporting file systems, you can create an fs to hold them, and set the appropriate property on it to the most common value so you don't have to set it on its children. The file systems underneath that will then all be exported automatically.

That said, you might want to do step 2 above something like so:
zpool create data ad0s1
zfs create -p data/root/8.0
zfs create -o mountpoint=/home data/home
zfs create -o mountpoint=/usr/ports -o compression=on data/ports
zfs create -o compression=off data/ports/distfiles
zfs create -o mountpoint=/export -o sharenfs="rw=@"  data/export
zfs create -o mountpoint=/var data/var
zfs create -o copies=2 data/root/8.0/etc
You can also set the properties exec (allow - or not - execution of files on that fs) and setuid (honor - or not - the setuid bit) as appropriate for each of these file systems. /var, in particular, bears a closer look. You might consider turning off setuid and exec on it and most of its descendants. /var/log might be compressed unless you send all your logs elsewhere. /var/db/pkg is a candidate for compression. Some database packages install things in /var/db as well, in which case you might want to check the zfs best practices wiki for that database.
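
As a rough illustration of the /var suggestions, here is a sketch that assumes the data/var file system from the list above already exists and that this is a fresh install (creating file systems over a populated /var/log would hide what's already there). The function name setup_var is hypothetical, and whether exec=off is safe depends on what your ports keep in /var.

```shell
#!/bin/sh
# Sketch: properties for /var and a couple of compressed children.
# setup_var and run are hypothetical helpers; DEBUG=echo gives a dry run.
run() { ${DEBUG:-} "$@"; }

# Usage: setup_var <pool>
setup_var() {
    pool=$1
    # Children inherit these unless they override them.
    run zfs set setuid=off "$pool/var"
    run zfs set exec=off "$pool/var"
    # Logs and the package database are mostly text and compress well.
    run zfs create -o compression=on "$pool/var/log"
    # -p creates the intermediate var/db file system with default properties.
    run zfs create -p -o compression=on "$pool/var/db/pkg"
}

# Dry run: print the commands instead of executing them.
DEBUG=echo setup_var data
```

Because the properties are inherited, a child that does need to run programs can be created later with -o exec=on without touching the rest.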

One final note. I mentioned a mirrored root pool. I run most of my systems this way, and recommend it. The mirrors meant that, even though the hardware failures did cost me time, it was time spent repairing them, not time with the services in question unavailable. Setting up the mirror is simple. You'll need to install the boot loaders on both disks - that's step 4 above for the second disk. You also need to use a mirror vdev on the zpool create command. That changes the command to something like "zpool create data mirror ad0s1 ad1s1". The rest of the zfs commands will be the same; the only thing that knows that the pool is mirrored is zpool.
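
For completeness, a small sketch of the mirrored create. The wrapper mirror_pool is a made-up name; all it adds over the plain command is a check that you actually gave it two or more devices, plus the same DEBUG=echo dry run used earlier.

```shell
#!/bin/sh
# Sketch: create a pool on a single mirror vdev.
# mirror_pool is a hypothetical helper; DEBUG=echo gives a dry run.
# Usage: mirror_pool <pool> <device> <device> [device ...]
mirror_pool() {
    pool=$1
    shift
    # A mirror needs at least two devices.
    if [ "$#" -lt 2 ]; then
        echo "mirror_pool: need at least two devices" >&2
        return 1
    fi
    ${DEBUG:-} zpool create "$pool" mirror "$@"
}

# Dry run: print the command instead of executing it.
DEBUG=echo mirror_pool data ad0s1 ad1s1
```

Once the pool exists, "zpool status data" shows both halves of the mirror and whether either one is degraded.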