ZFS

ZFS is an advanced filesystem, originally developed and released by Sun Microsystems for the Solaris operating system. Nowadays ZFS usually refers to the fork OpenZFS, which ports the original implementation to other operating systems, including Linux, while continuing the development of Solaris ZFS. This article treats ZFS as synonymous with OpenZFS.

ZFS is feature-rich. Its features include ARC (an improved page cache algorithm), deduplication, pooled storage, snapshots, replication, data integrity verification with automatic repair (scrubbing), RAID-Z and much more.

Concepts

Unlike regular filesystems, where a filesystem resides on a single block device, ZFS stores data in storage pools. Pools are composed of vdevs (virtual devices), which are themselves composed of block devices. Pools always write data to the vdev with the highest percentage of free space (so data is naturally striped across vdevs). Vdevs, on the other hand, can use more complex configurations such as RAIDZ and mirror. In the simplest configuration, a pool can be created on a single vdev composed of a single partition, which behaves similarly to a regular filesystem.

Once a pool is created, storage resources can be allocated from it. These resources are grouped into units called datasets. There are four types of datasets:

file system: File systems are basically a directory tree and can be mounted like regular filesystems into the system namespace.
volume (zvol): A volume represented as a block device
snapshot: A snapshot of a file system or a volume
bookmark: A snapshot that does not hold data, used for incremental replication.

Datasets are identified by a unique path with the following syntax:

pool(/segment)+((#|@)bookmark/snapshot name)?

where # is used for bookmarks and @ is used for snapshots.
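
As a rough illustration of the naming syntax above, the following bash sketch classifies a name purely by its shape. zfs_name_kind is a hypothetical helper (not part of the ZFS tools), it uses a simplified character set, and it assumes that a bare pool name is also accepted as the pool's root dataset. It never calls zfs, so it cannot tell a file system from a volume.

```shell
#!/usr/bin/env bash
# Classify a dataset name by its separator: '@' marks a snapshot, '#' a bookmark.
# Hypothetical helper: inspects the string only and never queries ZFS.
zfs_name_kind() {
  local name=$1
  # Path part: pool name plus optional /segment components (simplified character set).
  local path='[A-Za-z][A-Za-z0-9_.:-]*(/[A-Za-z0-9_.:-]+)*'
  local leaf='[A-Za-z0-9_.:-]+'
  local re_snap="^${path}@${leaf}\$"
  local re_mark="^${path}#${leaf}\$"
  local re_data="^${path}\$"

  if [[ $name =~ $re_snap ]]; then
    echo "snapshot"
  elif [[ $name =~ $re_mark ]]; then
    echo "bookmark"
  elif [[ $name =~ $re_data ]]; then
    echo "file system or volume"
  else
    echo "invalid"
    return 1
  fi
}
```

For instance, zfs_name_kind rpool/home/user@monday reports a snapshot, while rpool/home#mark1 is reported as a bookmark.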

Note: This is a short summary of zpoolconcepts(7) and zfsconcepts(7). It is highly recommended to read them to get familiar with the concepts as well as technical terms not covered here.

Complications of being out-of-tree

Due to complex legal reasons, the Linux kernel maintainers refuse to accept ZFS into the Linux kernel. As such, ZFS is developed as an out-of-tree module. A consequence of this arrangement is that kernel updates will, from time to time, break the kernel API that ZFS uses. Whenever this happens, the ZFS developers have to adapt their code to the new API. This means there will be periods during which ZFS does not work on the latest mainline kernel release.

Tip: Since the linux package closely tracks the updates of the latest stable branch, it would be a better idea to use linux-lts instead if you do not want to pin the linux package to an unsupported version.

Installation

As an out-of-tree module, ZFS comes in two types of packages: a binary kernel module built against a specific kernel version, or the module source packaged as a DKMS module that is automatically rebuilt whenever the kernel is updated.

In addition to the kernel modules, users also need to install userspace tools such as zpool(8) and zfs(8). These userspace tools are usually packaged as a single package called zfs-utils*.

All of the kernel module packages mentioned below correctly specify their dependency on the corresponding zfs-utils* package, so you only need to satisfy that dependency when installing.

Binary kernel module

Comparison of packages
Package	Repository	ZFS release type	Targeting kernel	Binary package	Notes
zfs-linux-lts^AUR	AUR	Stable	linux-lts	No	Highly recommended to use devtools when building newer versions of the ZFS package as otherwise you'd have to uninstall your current ZFS package to upgrade the kernel.
zfs-linux^AUR	AUR	Stable	linux	No
zfs-linux-lts-poscat	archlinuxcn	Stable	linux-lts	Yes	Automatically updated and rebuilt against new kernel versions by a build bot every few hours, additionally provides a mkinitcpio hook for systemd based initrds
zfs-linux-lts-rc-poscat	archlinuxcn	Release candidate	linux-lts	Yes
zfs-linux	archzfs	Stable	linux	Yes
zfs-linux-lts	archzfs	Stable	linux-lts	Yes

DKMS

zfs-dkms^AUR: stable release
zfs-dkms-staging-git^AUR: stable release with cherry-picked patches for compatibility with the latest Linux kernel

Root on ZFS

See Install Arch Linux on ZFS.

Configuration

Importing pools at startup

ZFS provides systemd services for automatically importing pools and targets for other units to determine the state of ZFS initialization. These are:

zfs.target which is reached when all ZFS services complete
zfs-import.target which is reached when ZFS pools finish importing
zfs-volumes.target which is reached when all zvols appear under /dev
zfs-import-scan.service which imports pools by scanning for devices using libblkid
zfs-import-cache.service which imports pools by consulting a zpool.cache file
zfs-volume-wait.service which waits for all zvols to be available.

Choose either zfs-import-scan.service or zfs-import-cache.service, and enable it together with the rest.
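
A minimal sketch of that choice, assuming the unit names listed above: zfs_boot_units is a hypothetical helper that only prints the units to enable for the chosen import method, so that the list can be reviewed before it is handed to systemctl.

```shell
#!/usr/bin/env bash
# Print the units to enable for one import method ("scan" or "cache").
# Hypothetical helper: it only prints names and does not call systemctl itself.
zfs_boot_units() {
  local method=$1
  case $method in
    scan|cache) ;;
    *) echo "usage: zfs_boot_units scan|cache" >&2; return 1 ;;
  esac
  printf '%s\n' \
    "zfs-import-${method}.service" \
    zfs-import.target \
    zfs-volume-wait.service \
    zfs-volumes.target \
    zfs.target
}

# Example (as root): systemctl enable $(zfs_boot_units scan)
```

Only one of the two import services ever appears in the output, which matches the rule that exactly one of them should be enabled.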

zfs-import-scan

zfs-import-scan.service uses zpool import's default logic of scanning devices using blkid, which means no zpool.cache file is needed. This is the recommended method since zpool.cache is deprecated.

It is important to make sure none of your pools are imported with the cachefile option enabled, since zfs-import-scan.service will not start if zpool.cache exists and is not empty. You can achieve this by enabling the zfs_autoimport_disable option of the zfs module. You should also either remove the existing zpool.cache or set cachefile to none for all imported pools when booting.

Warning: If root is on ZFS, the best method remains zfs-import-cache, because at each boot the pool containing root will change its cachefile property back to the default and the zpool.cache file in /etc/zfs will be recreated, making zfs-import-scan unusable. A workaround would be to recreate an empty zpool.cache file with touch and make it immutable with chattr +i, but then, once the initramfs is regenerated with mkinitcpio, the system would be unbootable because of a corrupted zpool.cache.

zfs-import-cache

zfs-import-cache.service uses zpool import -c <zpool.cache> when importing pools which reads device paths from the zpool.cache.

Using this method means you need to be careful about the device paths used when creating ZFS pools, since some device paths may change between boots or after hardware modifications, which would lead to a stale cache and failed pool imports. See Persistent block device naming on how to choose device paths that are persistent.
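
To find a persistent name for a device you currently know only by its kernel name, a sketch like the following may help. by_id_names is a hypothetical helper; the optional second argument (the directory to search) is an assumption added so the function can be tried on a scratch directory, and it defaults to /dev/disk/by-id.

```shell
#!/usr/bin/env bash
# List the aliases in a by-id style directory that resolve to the given device node.
# Hypothetical helper; the directory argument defaults to /dev/disk/by-id.
by_id_names() {
  local target dir=${2:-/dev/disk/by-id} link
  target=$(readlink -f -- "$1") || return 1
  for link in "$dir"/*; do
    # Skip anything that is not a symlink (including an unmatched glob).
    [ -L "$link" ] || continue
    if [ "$(readlink -f -- "$link")" = "$target" ]; then
      printf '%s\n' "${link##*/}"
    fi
  done
}

# Example: by_id_names /dev/sda
```

The names it prints can then be used in place of /dev/sdX when creating or importing a pool.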

Automatically mounting filesystems

The services zfs-import-scan.service and zfs-import-cache.service will import the pools without mounting any filesystems. To also mount filesystems on startup there are two methods, depending on whether your filesystems are configured using mountpoint=legacy or not. If your filesystems are configured with a mix of legacy and non-legacy mounts, you will need to use both methods.

zfs-mount-generator

If your filesystems use non-legacy mount, it is recommended to use zfs-mount-generator, which is a systemd.generator(7) that generates systemd mount units for all filesystems of imported zfs pools with the property canmount=on to mount filesystems on boot. By default though zfs-mount-generator won't do anything since it requires zfs list caches. You need to:

Enable and start zfs-zed.service.
Create the /etc/zfs/zfs-list.cache directory.
```
# mkdir -p /etc/zfs/zfs-list.cache
```
Enable tracking for individual pools by creating empty files named after your pools in /etc/zfs/zfs-list.cache/. Zed will only update the list of filesystems if the file for the pool already exists and is writable.
```
# touch /etc/zfs/zfs-list.cache/pool-name
```
Check the contents of /etc/zfs/zfs-list.cache/pool-name. If it is empty, zed has not detected any events and therefore has not refreshed the cache file. Change a monitored property somewhere on the pool to emit a new ZFS event which will trigger the ZEDLET and refresh the cache file. The first example below assumes relatime is enabled on the pool zroot and that it is inherited by the child filesystem fs1. Trigger an event by changing a property and then reverting it:
```
# zfs set relatime=off zroot/fs1
# zfs inherit relatime zroot/fs1
```
If inherit is not available (e.g. you are working with the root pool) simply make the change on the pool itself and revert it manually rather than using inherit:
```
# zfs set relatime=off zroot
# zfs set relatime=on zroot
```
You can check the file from step 3 to make sure it is populated. The pool will now be mounted on boot. Reboot to see it mounted, or restart zfs-zed.service.
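
The directory and cache-file steps above can be sketched as two small bash functions. Both names (init_list_cache, empty_list_caches) are hypothetical, and the cache directory is passed as an argument so the sketch can be tried on a scratch directory before pointing it at /etc/zfs/zfs-list.cache.

```shell
#!/usr/bin/env bash
# Create the list-cache directory and one empty cache file per pool.
# Hypothetical helper: usage is init_list_cache <dir> <pool>...
init_list_cache() {
  local dir=$1 pool
  shift
  mkdir -p -- "$dir" || return 1
  for pool in "$@"; do
    # Never truncate a cache that zed has already filled; only create missing files.
    [ -e "$dir/$pool" ] || : > "$dir/$pool"
  done
}

# Print the pools whose cache file is still empty (zed has not written to it yet).
empty_list_caches() {
  local dir=$1 f
  for f in "$dir"/*; do
    [ -f "$f" ] && [ ! -s "$f" ] && printf '%s\n' "${f##*/}"
  done
  return 0
}

# Example (as root):
#   init_list_cache /etc/zfs/zfs-list.cache zroot
#   empty_list_caches /etc/zfs/zfs-list.cache
```

Any pool printed by empty_list_caches still needs a ZFS event, such as the property change shown above, before its mounts are generated.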

fstab

If your filesystem uses legacy mount, then you should specify the mountpoint in the fstab file. The device field should be the name (full path) of your filesystem and the dump and fsck fields should be left as 0.

Creating a hostid file

While not strictly necessary, it is usually a good idea to create a /etc/hostid file:

# zgenhostid $(hostid)

Storage pools

Experimenting with ZFS

Users wishing to experiment with ZFS with no possibility of real data loss can refer to ZFS/Virtual disks.

Creating ZFS pools

Tip: One may want to read #The ashift property first, as it may be advisable to set ashift on pool creation.

To create a ZFS pool:

# zpool create -R <root> -o <poolopts> -O <dsetprops> <pool> <vdevs>

Where each vdev is either a device or has the format:

<vdev type> <device> ... <device>

-R: Mounts all filesystems under this directory, useful for not interfering with existing system
-o: Specifies the properties of the pool, can be used multiple times. Some properties such as ashift cannot be changed once created. (technically ashift is per vdev but to configure it per vdev you'll need to use zpool add)
-O: Specifies the properties of the root dataset of the pool, can be used multiple times. Some properties such as normalization cannot be changed once created.
pool: This is the name of the pool.
vdev type: See zpoolconcepts(7) for a list of supported vdev types.
device: A block device, can be either a full path or the file name part of the path

Note: Depending on the method you chose to use to mount the pools, you might need to be conscious about the device path you use to create pools.

For example, to create a pool on a single device:

# zpool create -R /mnt pool /dev/sda

To create a pool with a single raidz1 vdev:

# zpool create -R /mnt pool \
               raidz1 \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1

To create a pool with two mirror vdevs:

# zpool create -R /mnt pool \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KDGY \
                  ata-ST3000DM001-9YN166_S1F0JKRR \
               mirror \
                  ata-ST3000DM001-9YN166_S1F0KBP8 \
                  ata-ST3000DM001-9YN166_S1F0JTM1

The ashift property

ashift is an immutable per-vdev property that determines the (logical) sector size, which is 2^ashift bytes. For optimal performance, the logical sector size should always be greater than or equal to the physical sector size of the disks.

By default, zpool create should be able to correctly determine the physical sector size of the device. This should suffice in a single disk setup.

However, if you are using (or intend to use) vdev setups that can replace failed disks, such as mirror or raidzX, it is usually a good idea to always use ashift=12, since using a 4 KiB logical sector size on a disk with 512 B physical sectors will not incur a performance penalty, while the converse will (unless your devices are among the few rare SSDs that use an 8 KiB sector size).

Tip: Use

$ lsblk --filter 'TYPE=="DISK"' -o NAME,PHY-SEC

to check the physical sector sizes of your disks.
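
Since the sector size is 2^ashift bytes, the ashift for a given sector size is simply its base-2 logarithm. The following sketch does that conversion; ashift_for is a hypothetical helper name, not part of the ZFS tooling.

```shell
#!/usr/bin/env bash
# Convert a sector size in bytes to the matching ashift value (log2 of the size).
# Hypothetical helper name; not part of the ZFS tooling.
ashift_for() {
  local size=$1 shift_bits=0
  # Reject anything that is not a positive power of two.
  if ! [[ $size =~ ^[0-9]+$ ]] || (( size == 0 || (size & (size - 1)) != 0 )); then
    echo "not a power of two: $size" >&2
    return 1
  fi
  # Halve the size until it reaches 1, counting the steps.
  while (( size > 1 )); do
    (( size >>= 1 ))
    (( shift_bits += 1 ))
  done
  echo "$shift_bits"
}

# Example: ashift_for 4096   (prints 12)
```

The result can be passed on pool creation, for example as -o ashift="$(ashift_for 4096)".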

Additionally, if you are using NVMe drives, it might be possible to format them using a more performant LBA format than the factory default (see nvme-format(1)).

GRUB-compatible pool creation

By default, zpool create enables all features on a pool. If /boot resides on ZFS and you use GRUB, you must only enable features supported by GRUB; otherwise GRUB will not be able to read the pool. ZFS includes compatibility files (see /usr/share/zfs/compatibility.d) to assist in creating pools with specific feature sets, of which grub2 is an option.

You can create a pool with only the compatible features enabled:

# zpool create -o compatibility=grub2 $POOL_NAME $VDEVS

Verifying pool status

If the command is successful, there will be no output. Using the mount command will show that the pool is mounted. Using zpool status will show that the pool has been created:

# zpool status

  pool: bigdata
 state: ONLINE
 scan: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        bigdata                                    ONLINE       0     0     0
          -0                                       ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KDGY-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JKRR-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KBP8-part1  ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JTM1-part1  ONLINE       0     0     0

errors: No known data errors

Importing a pool created by id

According to zpool-import(8), the following flags are relevant when importing an existing pool:

-c cachefile Reads configuration from the given cachefile that was created with the cachefile pool property. This cachefile is used instead of searching for devices.

-d dir/device Uses device or searches for devices or files in dir. The -d option can be specified multiple times.

Warning: Avoid using kernel names, e.g. /dev/sda, for operations that save a new cache file. If kernel names are stored in the cache file and they later change, the system will have problems importing the pools automatically.

Note: Use ID names inside /dev/disk/by-id/ to import an existing pool when the cachefile cannot be used or if you want to import it with other properties or a different mount point.

To import an existing pool in order to run a chroot on it, consult Export/Import dei pool (Italiano).

Note: Unfortunately, there are changes in place to this page and the Install Arch Linux on ZFS page, and several key commands have been removed from the guides.

Destroying a storage pool

To destroy a whole pool:

# zpool destroy <pool>

And now when checking the status:

# zpool status

no pools available

Exporting a storage pool

To export a pool:

# zpool export <pool>

Extending an existing pool

A device (a partition or a disk) can be added to an existing zpool:

# zpool add <pool> <device-id>

Attaching a device to (create) a mirror

A device (a partition or a disk) can be attached alongside an existing device to be its mirror (similar to RAID 1):

# zpool attach <pool> <device-id|mirror> <new-device-id>

You can attach the new device to an already existing mirror vdev (e.g. to upgrade from a 2-device to a 3-device mirror) or attach it to a single device to create a new mirror vdev.

Renaming a pool

Renaming a pool that is already created is accomplished in 2 steps:

# zpool export oldname
# zpool import oldname newname

Setting a different mount point

The mount point for a given zpool can be moved at will with one command:

# zfs set mountpoint=/foo/bar poolname

Upgrading pools

When upgrading ZFS to a new version, new features might be available. However, for compatibility reasons, ZFS will not automatically enable new features on previously created pools. Instead, they need to be manually turned on for each pool.

To check for upgrade availability:

$ zpool upgrade

This system supports ZFS pool feature flags.

All pools are formatted using feature flags.

Every feature flags pool has all supported and requested features enabled.

An example output with upgradable pools:

This system supports ZFS pool feature flags.

All pools are formatted using feature flags.


Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(7) for details.

Note that the pool 'compatibility' feature can be used to inhibit
feature upgrades.

POOL  FEATURE

rpool
      redaction_list_spill
      raidz_expansion
      fast_dedup
      longname
      large_microzap

To upgrade a single pool:

# zpool upgrade <pool>

To upgrade all pools:

# zpool upgrade -a

Creating datasets

Users can optionally create a dataset under the zpool as opposed to manually creating directories under the zpool. Datasets allow for an increased level of control (quotas for example) in addition to snapshots. To be able to create and mount a dataset, a directory of the same name must not pre-exist in the zpool. To create a dataset, use:

# zfs create <nameofzpool>/<nameofdataset>

It is then possible to apply ZFS-specific attributes to the dataset. For example, one could assign a quota to a child dataset (properties apply to datasets, so the final path component below must itself be a dataset rather than a plain directory):

# zfs set quota=20G <nameofzpool>/<nameofdataset>/<directory>

To see all the commands available in ZFS, see zfs(8) or zpool(8).

Native encryption

ZFS offers the following supported encryption options: aes-128-ccm, aes-192-ccm, aes-256-ccm, aes-128-gcm, aes-192-gcm and aes-256-gcm. When encryption is set to on, aes-256-gcm will be used. See zfs-change-key(8) for a description of the native encryption, including limitations.

The following keyformats are supported: passphrase, raw, hex.

One can also specify/increase the default number of PBKDF2 iterations when using passphrase with -o pbkdf2iters=<n>, although it may increase the decryption time.

Tip:

To import a pool with keys, one needs to specify the -l flag; without this flag, encrypted datasets will be left unavailable until the keys are loaded. See #Importing a pool created by id.
Native ZFS encryption has been made available in the stable 0.8.0 release or newer. Previously it was only available in development versions provided by packages like zfs-linux-git^AUR, zfs-dkms-git^AUR or other development builds. Users who were only using the development versions for the native encryption, may now switch to the stable releases if they wish.
The default encryption suite was changed from aes-256-ccm to aes-256-gcm in the 0.8.4 release.

To create a dataset including native encryption with a passphrase, use:

# zfs create -o encryption=on -o keyformat=passphrase <nameofzpool>/<nameofdataset>

To use a key instead of using a passphrase:

# dd if=/dev/random of=/path/to/key bs=32 count=1 iflag=fullblock
# zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/key <nameofzpool>/<nameofdataset>

An easy way to generate a key in human-readable form (keyformat=hex):

# od -Anone -x -N 32 -w64 /dev/random | tr -d '[:blank:]' > /path/to/hex.key
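
As a portable variation on the command above, the sketch below prints 32 random bytes as 64 hexadecimal characters using only POSIX od options. gen_hex_key is a hypothetical helper name, and reading from /dev/urandom instead of /dev/random is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# Print a 256-bit random key as 64 lowercase hex characters followed by a newline.
# Hypothetical helper; reads /dev/urandom (an assumption, /dev/random works the same way).
gen_hex_key() {
  # -An: no offsets, -tx1: one byte per hex pair, -N32: read exactly 32 bytes.
  od -An -tx1 -N32 /dev/urandom | tr -d ' \n'
  echo
}

# Example: (umask 077; gen_hex_key > /path/to/hex.key)
```

Running it inside a subshell with umask 077, as in the example comment, keeps the key file readable by its owner only.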

To verify the key location:

# zfs get keylocation <nameofzpool>/<nameofdataset>

To change the key location:

# zfs set keylocation=file:///path/to/key <nameofzpool>/<nameofdataset>

You can also manually load the keys by using one of the following commands:

# zfs load-key <nameofzpool>/<nameofdataset> # load key for a specific dataset
# zfs load-key -a # load all keys
# zfs load-key -r <nameofzpool>/<nameofdataset> # load keys for a dataset and all of its descendants

To mount the created encrypted dataset:

# zfs mount <nameofzpool>/<nameofdataset>

Unlock/Mount at boot time: systemd

It is possible to automatically unlock a dataset at boot time by using a systemd unit. For example, create the following service to unlock a specific dataset:

/etc/systemd/system/zfs-load-key@.service

[Unit]
Description=Load %I encryption keys
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target
Requires=zfs-import.target
DefaultDependencies=no

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/bash -c 'until (systemd-ask-password "Encrypted ZFS password for %I" --no-tty | zfs load-key %I); do echo "Try again!"; done'

[Install]
WantedBy=zfs-mount.service

Enable/start the service for each encrypted dataset (e.g. zfs-load-key@pool0-dataset0.service). Note the use of -, which is an escaped / in systemd unit definitions. See systemd-escape(1) for more info.
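
To illustrate the escaping, here is a simplified bash rendition of the rules described in systemd-escape(1). zfs_unit_instance is a hypothetical helper that assumes plain ASCII dataset names; on a real system, systemd-escape itself (for example with --template=zfs-load-key@.service) remains the authoritative tool.

```shell
#!/usr/bin/env bash
# Escape a dataset name for use as a systemd unit instance name.
# Hypothetical helper, simplified: assumes ASCII input only.
zfs_unit_instance() {
  local LC_ALL=C name=$1 out='' c i
  for (( i = 0; i < ${#name}; i++ )); do
    c=${name:i:1}
    case $c in
      /) out+='-' ;;                      # path separator becomes a dash
      .) if (( i == 0 )); then out+='\x2e'; else out+='.'; fi ;;  # a leading dot is escaped
      [a-zA-Z0-9:_]) out+=$c ;;           # safe characters pass through unchanged
      *) out+=$(printf '\\x%02x' "'$c") ;; # everything else, including '-', becomes \xNN
    esac
  done
  printf '%s\n' "$out"
}

# Example: systemctl enable "zfs-load-key@$(zfs_unit_instance pool0/dataset0).service"
```

Note that a literal - in a dataset name is itself escaped (as \x2d), which is why names containing dashes should not be converted by hand.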

Note: The Before=systemd-user-sessions.service ensures that systemd-ask-password is invoked before the local IO devices are handed over to the desktop environment.

An alternative is to load all possible keys:

/etc/systemd/system/zfs-load-key.service

[Unit]
Description=Load encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/zfs load-key -a
StandardInput=tty-force

[Install]
WantedBy=zfs-mount.service

Enable/start zfs-load-key.service.

Unlock at login time: PAM

If you are not encrypting the root volume, but only the home volume or a user-specific volume, another idea is to wait until login to decrypt it. The advantages of this method are that the system boots uninterrupted, and that when the user logs in, the same password can be used both to authenticate and to decrypt the home volume, so that the password is only entered once.

There are two methods for unlocking the home dataset: #PAM module or #Custom script. Both methods assume your datasets are structured similarly to:

$ zfs list -o name,mountpoint,canmount,encryption

NAME                   MOUNTPOINT        CANMOUNT  ENCRYPTION
rpool                  /                 on        off
rpool/home             /home             off       off
rpool/home/user        /home/user        on        aes-256-gcm
rpool/home/user/child  /home/user/child  on        aes-256-gcm

Replace rpool/home, user, and child with the values from your setup.

PAM module

OpenZFS includes a PAM module since 2.0.0. It supports child datasets since 2.3.1.

First, use the following:

# zfs set org.openzfs.systemd:ignore=on rpool/home

This stops systemd from automatically mounting rpool/home. This property will be inherited by all child datasets.

Then create the following file:

/etc/pam.d/zfs-key

auth       optional                    pam_zfs_key.so homes=rpool/home runstatedir=/run/pam_zfs_key mount_recursively
session    [success=1 default=ignore]  pam_succeed_if.so service = systemd-user quiet
session    optional                    pam_zfs_key.so homes=rpool/home runstatedir=/run/pam_zfs_key mount_recursively
password   optional                    pam_zfs_key.so homes=rpool/home runstatedir=/run/pam_zfs_key mount_recursively

mount_recursively may be omitted if there are no child datasets in any of the users' datasets.

Finally, add the following to the beginning of /etc/pam.d/system-auth and /etc/pam.d/su-l:

auth       include      zfs-key
session    include      zfs-key
password   include      zfs-key

Custom script

First set the mountpoint to legacy to avoid having it mounted by zfs mount -a:

# zfs set mountpoint=legacy rpool/home

Ensure that it is in /etc/fstab so that mount /home will work:

/etc/fstab

rpool/home         /home           zfs             rw,xattr,posixacl,noauto        0 0

Alternatively, you can keep using ZFS mounts if you use both:

# zfs set canmount=noauto rpool/home
# zfs set org.openzfs.systemd:ignore=on rpool/home

The first will stop ZFS automatically mounting it, and the second systemd, but you will still be able to manually (or through the following scripts) mount it. If you have child datasets, org.openzfs.systemd:ignore=on will be inherited, but you will need to set canmount=noauto on each as it is not inheritable, otherwise they will try to mount without a mountpoint.

On a single-user system, with only one /home volume having the same encryption password as the user's password, it can be decrypted at login as follows: first create /usr/local/bin/mount-zfs-homedir

/usr/local/bin/mount-zfs-homedir

#!/bin/bash
set -eu

# $PAM_USER will be the username of the user, you can use it for per-user home volumes.
HOME_VOLUME="rpool/home" 

if [ "$(zfs get keystatus "${HOME_VOLUME}" -Ho value)" != "available" ]; then
  PASSWORD=$(cat -)
  # A failed unlock must not abort the login; the mount loop below simply finds nothing to mount.
  zfs load-key "${HOME_VOLUME}" <<< "$PASSWORD" || true
fi

# This will also mount any child datasets, unless they use a different key.
echo "$(zfs list -rHo name,keystatus,mounted "${HOME_VOLUME}")" | while IFS=$'\t' read -r NAME KEYSTATUS MOUNTED; do
  if [ "${MOUNTED}" != "yes" ] && [ "${KEYSTATUS}" == "available" ]; then
    zfs mount "${NAME}" || true
  fi
done

Do not forget to make it executable; then get PAM to run it by adding the following line to /etc/pam.d/system-auth:

/etc/pam.d/system-auth

auth       optional                    pam_exec.so          expose_authtok /usr/local/bin/mount-zfs-homedir

Now it will transparently decrypt and mount the /home volume when you log in anywhere: on the console, via ssh, etc.

SSH

A caveat is that since your ~/.ssh directory is not mounted, if you log in via ssh, you must use password authentication the first time rather than relying on ~/.ssh/authorized_keys.

If you do not wish to enable (insecure) password authentication, you can instead move ~/.ssh/authorized_keys to a new location. Make /etc/ssh/user_config/ and inside it a folder for each user, owned by that user and with 700 permissions. Then move each user's authorized_keys into their respective folders, and edit the system sshd configuration:

/etc/ssh/sshd_config

AuthorizedKeysFile /etc/ssh/user_config/%u/authorized_keys

Then restart sshd.service. You can also optionally make a link for each user from ~/.ssh/authorized_keys to the new location so users can still edit it as they are used to.

This will let you log in, but your home partition will not be mounted, and you will need to do so manually. There are multiple options to work around this:

SSH Key & Password when required

It is possible to set up PAM to only prompt for a password via SSH when it is necessary to decrypt your home partition. You will need to enable both publickey and keyboard-interactive authentication methods:

/etc/ssh/sshd_config

PubkeyAuthentication yes
KbdInteractiveAuthentication yes
AuthenticationMethods publickey,keyboard-interactive

## Example of excluding a certain user who does not have an encrypted home directory.
#Match User nohome
#  KbdInteractiveAuthentication no
#  AuthenticationMethods publickey

Warning: Note the comma in AuthenticationMethods publickey,keyboard-interactive: it means that both authentication methods are required to log in with SSH. The very similar AuthenticationMethods publickey keyboard-interactive means that either method is enough to log in, which would let someone bypass your public key auth.

Note: You may ask why keyboard-interactive and not password? password is done client-side, so even if the auth is skipped, the user is still prompted and the password is just thrown away. With keyboard-interactive the user does not get prompted at all when we skip it.

This will mean it asks for the password after validating the key, but using PAM we can stop it asking for the password when not needed. We make a script that will fail when the key is not available to us:

/usr/local/bin/require-encrypted-homedir

#!/bin/bash
set -eu

HOME_VOLUME="zroot/data/home" # You can use $PAM_USER to use the username in the volume for a per-user solution.

if [ "$(zfs get keystatus "${HOME_VOLUME}" -Ho value)" != "available" ]; then
  exit 27 # PAM_TRY_AGAIN
elif [[ "${SSH_AUTH_INFO_0:-""}" =~ ^"publickey " ]]; then
  exit 0
else
  # If this happens, it implies a configuration error: either you are allowing auth without a public 
  # key, or have enabled this in a non-SSH PAM service. Both are dangerous and this should block it, 
  # but if you see it, fix your configuration.
  exit 3 # PAM_SERVICE_ERR
fi

And make it executable.

Now we want to configure PAM to call this, and skip asking for the password if the script succeeds because we already have the key available. Add this line above the existing auth line(s) you want to skip (all of them unless you have something else set up) for the SSH service:

/etc/pam.d/sshd

auth sufficient pam_exec.so /usr/local/bin/require-encrypted-homedir

Warning: This is for /etc/pam.d/sshd, not /etc/pam.d/system-auth as above. You do not want local users without a public key to be able to skip the password. There is a safeguard in the script against this, but it is still best to be careful.

Note: When using private keys, the auth step is skipped in PAM as the private key authentication is handled entirely by sshd. This means that the script we are adding here will never be run for private keys and they cannot be skipped, however, we still do a check for defence-in-depth to try and ensure a key has been checked.

With this, you will be prompted for a password only when the key is not loaded.

SSH Key & Password

A simpler option is to just enable both methods, meaning your key still gets checked, but then you have to type the password too, which will decrypt your home partition.

/etc/ssh/sshd_config

PubkeyAuthentication yes
PasswordAuthentication yes
AuthenticationMethods publickey,password

Warning: Note the comma in AuthenticationMethods publickey,password: it means that both authentication methods are required to log in with SSH. The very similar AuthenticationMethods publickey password means that either method is enough to log in, which would let someone bypass your public key auth.

This works (and will not let anyone authenticate with just a password), but has the downside of requiring your password every time.

You can also specify something like:

AuthenticationMethods publickey password,publickey

This allows clients to use either just a public key, or a public key and a password. Which one the client uses depends on its PreferredAuthentications option: -o PreferredAuthentications=password,publickey will ask for the password, while -o PreferredAuthentications=publickey will not. This is more manual than an automated fallback, but has fewer moving parts, and avoids asking you every time if you prefer publickey by default (you can use host-specific options on clients to simplify setting these options).
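
Because the difference between a comma and a space in AuthenticationMethods is easy to misread, the following sketch spells a value out. explain_auth_methods is a hypothetical helper that only formats the string; it does not read sshd_config or talk to sshd.

```shell
#!/usr/bin/env bash
# Spell out an AuthenticationMethods value: spaces separate alternative lists,
# commas join methods that must ALL succeed within one list.
# Hypothetical helper: pure string formatting, one output line per alternative.
explain_auth_methods() {
  local lists list
  read -r -a lists <<< "$1"
  for list in "${lists[@]}"; do
    printf '%s\n' "${list//,/ AND }"
  done
}

# Example: explain_auth_methods 'publickey password,publickey'
```

Each output line is one acceptable way to log in, so a value that prints a line containing only a password-style method allows logging in without a public key.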

Swap volume

Warning:

On systems with extremely high memory pressure, using a zvol for swap can result in lockup, regardless of how much swap is still available. This issue is currently being investigated in OpenZFS issue #7734.
Swap on zvol does not support resume from hibernation; attempting to resume will result in pool corruption. Possible workaround: https://github.com/openzfs/zfs/issues/260#issuecomment-758782144

ZFS does not support swap files, but users can use a ZFS volume (ZVOL) as swap. It is important to set the ZVOL block size to match the system page size, which can be obtained by the getconf PAGESIZE command (default on x86_64 is 4KiB). Another option useful for keeping the system running well in low-memory situations is not caching the ZVOL data.

Create an 8 GiB ZFS volume:

# zfs create -V 8G -b $(getconf PAGESIZE) -o compression=zle \
              -o logbias=throughput -o sync=always \
              -o primarycache=metadata -o secondarycache=none \
              -o com.sun:auto-snapshot=false <pool>/swap

Prepare it as a swap device:

# mkswap -f /dev/zvol/<pool>/swap
# swapon /dev/zvol/<pool>/swap

To make it permanent, edit /etc/fstab. ZVOLs support discard, which can potentially help ZFS's block allocator and reduce fragmentation for all other datasets when/if swap is not full.

Add a line to /etc/fstab:

/dev/zvol/<pool>/swap none swap discard 0 0

Access Control Lists

To use ACLs on a dataset:

# zfs set acltype=posixacl <nameofzpool>/<nameofdataset>
# zfs set xattr=sa <nameofzpool>/<nameofdataset>

Setting xattr is recommended for performance reasons [1].

It may be preferable to enable ACL on the zpool as datasets will inherit the ACL parameters. Setting aclinherit=passthrough may be wanted as the default mode is restricted [2]; however, it is worth noting that aclinherit does not affect POSIX ACLs [3]:

# zfs set aclinherit=passthrough <nameofzpool>
# zfs set acltype=posixacl <nameofzpool>
# zfs set xattr=sa <nameofzpool>

Since version 2.3 xattr=sa is the default and will be displayed as xattr=on[4].

Databases

ZFS, unlike most other file systems, has a variable record size, or what is commonly referred to as a block size. By default, the recordsize on ZFS is 128 KiB, which means it will dynamically allocate blocks of any size from 512 B to 128 KiB depending on the size of the file being written. This often reduces fragmentation and speeds up file access, at the cost that ZFS has to rewrite a whole block of up to 128 KiB even when only a few bytes in it are modified.

Most RDBMSes work in fixed-size pages. PostgreSQL and Oracle Database use an 8 KiB block size by default, while MySQL/MariaDB (InnoDB) use 16 KiB pages by default. The size is tunable in all of them, so check your specific DBMS before setting this value. For both performance concerns and keeping snapshot differences to a minimum (for backup purposes, this is helpful), it is usually desirable to tune ZFS to accommodate the database, using a command such as:

# zfs set recordsize=8K <pool>/postgres

These RDBMSes also tend to implement their own caching algorithm, often similar to ZFS's own ARC. In the interest of saving memory, it is best to simply disable ZFS's caching of the database's file data and let the database do its own job:

Note: L2ARC requires primarycache to function, because it is fed with data evicted from primarycache. If you intend to use the L2ARC, do not set the option below, otherwise no actual data will be cached on L2ARC.

# zfs set primarycache=metadata <pool>/postgres

ZFS uses the ZIL for crash recovery, but databases often sync their data files to the file system on their own transaction commits anyway. As a result, ZFS commits the data to the data disks twice, which can severely impact performance. You can tell ZFS to prefer not to use the ZIL, in which case data is only committed to the file system once. However, doing so on non-solid state storage (e.g. HDDs) can result in decreased read performance due to fragmentation (OpenZFS Wiki); with mechanical hard drives, please consider using a dedicated SSD as ZIL rather than setting the option below. In addition, setting this for non-database file systems, or for pools with configured log devices, can also negatively impact performance, so beware:

# zfs set logbias=throughput <pool>/postgres

These can also be done at file system creation time, for example:

# zfs create -o recordsize=8K \
             -o primarycache=metadata \
             -o mountpoint=/var/lib/postgres \
             -o logbias=throughput \
              <pool>/postgres

Please note: these kinds of tuning parameters are ideal for specialized applications like RDBMSes. You can easily hurt ZFS's performance by setting these on a general-purpose file system such as your /home directory.

/tmp

If you would like to use ZFS to store your /tmp directory, which may be useful for storing arbitrarily large sets of files or simply keeping your RAM free of idle data, you can generally improve the performance of certain applications writing to /tmp by disabling file system sync. This causes ZFS to ignore an application's sync requests (e.g. with fsync or O_SYNC) and return immediately. While this has severe application-side data consistency consequences (never disable sync for a database!), files in /tmp are less likely to be important and affected. Please note this does not affect the integrity of ZFS itself, only the possibility that data an application expects on-disk may not have actually been written out following a crash.

# zfs set sync=disabled <pool>/tmp

Additionally, for security purposes, you may want to disable setuid and devices on the /tmp file system, which prevents some kinds of privilege-escalation attacks or the use of device nodes:

# zfs set setuid=off <pool>/tmp
# zfs set devices=off <pool>/tmp

Combining all of these for a create command would be as follows:

# zfs create -o setuid=off -o devices=off -o sync=disabled -o mountpoint=/tmp <pool>/tmp

Please note, also, that if you want /tmp on ZFS, you will need to mask (disable) systemd's automatic tmpfs-backed /tmp (tmp.mount), else ZFS will be unable to mount your dataset at boot-time or import-time.

Transmitting snapshots with ZFS Send and ZFS Recv

It is possible to pipe ZFS snapshots to an arbitrary target by pairing zfs send and zfs recv. This is done through standard output, which allows the data to be sent to any file, device, across the network, or manipulated mid-stream by incorporating additional programs in the pipe.

Below are examples of common scenarios:

Basic ZFS Send

First, create a snapshot of some ZFS filesystem:

# zfs snapshot zpool0/archive/books@snap

Now send the snapshot to a new location on a different zpool:

# zfs send -v zpool0/archive/books@snap | zfs recv zpool4/library

The contents of zpool0/archive/books@snap are now live at zpool4/library

Tip: See man zfs-send and man zfs-recv for details on flags.

To and from files

First, create a snapshot of some ZFS filesystem:

# zfs snapshot zpool0/archive/books@snap

Write the snapshot to a gzip-compressed file:

# zfs send zpool0/archive/books@snap | gzip > /tmp/mybooks.gz

Warning: Make sure to run zfs send with the -w flag if you wish to preserve encryption during the send.

Now restore the snapshot from the file:

# gzip -dc /tmp/mybooks.gz | zfs recv -F zpool0/archive/books

Send over ssh

First, create a snapshot of some ZFS filesystem:

# zfs snapshot zpool1/filestore@snap

Next we pipe our "send" traffic over an ssh session running "recv":

# zfs send -v zpool1/filestore@snap | ssh $HOST zfs recv coldstore/backups

The -v flag prints information about the datastream being generated. If you are using a passphrase or passkey, you will be prompted to enter it.

Incremental Backups

You may wish to update a previously sent ZFS filesystem without retransmitting all of the data. Alternatively, it may be necessary to keep a filesystem online during a lengthy transfer, and it is now time to send the writes that were made since the initial snapshot.

First, create a snapshot of some ZFS filesystem:

# zfs snapshot zpool1/filestore@initial

Next we pipe our "send" traffic over an ssh session running "recv":

# zfs send -v -R zpool1/filestore@initial | ssh $HOST zfs recv coldstore/backups

Once changes are written, make another snapshot:

# zfs snapshot zpool1/filestore@snap2

The following will send the differences that exist locally between zpool1/filestore@initial and zpool1/filestore@snap2 and create an additional snapshot for the remote filesystem coldstore/backups:

# zfs send -v -R -i zpool1/filestore@initial zpool1/filestore@snap2 | ssh $HOST zfs recv coldstore/backups

Now both zpool1/filestore and coldstore/backups have the @initial and @snap2 snapshots.
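
The incremental step can be sketched as a small helper that only prints the pipeline, so it can be reviewed before being run. This is a minimal sketch: the function name incremental_send_cmd and the host name backuphost are hypothetical, and the flags are the ones already used in this section.

```shell
# Prints (does not run) the pipeline for an incremental replication step.
# Arguments: dataset, older snapshot, newer snapshot, ssh host, target dataset.
incremental_send_cmd() {
    dataset=$1 from=$2 to=$3 host=$4 target=$5
    # -i takes the older snapshot; the newer snapshot is the stream to send.
    printf 'zfs send -v -R -i %s@%s %s@%s | ssh %s zfs recv %s\n' \
        "$dataset" "$from" "$dataset" "$to" "$host" "$target"
}

# backuphost is a hypothetical host name.
incremental_send_cmd zpool1/filestore initial snap2 backuphost coldstore/backups
```

The older snapshot named after -i must already exist on the receiving side, which is why the full send of @initial comes first.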

On the remote host, you may now roll the filesystem back to the latest snapshot, discarding any changes made there since it was received:

# zfs rollback coldstore/backups@snap2

Tuning

General

ZFS pools and datasets can be further adjusted using parameters.

Note: All settable properties, with the exception of quotas and reservations, inherit their value from the parent dataset.

To retrieve the current pool parameter status:

# zfs get all <pool>

To retrieve the current dataset parameter status:

# zfs get all <pool>/<dataset>

To disable access time (atime), which is enabled by default:

# zfs set atime=off <pool>

To disable access time (atime) on a particular dataset:

# zfs set atime=off <pool>/<dataset>

As an alternative to turning off atime completely, relatime is available. This brings the default ext4/XFS atime semantics to ZFS, where access time is only updated if the modified time or changed time changes, or if the existing access time has not been updated within the past 24 hours. It is a compromise between atime=off and atime=on. This property only takes effect if atime is on:

# zfs set atime=on <pool>
# zfs set relatime=on <pool>
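
The relatime rule described above can be illustrated with a few lines of shell. This is only a sketch of the decision logic, not ZFS code: the function name relatime_should_update is hypothetical and all times are plain Unix timestamps in seconds.

```shell
# Returns success (0) when a read at time $4 would update the access time
# under relatime-style rules. Arguments: atime, mtime, ctime, now.
relatime_should_update() {
    atime=$1 mtime=$2 ctime=$3 now=$4
    # The file was modified or changed since it was last accessed.
    [ "$mtime" -ge "$atime" ] && return 0
    [ "$ctime" -ge "$atime" ] && return 0
    # The recorded access time is at least 24 hours old.
    [ $((now - atime)) -ge 86400 ] && return 0
    return 1
}

# Example: file written at t=1000, read at t=2000, read again at t=3000.
if relatime_should_update 2000 1000 1000 3000; then
    echo "atime updated"
else
    echo "atime left unchanged"
fi
```

In the example the second read does not update the access time, because the file has not changed since the first read and less than a day has passed.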

Compression is transparent to applications. ZFS supports several algorithms: lz4 is presently the default, and gzip is also available for seldom-written yet highly compressible data; consult the OpenZFS Wiki for more details.

To enable compression:

# zfs set compression=on <pool>

To reset a property of a pool and/or dataset to its default state, use zfs inherit:

# zfs inherit -rS atime <pool>
# zfs inherit -rS atime <pool>/<dataset>

Warning: Using the -r flag will recursively reset all datasets of the zpool.

Scrubbing

Whenever data is read and ZFS encounters an error, it is silently repaired when possible, rewritten back to disk and logged so you can obtain an overview of errors on your pools. There is no fsck or equivalent tool for ZFS. Instead, ZFS supports a feature known as scrubbing. This traverses through all the data in a pool and verifies that all blocks can be read.

To scrub a pool:

# zpool scrub <pool>

To cancel a running scrub:

# zpool scrub -s <pool>

How often should I do this?

From the Oracle blog post Disk Scrub - Why and When?:

This question is challenging for Support to answer, because as always the true answer is "It Depends". So before I offer a general guideline, here are a few tips to help you create an answer more tailored to your use pattern.

What is the expiration of your oldest backup? You should probably scrub your data at least as often as your oldest tapes expire so that you have a known-good restore point.
How often are you experiencing disk failures? While the recruitment of a hot-spare disk invokes a "resilver" -- a targeted scrub of just the VDEV which lost a disk -- you should probably scrub at least as often as you experience disk failures on average in your specific environment.
How often is the oldest piece of data on your disk read? You should scrub occasionally to prevent very old, very stale data from experiencing bit-rot and dying without you knowing it.

If any of your answers to the above are "I do not know", the general guideline is: you should probably be scrubbing your zpool at least once per month. It is a schedule that works well for most use cases, provides enough time for scrubs to complete before starting up again on all but the busiest & most heavily-loaded systems, and even on very large zpools (192+ disks) should complete fairly often between disk failures.

In the ZFS Administration Guide by Aaron Toponce, he advises to scrub consumer disks once a week.

Start with a service or timer

Note: Starting with OpenZFS 2.1.3 weekly and monthly systemd timers/services are included. To use these enable/start zfs-scrub-weekly@pool-to-scrub.timer or zfs-scrub-monthly@pool-to-scrub.timer for the desired pool.

Using a systemd timer/service it is possible to automatically scrub pools.

To perform scrubbing monthly on a particular pool:

/etc/systemd/system/zfs-scrub@.timer

[Unit]
Description=Monthly zpool scrub on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target

/etc/systemd/system/zfs-scrub@.service

[Unit]
Description=zpool scrub on %i

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/usr/bin/zpool scrub %i

[Install]
WantedBy=multi-user.target

Enable/start the zfs-scrub@pool-to-scrub.timer unit for monthly scrubbing of the specified zpool.

Enabling TRIM

To quickly query your vdevs' TRIM support, you can include trimming information in zpool status with -t.

$ zpool status -t tank

pool: tank
 state: ONLINE
  scan: none requested
 config:

	NAME                                     STATE     READ WRITE CKSUM
	tank                                     ONLINE       0     0     0
	  ata-ST31000524AS_5RP4SSNR-part1        ONLINE       0     0     0  (trim unsupported)
	  ata-CT480BX500SSD1_2134A59B933D-part1  ONLINE       0     0     0  (untrimmed)

errors: No known data errors

ZFS is capable of trimming supported vdevs either on-demand or periodically via the autotrim property.

Manually performing a TRIM operation on a zpool:

 # zpool trim <zpool>

Enabling periodic trimming on all supported vdevs in a pool:

 # zpool set autotrim=on <zpool>

Note: Because of how the automatic TRIM and a full zpool trim differ in their operation, it can make sense to run a manual trim occasionally.

To perform a full zpool trim monthly on a particular pool using a systemd timer/service:

/etc/systemd/system/zfs-trim@.timer

[Unit]
Description=Monthly zpool trim on %i

[Timer]
OnCalendar=monthly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=multi-user.target

/etc/systemd/system/zfs-trim@.service

[Unit]
Description=zpool trim on %i
Documentation=man:zpool-trim(8)
Requires=zfs.target
After=zfs.target
ConditionACPower=true
ConditionPathIsDirectory=/sys/module/zfs

[Service]
Nice=19
IOSchedulingClass=idle
KillSignal=SIGINT
ExecStart=/bin/sh -c '\
if /usr/bin/zpool status %i | grep "trimming"; then\
exec /usr/bin/zpool wait -t trim %i;\
else exec /usr/bin/zpool trim -w %i; fi'
ExecStop=-/bin/sh -c '/usr/bin/zpool trim -s %i 2>/dev/null || true'

[Install]
WantedBy=multi-user.target

Enable/start zfs-trim@pool-to-trim.timer unit for monthly trimming of the specified zpool.

SSD Caching

If your pool has no configured log devices, ZFS reserves space on the pool's data disks for its intent log (the ZIL). A dedicated log device is called a SLOG. If your data disks are slow (e.g. HDD), it is highly recommended to place the ZIL on solid state drives for better write performance, and also to consider a layer 2 adaptive replacement cache (L2ARC). The process to add them is very similar to adding a new VDEV.

All of the below references to device-id are the IDs from /dev/disk/by-id/*.

ZIL

To add a mirrored ZIL:

 # zpool add <pool> log mirror <device-id-1> <device-id-2>

Or to add a single device ZIL:

 # zpool add <pool> log <device-id>

Because the ZIL device stores data that has not been written to the pool, it is important to use devices that can finish writes when power is lost. It is also important to use redundancy, since a device failure can cause data loss. In addition, the ZIL is only used for sync writes, so may not provide any performance improvement when your data drives are as fast as your ZIL drive(s).

L2ARC

To add L2ARC:

# zpool add <pool> cache <device-id>

L2ARC is only a read cache, so redundancy is unnecessary. Since ZFS version 2.0.0, L2ARC is persisted across reboots.[5]

L2ARC is generally only useful in workloads where the amount of hot data is bigger than system memory, but small enough to fit into L2ARC. The L2ARC is indexed by the ARC in system memory, consuming 70 bytes per record (default 128KiB). Thus, the equation for RAM usage is:

(L2ARC size) / (recordsize) * 70 bytes

Because of this, L2ARC can, in certain workloads, harm performance as it takes memory away from ARC.
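
As a worked example of the formula above, the following sketch estimates the RAM needed to index a cache device. The 512 GiB device size and the function name l2arc_index_bytes are assumptions made for illustration; the 70 bytes per record figure is the one quoted above.

```shell
# Estimated RAM (in bytes) needed to index an L2ARC device.
# Arguments: L2ARC size in bytes, record size in bytes.
l2arc_index_bytes() {
    echo $(( $1 / $2 * 70 ))
}

l2arc_size=$((512 * 1024 * 1024 * 1024))   # hypothetical 512 GiB cache device

# Default 128 KiB records:
echo "$(( $(l2arc_index_bytes "$l2arc_size" 131072) / 1048576 )) MiB"
# 8 KiB records, as used for some databases:
echo "$(( $(l2arc_index_bytes "$l2arc_size" 8192) / 1048576 )) MiB"
```

This prints 280 MiB for 128 KiB records and 4480 MiB for 8 KiB records, which shows why small record sizes make a large L2ARC expensive in memory.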

ZVOLs

ZFS volumes (ZVOLs) can suffer from the same block size-related issues as RDBMSes, but it is worth noting that ZVOLs use the volblocksize property instead of recordsize, and that its default is already 8 KiB (newer OpenZFS releases may use a larger default; see zfsprops(7)). The volblocksize can only be set when the volume is created. If possible, it is best to align any partitions contained in a ZVOL to your volblocksize (current versions of fdisk and gdisk by default automatically align at 1 MiB segments, which works), and file system block sizes to the same size. Other than this, you might choose a different volblocksize to accommodate the data inside the ZVOL as necessary (though 8 KiB tends to be a good value for most file systems, even when using 4 KiB blocks on that level).

RAIDZ and Advanced Format physical disks

Each block of a ZVOL gets its own parity, and if you have physical media with block sizes of 4096 B, 8192 B, and so on, the parity needs to be stored in whole physical blocks. This can drastically increase the space requirements of a ZVOL, requiring 2× or more physical storage capacity than the ZVOL's logical capacity. Setting the volblocksize to 16k or 32k can help reduce this footprint drastically.

See OpenZFS issue #1807 for details.
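
The effect can be estimated with a little arithmetic. The sketch below assumes the commonly described RAIDZ allocation rule: data sectors, plus parity sectors for every row of data disks, with the total padded to a multiple of (parity + 1) sectors. The function name raidz_alloc_bytes and the 5-disk RAIDZ1 layout with ashift=12 are hypothetical.

```shell
# Bytes allocated on a RAIDZ vdev for one volume block.
# Arguments: block size in bytes, ashift, number of disks, parity level (1-3).
raidz_alloc_bytes() {
    block=$1 ashift=$2 disks=$3 parity=$4
    sector=$((1 << ashift))
    data=$(( (block + sector - 1) / sector ))            # data sectors
    stripe=$((disks - parity))                           # data disks per row
    par=$(( parity * ((data + stripe - 1) / stripe) ))   # parity sectors
    total=$((data + par))
    unit=$((parity + 1))                                 # padding granularity
    total=$(( (total + unit - 1) / unit * unit ))
    echo $((total * sector))
}

# Hypothetical layout: 5-disk RAIDZ1 on 4 KiB sectors (ashift=12).
for kib in 8 16 32; do
    echo "${kib}K block -> $(raidz_alloc_bytes $((kib * 1024)) 12 5 1) bytes on disk"
done
```

Under these assumptions an 8 KiB block occupies 16384 bytes (2×), a 16 KiB block 24576 bytes (1.5×) and a 32 KiB block 40960 bytes (1.25×).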

I/O Scheduler

While ZFS is expected to work well with modern schedulers, including mq-deadline and none, experimenting with manually setting the I/O scheduler on ZFS disks may yield performance gains. The ZFS recommendation is that users leave the default scheduler "unless you're encountering a specific problem, or have clearly measured a performance improvement for your workload".[6]

Troubleshooting

Creating a zpool fails

If the following error occurs, it can be fixed as described below.

# the kernel failed to rescan the partition table: 16
# cannot label 'sdc': try using parted(8) and then provide a specific slice: -1

One reason this can occur is because ZFS expects pool creation to take less than 1 second[7][8]. This is a reasonable assumption under ordinary conditions, but in many situations it may take longer. Each drive will need to be cleared again before another attempt can be made.

# parted /dev/sda rm 1
# dd if=/dev/zero of=/dev/sda bs=512 count=1
# zpool labelclear /dev/sda

Pool creation can simply be attempted over and over again, and with some luck it will take less than 1 second. One cause of slow creation can be slow burst reads and writes on a drive. By reading from the disk in parallel to pool creation, it may be possible to increase burst speeds.

# dd if=/dev/sda of=/dev/null

This can be done with multiple drives by saving the above command for each drive to a file on separate lines and running

# cat $FILE | parallel

Then run ZPool creation at the same time.
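
The command file mentioned above can be generated with a short helper. This is a sketch: the function name write_read_commands and the device names sda, sdb and sdc are placeholders for your own drives.

```shell
# Writes one read-only dd command per drive to a command file.
# Arguments: output file, then one or more device names (placeholders here).
write_read_commands() {
    out=$1
    shift
    : > "$out"                       # start with an empty file
    for dev in "$@"; do
        echo "dd if=/dev/$dev of=/dev/null" >> "$out"
    done
}

FILE=$(mktemp)
write_read_commands "$FILE" sda sdb sdc
cat "$FILE"
```

The resulting file holds one dd command per line and can be passed to parallel as shown above.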

ZFS is using too much RAM

By default, ZFS caches file operations (ARC) using up to half of available system memory on the host. To adjust the ARC size, add the following to the Kernel parameters list:

zfs.zfs_arc_max=536870912 # (for 512MiB)

If the default value of zfs_arc_min (1/32 of system memory) is higher than the specified zfs_arc_max, also add the following to the Kernel parameters list:

zfs.zfs_arc_min=268435456 # (for 256MiB, needs to be lower than zfs.zfs_arc_max)
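
The byte values above can be derived rather than typed by hand. The following sketch prints the parameters for a given memory size and ARC limit; the function name arc_kernel_params is hypothetical, and using half of the maximum for zfs_arc_min is an arbitrary choice that matches the example values above.

```shell
# Prints kernel parameters that cap the ARC at a given size.
# Arguments: total system memory in MiB, desired ARC maximum in MiB.
arc_kernel_params() {
    mem_mib=$1 max_mib=$2
    echo "zfs.zfs_arc_max=$((max_mib * 1024 * 1024))"
    # The default zfs_arc_min is 1/32 of system memory; lower it if it would
    # exceed the requested maximum. Half of the maximum is an arbitrary choice.
    if [ $((mem_mib / 32)) -gt "$max_mib" ]; then
        echo "zfs.zfs_arc_min=$((max_mib / 2 * 1024 * 1024))"
    fi
}

arc_kernel_params 32768 512   # 32 GiB of RAM, 512 MiB ARC
```

For a machine with 32 GiB of RAM this prints both parameters, because 1/32 of memory (1 GiB) is above the requested 512 MiB maximum.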

You may also want to increase zfs_arc_sys_free instead (in this example to 8GiB):

# echo $((8*1024**3)) > /sys/module/zfs/parameters/zfs_arc_sys_free

For a more detailed description, as well as other configuration options, see Gentoo:ZFS#ARC.

ZFS should release ARC memory as applications request more RAM, but some applications still get confused, because memory held by the ARC is generally reported as used rather than as reclaimable cache. If all your applications work as intended and you have no problems, there is no need to change ARC settings.

No hostid found

An error that occurs at boot with the following lines appearing before initscript output:

ZFS: No hostid found on kernel command line or /etc/hostid.

This warning occurs because the ZFS module does not have access to the SPL hostid. There are two solutions. The first is to place the SPL hostid in the kernel parameters in the boot loader, for example by adding spl.spl_hostid=0x00bab10c.

The other solution is to make sure that there is a hostid in /etc/hostid, and then regenerate the initramfs image, which will copy the hostid into the initramfs image.

Pool cannot be found while booting from SAS/SCSI devices

If you are booting from SAS/SCSI-based devices, you might occasionally get boot problems where the pool you are trying to boot from cannot be found. A likely reason is that your devices are initialized too late in the boot process, so ZFS cannot find any devices at the time it tries to assemble your pool.

In this case you should force the scsi driver to wait for devices to come online before continuing. You can do this by putting this into /etc/modprobe.d/zfs.conf:

/etc/modprobe.d/zfs.conf

options scsi_mod scan=sync

Afterwards, regenerate the initramfs.

This works because the zfs hook copies /etc/modprobe.d/zfs.conf into the initramfs when the image is built, so the option is applied during early boot.

On boot the zfs pool does not mount stating: "pool may be in use from other system"

Unexported pool

If the new installation does not boot because the zpool cannot be imported, chroot into the installation and properly export the zpool. See #Emergency chroot repair with archzfs.

Once inside the chroot environment, load the ZFS module and force import the zpool:

# zpool import -a -f

Now export the pool:

# zpool export <pool>

To see the available pools, use:

# zpool status

It is necessary to export a pool because of the way ZFS uses the hostid to track the system the zpool was created on. The hostid is generated partly based on the network setup. During the installation in the archiso the network configuration could be different generating a different hostid than the one contained in the new installation. Once the zfs filesystem is exported and then re-imported in the new installation, the hostid is reset. See Re: Howto zpool import/export automatically? - msg#00227.

If ZFS complains about "pool may be in use" after every reboot, properly export the pool as described above, and then regenerate the initramfs in the normally booted system.

Incorrect hostid

Double check that the pool is properly exported. Exporting the zpool clears the hostid marking the ownership, so during the first boot the zpool should mount correctly. If it does not, there is some other problem.

Reboot again. If the zfs pool refuses to mount, the hostid is not yet correctly set in the early boot phase, which confuses zfs. Manually tell zfs the correct number; once the hostid is consistent across reboots, the zpool will mount correctly.

Boot using zfs_force and write down the hostid. This one is just an example.

$ hostid

0a0af0f8

This number has to be added to the kernel parameters as spl.spl_hostid=0x0a0af0f8. Another solution is writing the hostid inside the initramfs image, which is as simple as creating a hostid file and regenerating the initramfs image.
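
To avoid copying the number by hand, the kernel parameter can be formatted directly from the output of hostid. This is a minimal sketch; the function name hostid_kernel_param is hypothetical.

```shell
# Formats a hostid (as printed by the hostid command) as a kernel parameter.
hostid_kernel_param() {
    printf 'spl.spl_hostid=0x%s\n' "$1"
}

hostid_kernel_param 0a0af0f8       # the example value shown above
# On a real system: hostid_kernel_param "$(hostid)"
```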

Users can always ignore the check by adding zfs_force=1 to the kernel parameters, but it is not advisable as a permanent solution.

Devices have different sector alignment

Once a drive has become faulted it should be replaced A.S.A.P. with an identical drive.

# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -f

but in this instance, the following error is produced:

cannot replace ata-ST3000DM001-9YN166_S1F0KDGY with ata-ST3000DM001-1CH166_W1F478BD: devices have different sector alignment

ZFS uses the ashift option to adjust for physical block size. When replacing the faulted disk, ZFS is attempting to use ashift=12, but the faulted disk is using a different ashift (probably ashift=9) and this causes the resulting error.

For Advanced Format disks with 4 KiB block size, an ashift of 12 is recommended for best performance. See OpenZFS FAQ: Performance Considerations and ZFS and Advanced Format disks.
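
The ashift value is the base-2 logarithm of the sector size, which the following sketch computes. The function name ashift_for_sector is hypothetical, and the input is assumed to be a power of two.

```shell
# Prints the ashift value for a given sector size in bytes (a power of two).
ashift_for_sector() {
    size=$1 shift_bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        shift_bits=$((shift_bits + 1))
    done
    echo "$shift_bits"
}

ashift_for_sector 512    # prints 9
ashift_for_sector 4096   # prints 12
```

The physical sector size of a drive can be read with, for example, lsblk -o NAME,PHY-SEC.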

Use zdb to find the ashift of the zpool, then use the -o argument to set the ashift of the replacement drive:

# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -o ashift=9 -f

Check the zpool status for confirmation:

# zpool status -v

pool: bigdata
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jun 16 11:16:28 2014
    10.3G scanned out of 5.90T at 81.7M/s, 20h59m to go
    2.57G resilvered, 0.17% done
config:

        NAME                                   STATE     READ WRITE CKSUM
        bigdata                                DEGRADED     0     0     0
          raidz1-0                             DEGRADED     0     0     0
            replacing-0                        OFFLINE      0     0     0
              ata-ST3000DM001-9YN166_S1F0KDGY  OFFLINE      0     0     0
              ata-ST3000DM001-1CH166_W1F478BD  ONLINE       0     0     0  (resilvering)
            ata-ST3000DM001-9YN166_S1F0JKRR    ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0KBP8    ONLINE       0     0     0
            ata-ST3000DM001-9YN166_S1F0JTM1    ONLINE       0     0     0

errors: No known data errors

Pool resilvering stuck/restarting/slow?

According to ZFS issue #840, this is a known issue since 2012 with ZFS-ZED which causes the resilvering process to constantly restart, sometimes get stuck and be generally slow for some hardware. The simplest mitigation is to stop zfs-zed.service until the resilver completes.

Fix slow boot caused by failed import of unavailable pools in the initramfs zpool.cache

Your boot time can be significantly impacted if you update your initramfs (e.g. when doing a kernel update) while additional, non-permanently attached pools are imported. These pools get added to your initramfs zpool.cache, and ZFS will attempt to import them on every boot, regardless of whether you have exported them and removed them from your regular zpool.cache.

If you notice ZFS trying to import unavailable pools at boot, first check your zpool.cache for pools you do not want imported at boot:

$ zdb -C

If this command shows additional, currently unavailable pools, clear the zpool.cache of any pools other than the pool named zroot:

# zpool set cachefile=/etc/zfs/zpool.cache zroot

Sometimes there is no need to refresh your zpool.cache, and all you need to do is regenerate the initramfs.

ZFS Command History

ZFS logs changes to a pool's structure natively as a log of executed commands in a ring buffer (which cannot be turned off). The log may be helpful when restoring a degraded or failed pool.

# zpool history zpool

History for 'zpool':
2023-02-19.16:28:44 zpool create zpool raidz1 /scratch/disk_1.img /scratch/disk_2.img /scratch/disk_3.img
2023-02-19.16:31:29 zfs set compression=lz4 zpool
2023-02-19.16:41:45 zpool scrub zpool
2023-02-19.17:00:57 zpool replace zpool /scratch/disk_1.img /scratch/bigger_disk_1.img
2023-02-19.17:01:34 zpool scrub zpool
2023-02-19.17:01:42 zpool replace zpool /scratch/disk_2.img /scratch/bigger_disk_2.img
2023-02-19.17:01:46 zpool replace zpool /scratch/disk_3.img /scratch/bigger_disk_3.img

Tips and tricks

Create an Archiso image with ZFS support

See Install Arch Linux on ZFS#Embedding ZFS module into custom archiso.

Automatic snapshots

zrepl

The zrepl^AUR package provides a ZFS automatic replication service, which could also be used as a snapshotting service much like snapper.

For details on how to configure the zrepl daemon, see the zrepl documentation. The configuration file should be located at /etc/zrepl/zrepl.yml. Then, run zrepl configcheck to make sure that the syntax of the config file is correct. Finally, enable zrepl.service.

sanoid

sanoid^AUR is a policy-driven tool for taking snapshots. Sanoid also includes syncoid, which is for replicating snapshots. It comes with systemd services and a timer.

Sanoid only prunes snapshots on the local system. To prune snapshots on the remote system, run sanoid there as well with prune options. Either use the --prune-snapshots command line option or use the --cron command line option together with the autoprune = yes and autosnap = no configuration options.

ZFS Automatic Snapshot Service for Linux

Note: zfs-auto-snapshot-git^AUR has not seen any updates since 2019, and the functionality is extremely limited. You are advised to switch to a newer tool like zrepl^AUR.

The zfs-auto-snapshot-git^AUR package provides a shell script to automate the management of snapshots, with each named by date and label (hourly, daily, etc), giving quick and convenient snapshotting of all ZFS datasets. The package also installs cron tasks for quarter-hourly, hourly, daily, weekly, and monthly snapshots. Optionally adjust the --keep parameter from the defaults depending on how far back the snapshots are to go (the monthly script by default keeps data for up to a year).

To prevent a dataset from being snapshotted at all, set com.sun:auto-snapshot=false on it. More fine-grained control by label is also possible: if, for example, no monthlies are to be kept for a dataset, set com.sun:auto-snapshot:monthly=false.

Note: zfs-auto-snapshot-git will not create snapshots during scrubbing. It is possible to override this by editing the provided systemd unit and removing --skip-scrub from the ExecStart line; the consequences of doing so are not known.

Once the package has been installed, enable and start the selected timers (zfs-auto-snapshot-{frequent,daily,weekly,monthly}.timer).

Creating a share

ZFS has support for creating shares by NFS or SMB.

NFS

Make sure NFS has been installed and configured; there is no need to edit the /etc/exports file. For sharing over NFS, the services nfs-server.service and zfs-share.service should be started.

To make a pool available on the network:

# zfs set sharenfs=on nameofzpool

To make a dataset available on the network:

# zfs set sharenfs=on nameofzpool/nameofdataset

To enable read/write access for a specific ip-range(s):

# zfs set sharenfs="rw=@192.168.1.100/24,rw=@10.0.0.0/24" nameofzpool/nameofdataset

To check if the dataset is exported successfully:

# showmount -e `hostname`

Export list for hostname:
/path/of/dataset 192.168.1.100/24

To view the current loaded exports state in more detail, use:

# exportfs -v

/path/of/dataset
    192.168.1.100/24(sync,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)

To view the current NFS share list by ZFS:

# zfs get sharenfs

SMB

Note: SMB functionality is very limited. The usershare path must be /var/lib/samba/usershares and the only supported sharesmb options are on and off. Enabling guest access via sharesmb=guest_ok=y is not supported.

When sharing through SMB, using usershares in /etc/samba/smb.conf will allow ZFS to setup and create the shares. See Samba#Enable Usershares for details.

/etc/samba/smb.conf

[global]
    usershare path = /var/lib/samba/usershares
    usershare max shares = 100
    usershare allow guests = yes
    usershare owner only = no

Create the usershares directory and set its permissions as root:

# mkdir /var/lib/samba/usershares
# chmod +t /var/lib/samba/usershares

To make a pool available on the network:

# zfs set sharesmb=on nameofzpool

To make a dataset available on the network:

# zfs set sharesmb=on nameofzpool/nameofdataset

To check if the dataset is exported successfully:

# smbclient -L localhost -U%

        Sharename       Type      Comment
        ---------       ----      -------
        IPC$            IPC       IPC Service (SMB Server Name)
        nameofzpool_nameofdataset        Disk      Comment: path/of/dataset
SMB1 disabled -- no workgroup available

To view the current SMB share list by ZFS:

# zfs get sharesmb

Encryption in ZFS using dm-crypt

Before OpenZFS version 0.8.0, ZFS did not support encryption directly (see #Native encryption). Instead, zpools can be created on dm-crypt block devices. Since the zpool is created on the plain-text abstraction, it is possible to have the data encrypted while having all the advantages of ZFS like deduplication, compression, and data robustness. Furthermore, utilizing dm-crypt will encrypt the zpool's metadata, which native encryption inherently cannot provide.[9]

dm-crypt, possibly via LUKS, creates devices in /dev/mapper and their names are fixed. So you just need to change the zpool create commands to point to those names. The idea is to configure the system to create the /dev/mapper block devices and import the zpools from there. Since zpools can be created on multiple devices (raid, mirroring, striping, ...), it is important that all the devices are encrypted, otherwise the protection might be partially lost.

For example, an encrypted zpool can be created using plain dm-crypt (without LUKS) with:

# cryptsetup open --type=plain --hash=sha256 --cipher=aes-xts-plain64 --offset=0 \
             --key-file=/dev/sdZ --key-size=512 /dev/sdX enc
# zpool create zroot /dev/mapper/enc

In the case of a root filesystem pool, the mkinitcpio.conf HOOKS line will enable the keyboard for the password, create the devices, and load the pools. It will contain something like:

HOOKS=(... keyboard encrypt zfs ...)

Since the /dev/mapper/enc name is fixed no import errors will occur.
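Because every member device of a pool has to be encrypted, it can help to refuse pool creation when any member is not a mapped device. The sketch below assumes two dm-crypt mappings named enc0 and enc1 have already been opened with cryptsetup. Those names and both function names are hypothetical. A path under /dev/mapper is only a heuristic, since other device-mapper targets such as LVM volumes appear there as well.

```shell
# Succeed only if every argument is a /dev/mapper path, i.e. a
# device-mapper node rather than a raw disk or partition.
all_mapped() {
    for dev in "$@"; do
        case $dev in
            /dev/mapper/?*) ;;
            *) printf 'not a /dev/mapper device: %s\n' "$dev" >&2
               return 1 ;;
        esac
    done
}

# Create a mirrored pool only when all members are mapped devices.
create_encrypted_mirror() {
    pool=$1; shift
    all_mapped "$@" && zpool create "$pool" mirror "$@"
}

# Usage (as root):
#   create_encrypted_mirror zroot /dev/mapper/enc0 /dev/mapper/enc1
```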

Creating encrypted zpools works fine. But if you need encrypted directories, for example to protect your users' homes, ZFS loses some functionality.

ZFS will see the encrypted data, not the plain-text abstraction, so compression and deduplication will not work. Encrypted data always has high entropy, which makes compression ineffective, and the same input produces different output (thanks to salting), which makes deduplication impossible. To avoid the unnecessary overhead, create a sub-filesystem for each encrypted directory and use eCryptfs on it.

For example, to have an encrypted home (the encryption and login passwords must be the same):

# zfs create -o compression=off -o dedup=off -o mountpoint=/home/<username> <zpool>/<username>
# useradd -m <username>
# passwd <username>
# ecryptfs-migrate-home -u <username>
<log in user and complete the procedure with ecryptfs-unwrap-passphrase>

Emergency chroot repair with archzfs

To get into the ZFS filesystem from a live system for maintenance, there are two options:

Build custom archiso with ZFS as described in #Create an Archiso image with ZFS support.
Boot the latest official archiso and bring up the network. Then enable archzfs repository inside the live system as usual, sync the pacman package database and install the archzfs-archiso-linux package.

To start the recovery, load the ZFS kernel modules:

# modprobe zfs

Import the pool:

# zpool import -a -R /mnt

Mount the boot partition and EFI system partition (if any):

# mount /dev/sda2 /mnt/boot
# mount /dev/sda1 /mnt/efi

Chroot into the ZFS filesystem:

# arch-chroot /mnt /bin/bash

Check the kernel version:

# pacman -Qi linux
# uname -r

uname will show the kernel version of the archiso. If they differ, run depmod (in the chroot) with the kernel version of the chroot installation. Take the version from pacman -Qi linux, but use the exact name of the matching kernel modules directory under the chroot's /lib/modules:

# depmod -a 3.6.9-1-ARCH

This regenerates the module dependency information for the kernel version installed in the chroot.
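The directory name that depmod expects can be read straight from /lib/modules instead of being reconstructed from the package version. A minimal sketch, with hypothetical function names, that lists the installed versions and runs depmod for each one that differs from the running kernel:

```shell
# Print the kernel versions that have a modules directory below
# the given root (default "/", i.e. the chroot itself).
installed_kernel_versions() {
    root=${1:-/}
    for dir in "${root%/}"/lib/modules/*/; do
        [ -d "$dir" ] && basename "$dir"
    done
    return 0
}

# Run depmod for every installed version that differs from the
# running (archiso) kernel. -b sets the base directory depmod uses.
depmod_for_installed() {
    root=${1:-/}
    running=$(uname -r)
    installed_kernel_versions "$root" | while read -r version; do
        [ "$version" = "$running" ] || depmod -a -b "$root" "$version"
    done
}

# Usage (inside the chroot, as root): depmod_for_installed
```

Directories under /lib/modules that do not belong to an installed kernel would be processed as well, so review the output of installed_kernel_versions first.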

Regenerate the initramfs. There should be no errors.

Bind mount

Here a bind mount from /mnt/zfspool to /srv/nfs4/music is created. The configuration ensures that the ZFS pool is mounted before the bind mount is created.

fstab

See systemd.mount(5) for more information on how systemd converts fstab into mount unit files with systemd-fstab-generator(8).

/etc/fstab

/mnt/zfspool		/srv/nfs4/music		none	bind,defaults,nofail,x-systemd.requires=zfs-mount.service	0 0
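The ordering depends entirely on the x-systemd.requires option being present on the right entry. As a quick sanity check, the following sketch (the function name bind_waits_for_zfs is hypothetical) inspects an fstab file for that option on a given mount point:

```shell
# Succeed if the fstab entry whose mount point is $2 lists
# x-systemd.requires=zfs-mount.service among its options.
bind_waits_for_zfs() {
    awk -v target="$2" '
        $1 !~ /^#/ && $2 == target {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "x-systemd.requires=zfs-mount.service")
                    found = 1
        }
        END { exit !found }
    ' "$1"
}

# Usage: bind_waits_for_zfs /etc/fstab /srv/nfs4/music && echo ok
```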

Monitoring / Mailing on Events

See ZED: The ZFS Event Daemon for more information.

An email forwarder, such as S-nail, is required to accomplish this. Test it to be sure it is working correctly.

Uncomment the following in the configuration file:

/etc/zfs/zed.d/zed.rc

 ZED_EMAIL_ADDR="root"
 ZED_EMAIL_PROG="mailx"
 ZED_NOTIFY_VERBOSE=0
 ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"

Update 'root' in ZED_EMAIL_ADDR="root" to the email address you want to receive notifications at.

If you are keeping your mailrc in your home directory, you can tell mail to get it from there by setting MAILRC:

/etc/zfs/zed.d/zed.rc

export MAILRC=/home/<user>/.mailrc

This works because ZED sources this file, so mailx sees this environment variable.

If you want to receive an email no matter the state of your pool, set ZED_NOTIFY_VERBOSE=1. You will need to do this temporarily to test.

Start and enable zfs-zed.service.

With ZED_NOTIFY_VERBOSE=1, you can test by running a scrub as root: zpool scrub <pool-name>.
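A setting that is still commented out is easy to overlook. The sketch below reports which of the four settings from this section are active in a zed.rc-style file. The function names are hypothetical, and the parser only handles simple KEY=value or KEY="value" lines.

```shell
# Print the value of the last active (uncommented) assignment of
# key $2 in file $1, with surrounding double quotes removed.
zed_setting() {
    sed -n "s/^[[:space:]]*$2=//p" "$1" | tail -n 1 |
        sed 's/^"\(.*\)"$/\1/'
}

# Report the state of the mail-related settings in file $1.
zed_check() {
    for key in ZED_EMAIL_ADDR ZED_EMAIL_PROG ZED_EMAIL_OPTS ZED_NOTIFY_VERBOSE; do
        value=$(zed_setting "$1" "$key")
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$key" "$value"
        else
            printf '%s is not set\n' "$key"
        fi
    done
}

# Usage: zed_check /etc/zfs/zed.d/zed.rc
```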

Wrap shell commands in pre & post snapshots

Since snapshots are so cheap to make, we can use them as a safety measure for sensitive commands such as system and package upgrades. If we make one snapshot before and one after, we can later diff them to find out what the command changed on the filesystem. Furthermore, we can roll back if the outcome was not desired.

znp

E.g.:

# zfs snapshot -r zroot@pre
# pacman -Syu
# zfs snapshot -r zroot@post
# zfs diff zroot@pre zroot@post 
# zfs rollback zroot@pre

A utility that automates the creation of pre and post snapshots around a shell command is znp.

E.g.:

# znp pacman -Syu
# znp find / -name "something*" -delete

This creates snapshots before and after the supplied command. The output of the command is also logged to a file for future reference, so we know which command created the diff seen in a pair of pre/post snapshots.
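To illustrate the idea, here is a minimal sketch of such a wrapper. It is not how znp is implemented. The function name with_snapshots and the timestamped snapshot names are assumptions made for this example, and no command output is logged.

```shell
# Run a command between two recursive snapshots of a dataset.
# Usage: with_snapshots <dataset> <command> [args...]
with_snapshots() {
    dataset=$1; shift
    stamp=$(date +%Y%m%d-%H%M%S)
    zfs snapshot -r "$dataset@pre-$stamp" || return 1
    "$@"
    status=$?
    zfs snapshot -r "$dataset@post-$stamp" || return 1
    printf 'Compare with: zfs diff %s@pre-%s %s@post-%s\n' \
        "$dataset" "$stamp" "$dataset" "$stamp"
    # Pass on the exit status of the wrapped command.
    return "$status"
}

# Usage (as root): with_snapshots zroot pacman -Syu
```

The timestamp keeps snapshot names unique across runs, which the fixed @pre and @post names in the manual example do not.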

Remote unlocking of ZFS encrypted root

Warning: Some of the hooks mentioned below, such as mkinitcpio-netconf, mkinitcpio-tinyssh and mkinitcpio-dropbear, have been unmaintained for a long time and have multiple issues, some of them security related. Consider using mkinitcpio-extras^AUR, as it provides the functionality of the above hooks (plus additional functionality such as configuring a shell, changing the port of the SSH server, etc.) and is actively maintained.

As of PR #261, archzfs supports SSH unlocking of natively-encrypted ZFS datasets. This section describes how to use this feature, and is largely based on dm-crypt/Specialties#Busybox based initramfs (built with mkinitcpio).

Install mkinitcpio-netconf to provide hooks for setting up early user space networking.
Choose an SSH server to use in early user space. The options are mkinitcpio-tinyssh or mkinitcpio-dropbear, and are mutually exclusive.
1. If using mkinitcpio-tinyssh, it is also recommended to install tinyssh or tinyssh-convert-git^AUR. This tool converts an existing OpenSSH hostkey to the TinySSH key format, preserving the key fingerprint and avoiding connection warnings. The TinySSH and Dropbear mkinitcpio install scripts will automatically convert existing hostkeys when generating a new initcpio image.
Decide whether to use an existing OpenSSH key or generate a new one (recommended) for the host that will be connecting to and unlocking the encrypted ZFS machine. Copy the public key into /etc/tinyssh/root_key or /etc/dropbear/root_key. When generating the initcpio image, this file will be added to authorized_keys for the root user and is only valid in the initrd environment.
Add the ip= kernel parameter to your boot loader configuration. The ip string is highly configurable. A simple DHCP example is shown below.
```
ip=:::::eth0:dhcp
```
Edit /etc/mkinitcpio.conf to include the netconf, dropbear or tinyssh, and zfsencryptssh hooks before the zfs hook:
```
HOOKS=(... netconf <tinyssh>|<dropbear> zfsencryptssh zfs ...)
```
Regenerate the initramfs.
Reboot and try it out!
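The ip= string is a colon-separated list of fields. The sketch below assumes the field order from the kernel's nfsroot documentation (client IP, server IP, gateway, netmask, hostname, device, autoconf) and that the netconf hook accepts the same syntax. The helper name ip_param, the addresses and the hostname are placeholders.

```shell
# Build an ip= kernel parameter from its first seven fields:
# client IP, server IP, gateway, netmask, hostname, device, autoconf.
ip_param() {
    printf 'ip=%s:%s:%s:%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# DHCP on eth0, equivalent to the example in the list above:
ip_param '' '' '' '' '' eth0 dhcp

# Static address without autoconfiguration (placeholder values):
ip_param 192.168.1.10 '' 192.168.1.1 255.255.255.0 zfsbox eth0 none
```

Empty fields are left blank between the colons, which is why the DHCP form starts with five consecutive colons.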

Changing the SSH server port

By default, mkinitcpio-tinyssh and mkinitcpio-dropbear listen on port 22. You may wish to change this.

For TinySSH, copy /usr/lib/initcpio/hooks/tinyssh to /etc/initcpio/hooks/tinyssh, and find/modify the following line in the run_hook() function:

/etc/initcpio/hooks/tinyssh

/usr/bin/tcpserver -HRDl0 0.0.0.0 <new_port> /usr/sbin/tinysshd -v /etc/tinyssh/sshkeydir &

For Dropbear, copy /usr/lib/initcpio/hooks/dropbear to /etc/initcpio/hooks/dropbear, and find/modify the following line in the run_hook() function:

/etc/initcpio/hooks/dropbear

 /usr/sbin/dropbear -E -s -j -k -p <new_port>

Regenerate the initramfs.

Unlocking from a Windows machine using PuTTY/Plink

First, we need to use puttygen.exe to import and convert the OpenSSH key generated earlier into PuTTY's .ppk private key format. We will call it zfs_unlock.ppk for this example.

The mkinitcpio-netconf process above does not set up a shell (nor do we need one). However, because there is no shell, PuTTY will immediately close after a successful connection. This can be disabled in the PuTTY SSH configuration (Connection > SSH > [X] Do not start a shell or command at all), but it still does not allow us to see stdout or enter the encryption passphrase. Instead, we use plink.exe with the following parameters:

plink.exe -ssh -l root -i c:\path\to\zfs_unlock.ppk <hostname>

The plink command can be put into a batch script for ease of use.

Enabling bclone support

To use cp --reflink and other commands that need bclone support, the pool's feature flags must be upgraded if the pool was created with a version prior to 2.2.2. This is done with zpool upgrade, provided the status of the pool shows that an upgrade is possible.

It is also required to enable a module parameter, otherwise userspace apps will not be able to use this feature. You can do this by putting this into /etc/modprobe.d/zfs.conf:

/etc/modprobe.d/zfs.conf

options zfs zfs_bclone_enabled=1

Check that it is working, and how much space is being saved, with the command: zpool get all POOLNAME | grep clon
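Both conditions, the module parameter and the pool feature, can be checked separately. A minimal sketch with hypothetical function names, assuming the parameter is exposed under /sys/module/zfs/parameters and that the pool reports the feature@block_cloning, bcloneused, bclonesaved and bcloneratio properties:

```shell
# Succeed if the module parameter file contains 1.
# $1 defaults to the sysfs location of the ZFS module parameter.
bclone_param_enabled() {
    param=${1:-/sys/module/zfs/parameters/zfs_bclone_enabled}
    [ -r "$param" ] && [ "$(cat "$param")" = "1" ]
}

# Print the block cloning feature state and the clone accounting
# properties of pool $1 as tab-separated name/value pairs.
bclone_report() {
    zpool get -H -o property,value \
        feature@block_cloning,bcloneused,bclonesaved,bcloneratio "$1"
}

# Usage: bclone_param_enabled && bclone_report POOLNAME
```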

Avoid upgrading to incompatible kernels when using zfs-dkms

When using zfs-dkms^AUR, kernels can upgrade to incompatible versions and post-transaction hooks, such as rebuilding the ZFS module and regenerating initramfs, can encounter errors which go unnoticed until a user reboots.

To avoid such kernel upgrades, one can create a Pacman pre-transaction hook:

/etc/pacman.d/hooks/00-zfs-dkms-guard.hook

[Trigger]
Operation = Install
Operation = Upgrade
Type = Path
# target the LTS kernel
Target = usr/lib/modules/*-lts/vmlinuz
# ...or the mainline one
# Target = !usr/lib/modules/*-rt*arch*/vmlinuz
# Target = usr/lib/modules/*-arch*/vmlinuz
# ...or all of them
# Target = usr/lib/modules/*/vmlinuz

[Action]
Description = Avoid ZFS-incompatible LTS kernels
When = PreTransaction
Exec = /usr/local/bin/zfs_dkms_guard.sh
AbortOnFail
NeedsTargets

/usr/local/bin/zfs_dkms_guard.sh

#!/usr/bin/env bash

set -eo pipefail
shopt -s inherit_errexit

vercomp() {
    readarray -td. one <<<"$1"
    readarray -td. two <<<"$2"
    for i in 0 1; do
        if [[ "${one[$i]}" -lt "${two[$i]}" ]]; then
            printf -- '-1'
            return
        fi
        if [[ "${one[$i]}" -gt "${two[$i]}" ]]; then
            printf '1'
            return
        fi
    done
    printf '0'
}

# requires ZFS >= 0.8.0
# https://github.com/openzfs/zfs/commit/928046b744a6d3f93f5ba1351a6d700a62535a00
zfs_meta_file="$(grep '/META\b' < <(pacman -Qql zfs-dkms))"
zfs_linux_min="$(awk '/Linux-Minimum/ { print $2 }' "$zfs_meta_file")"
zfs_linux_max="$(awk '/Linux-Maximum/ { print $2 }' "$zfs_meta_file")"

while read -r target_path; do
    kernel_version="$(sed -E 's|.*lib/modules/([[:digit:]]+\.[[:digit:]]+).*|\1|' <<<"$target_path")"
    if [[ "$(vercomp "$kernel_version" "$zfs_linux_min")" -lt 0 ]]; then
        printf 'Kernel version %s is below ZFS minimum compatible version %s!\n' "$kernel_version" "$zfs_linux_min" >&2
        exit 1
    fi
    if [[ "$(vercomp "$kernel_version" "$zfs_linux_max")" -gt 0 ]]; then
        printf 'Kernel version %s is above ZFS maximum compatible version %s!\n' "$kernel_version" "$zfs_linux_max" >&2
        exit 1
    fi
done

Note that this does not prevent a ZFS upgrade from dropping support for an older kernel that is already installed.
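The guard script can be exercised by hand before relying on it during an upgrade. Hooks with NeedsTargets receive the matched targets on standard input, so a target path can simply be piped into the script. The helper names below are hypothetical, and the kernel versions in the usage line are examples only.

```shell
# Extract major.minor from a hook target path, using the same
# sed expression as zfs_dkms_guard.sh.
kernel_major_minor() {
    printf '%s\n' "$1" |
        sed -E 's|.*lib/modules/([[:digit:]]+\.[[:digit:]]+).*|\1|'
}

# Feed one target path ($2) to the guard script ($1) on stdin,
# the way pacman passes targets to a hook with NeedsTargets.
guard_dry_run() {
    printf '%s\n' "$2" | "$1"
}

# Usage:
#   kernel_major_minor usr/lib/modules/6.6.32-1-lts/vmlinuz
#   guard_dry_run /usr/local/bin/zfs_dkms_guard.sh \
#       usr/lib/modules/6.6.32-1-lts/vmlinuz && echo compatible
```

Only the major and minor components are compared, so patch releases within a supported series always pass.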