Bcachefs

From ArchWiki

Bcachefs is a next-generation CoW filesystem that aims to provide features from Btrfs and ZFS with a cleaner codebase, more stability, greater speed and a GPL-compatible license.

It is built upon Bcache and is mainly developed by Kent Overstreet.

Installation

As of kernel 6.7 (January 2024), Bcachefs has been merged into the upstream kernel, so it is available in the linux and linux-zen packages. Other kernel packages may be based on versions older than 6.7 and need special patches for Bcachefs.

The Bcachefs userspace tools are available from bcachefs-tools.

Setup

Single drive

# bcachefs format /dev/sdX
# mount -t bcachefs /dev/sdX /mnt

Multiple drives

Bcachefs stripes data by default, similar to RAID0. Redundancy is handled via the replicas option. 2 drives with --replicas=2 is equivalent to RAID1, 4 drives with --replicas=2 is equivalent to RAID10, etc.

# bcachefs format /dev/sdX /dev/sdY --replicas=n
# mount -t bcachefs /dev/sdX:/dev/sdY /mnt
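
As a rough illustration of what the replicas option costs in capacity, the following sketch divides the raw capacity by the replica count. The helper name usable_gib is hypothetical, and the estimate assumes drives of similar size and ignores metadata, reserved space and compression.

```shell
# Rough estimate only: every extent is stored `replicas` times, so usable
# space is about raw capacity / replicas.
# usable_gib is a hypothetical helper, not part of bcachefs-tools.
# Usage: usable_gib REPLICAS SIZE_GIB...
usable_gib() {
    replicas=$1
    shift
    if [ "$replicas" -lt 1 ]; then
        echo "replicas must be at least 1" >&2
        return 1
    fi
    total=0
    for size_gib in "$@"; do
        total=$((total + size_gib))
    done
    echo $((total / replicas))
}

usable_gib 2 4000 4000              # 2 drives, --replicas=2: prints 4000
usable_gib 2 4000 4000 4000 4000    # 4 drives, --replicas=2: prints 8000
```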

Heterogeneous drives are supported. If they are different sizes, larger stripes will be used on some, so that they all fill up at the same rate. If they are different speeds, reads for replicated data will be sent to the ones with the lowest IO latency. If some are more reliable than others (a hardware RAID device, for example), you can set --durability=2 on that device so that each copy of data on it counts as 2 replicas.

SSD caching

Bcachefs has 3 storage targets: background, foreground, and promote. Writes to the filesystem go to the foreground drives first, and the data is then moved to the background drives over time. Reads are cached on the promote drives.

Note: These are only priority guidelines for a single large pool. Writes will go directly to the background if the foreground is full, or to promote if they both are. Metadata will prefer the foreground, but can be written to any of them. Be careful when removing a cache drive, as it may still contain data. See #Removing a device.

A recommended configuration is to use an ssd group for the foreground and promote, and an hdd group for the background (a writeback cache).

# bcachefs format \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
# mount -t bcachefs /dev/sdA:/dev/sdB:/dev/sdC:/dev/sdD:/dev/sdE:/dev/sdF /mnt

For a writethrough cache, do the same as above, but set --durability=0 on each of the ssd devices. For a writearound cache, set the foreground target to the hdd group and the promote target to the ssd group.
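
The writearound layout can be sketched as a format command. This is only an illustration of the description above: the helper name writearound_format_cmd is hypothetical, the device paths and labels are placeholders, and the function prints the command for review instead of running it.

```shell
# Prints (does not run) a format command for a writearound cache:
# foreground target = hdd group, promote target = ssd group.
# writearound_format_cmd is a hypothetical helper; /dev/sdA, /dev/sdC and
# /dev/sdD are placeholder devices.
writearound_format_cmd() {
    cat <<'EOF'
bcachefs format \
    --label=ssd.ssd1 /dev/sdA \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --foreground_target=hdd \
    --promote_target=ssd
EOF
}

writearound_format_cmd
```

Review the printed command and adjust the devices before running it as root, since formatting destroys existing data.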

Mounting

The default way of mounting is to specify every device in the mount directive.

# mount -t bcachefs /dev/sdA:/dev/sdB:/dev/sdC:/dev/sdD /mnt
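
Typing the colon-separated device list by hand is error-prone with many drives. A minimal sketch of a helper that builds it follows; the name join_devices is hypothetical and not part of bcachefs-tools.

```shell
# join_devices is a hypothetical helper: it joins its arguments with ':'
# to form the device string used by mount -t bcachefs.
join_devices() {
    joined=""
    for dev in "$@"; do
        # Add a ':' separator only when something has been collected already.
        joined="${joined:+$joined:}$dev"
    done
    printf '%s\n' "$joined"
}

join_devices /dev/sdA /dev/sdB /dev/sdC    # prints /dev/sdA:/dev/sdB:/dev/sdC
```

The result can then be passed to mount, for example mount -t bcachefs "$(join_devices /dev/sdA /dev/sdB)" /mnt.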

The mount.bcachefs command supports mounting a filesystem by UUID, which is displayed by bcachefs format on filesystem creation.

# mount.bcachefs UUID=f66d108f-83d2-4679-b50b-7d5e710f6a2b /mnt/

Configuration

Most options can be set:

  • during bcachefs format,
  • after format with bcachefs set-fs-option,
  • at mount time with mount -o option=value,
  • or through sysfs, for example, echo X > /sys/fs/bcachefs/UUID/options/option.

Options given at mount time override those set by the other methods. The other methods save the option to the filesystem's superblock.

Note: The filesystem must be mounted for sysfs to be available. All operations except fsck are possible on a live filesystem.
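
The sysfs method can be wrapped in a small helper that refuses to write when the option file is missing, which is the case when the filesystem is not mounted. This is a sketch under assumptions: the function names option_path and set_option and the SYSFS_ROOT variable are hypothetical, and only the path layout comes from the example above.

```shell
# Hypothetical helpers built on the sysfs layout
# /sys/fs/bcachefs/UUID/options/option.
# SYSFS_ROOT can be overridden, which is mainly useful for testing.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/fs/bcachefs}

# Usage: option_path UUID OPTION
option_path() {
    printf '%s/%s/options/%s\n' "$SYSFS_ROOT" "$1" "$2"
}

# Usage: set_option UUID OPTION VALUE (needs root on a real filesystem)
set_option() {
    path=$(option_path "$1" "$2")
    if [ ! -w "$path" ]; then
        echo "not writable (is the filesystem mounted?): $path" >&2
        return 1
    fi
    printf '%s\n' "$3" > "$path"
}
```

For example, set_option f66d108f-83d2-4679-b50b-7d5e710f6a2b compression zstd writes zstd to the compression option of that filesystem.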

Examples of some available options are:

Bcachefs options
Option Description
metadata_checksum specifies the checksum algorithm used for metadata writes. The default is crc32c. You can choose one of none, crc32c, crc64, xxhash.
data_checksum specifies the checksum algorithm used for data writes. It shares the same default and choices as metadata_checksum.
compression specifies the algorithm used for foreground compression. The default is none. You can choose one of none, lz4, gzip, zstd.
background_compression specifies the algorithm used for background compression. It shares the same default and choices as compression.
str_hash specifies the hashing function used for directory entries and xattrs. You can choose one of crc32c, crc64, siphash.
nocow makes all writes happen in place when possible. Snapshots and reflinks will still cause writes to be CoW. This option implicitly disables data checksumming, compression and encryption.
encrypted enables encryption on the filesystem (chacha20/poly1305). A passphrase will be prompted for.

More options can be found in the bcachefs documentation.

The following can also be set on a per-directory or per-file basis with bcachefs setattr file --option=value. Options set on a directory propagate recursively.

  • data_replicas
  • data_checksum
  • compression, background_compression
  • foreground_target, background_target, promote_target

Note: The rebalance thread does not yet adjust replicas in the background. This means that if you change replica options on files, you have to manually run the rereplicate command to ensure old files follow the new rule.

To check which options are active on a file or directory, run getfattr -d -m 'bcachefs_effective\.' directory/file

Note: Disk usage reporting currently shows uncompressed size. Compression is otherwise complete.

Changing a device's group

The group of a device can be changed through sysfs.

# echo group.drive_name > /sys/fs/bcachefs/filesystem_uuid/dev-X/label
Note: This requires a remount to take effect.

Adding a device

# bcachefs device add --label=group.drive_name /mnt /dev/device

If this is the first drive in a group, you will need to change the target settings to make use of it. This example is for adding a cache drive.

# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/promote_target
# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/foreground_target
# echo old_group > /sys/fs/bcachefs/filesystem_uuid/options/background_target
Note: Only new writes will be striped across added devices. Existing ones will be unchanged until disk usage reaches a certain threshold, when the disk rebalance is triggered. It is not currently possible to manually trigger a rebalance/restripe.

Removing a device

First make sure there are at least 2 metadata replicas (Evacuate does not appear to work for metadata). If your data and metadata are already replicated, you may skip this step.

# echo 2 > /sys/fs/bcachefs/UUID/options/metadata_replicas
# bcachefs data rereplicate /mnt
# bcachefs device set-state ro device
# bcachefs device evacuate device

Setting the state to ro makes the device read-only.

To remove the device:

# bcachefs device remove device
# bcachefs data rereplicate /mnt
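
The whole removal sequence can be collected in order as follows. This is a sketch that only prints the commands from this section for review; the helper name removal_plan is hypothetical, and the first printed step can be skipped if metadata is already replicated.

```shell
# removal_plan is a hypothetical helper: it prints, in order, the commands
# shown in this section for removing one device. Nothing is executed.
# Usage: removal_plan UUID MOUNTPOINT DEVICE
removal_plan() {
    uuid=$1
    mnt=$2
    dev=$3
    cat <<EOF
echo 2 > /sys/fs/bcachefs/$uuid/options/metadata_replicas
bcachefs data rereplicate $mnt
bcachefs device set-state ro $dev
bcachefs device evacuate $dev
bcachefs device remove $dev
bcachefs data rereplicate $mnt
EOF
}

removal_plan f66d108f-83d2-4679-b50b-7d5e710f6a2b /mnt /dev/sdB
```

Running the printed commands one at a time as root makes it easier to stop if a step reports an error.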

Replication

Metadata and data replicas can be configured separately depending upon the level of redundancy a user desires. There are five options relating to replicas:

  • --replicas=X sets the number of metadata and data replicas at the same time.
  • --metadata_replicas=X sets the number of metadata replicas which will eventually be written.
  • --data_replicas=X sets the number of data replicas which will eventually be written.
  • --metadata_replicas_required=X sets the number of metadata replicas which must be written before the metadata is considered "written".
  • --data_replicas_required=X sets the number of data replicas which must be written before the data is considered "written".
The factual accuracy of this article or section is disputed.

Reason: The _required suffix is used to define the limit used for the mount -o degraded option. [1] (Discuss in Talk:Bcachefs)

Note: The distinction between --[meta]data_replicas_required and --[meta]data_replicas is important, as the replicas required value sets the floor for the number of replicas that will be written immediately, whereas the replicas value sets the target number of replicas that will eventually be written.

Compression

Compression is set with the --compression= option. It is also possible to set the compression level. For example, to use zstd at compression level 5, pass --compression=zstd:5.
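
A minimal sketch of building the option string with an optional level follows. The helper name compression_opt is hypothetical; the accepted algorithms are the ones listed in the options table above, and the level is passed through without validation.

```shell
# compression_opt is a hypothetical helper that prints a --compression=
# argument in the algorithm[:level] form described above.
# Usage: compression_opt ALGORITHM [LEVEL]
compression_opt() {
    algo=$1
    level=${2:-}
    case $algo in
        none|lz4|gzip|zstd) ;;
        *) echo "unknown algorithm: $algo" >&2; return 1 ;;
    esac
    if [ -n "$level" ]; then
        printf '%s\n' "--compression=$algo:$level"
    else
        printf '%s\n' "--compression=$algo"
    fi
}

compression_opt zstd 5    # prints --compression=zstd:5
compression_opt lz4       # prints --compression=lz4
```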

Subvolumes

Bcachefs supports subvolumes and snapshots with a userspace interface similar to that of Btrfs. A new subvolume may be created empty, or it may be created as a snapshot of another subvolume. Snapshots are writeable and may be snapshotted again, creating a tree of snapshots.

Snapshots are very cheap to create: they’re not based on cloning of COW btrees as with Btrfs, but instead are based on versioning of individual keys in the btrees. Many thousands or millions of snapshots can be created, with the only limitation being disk space.

Creating a subvolume

To create a new, empty subvolume:

# bcachefs subvolume create /path/to/subvolume

Deleting a subvolume

To delete an existing subvolume or snapshot:

# bcachefs subvolume delete /path/to/subvolume

Creating a snapshot of an existing subvolume

To create a snapshot of an existing subvolume:

# bcachefs subvolume snapshot /path/to/source /path/to/dest
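
Because snapshots are cheap, it is common to take them repeatedly under distinct names. The following sketch builds a timestamped destination path; the helper name snapshot_dest and the naming scheme are assumptions for illustration only.

```shell
# snapshot_dest is a hypothetical helper that builds a destination path of
# the form DIR/NAME-TIMESTAMP. The timestamp defaults to the current time.
# Usage: snapshot_dest DIR NAME [TIMESTAMP]
snapshot_dest() {
    dir=$1
    name=$2
    stamp=${3:-$(date +%Y%m%d-%H%M%S)}
    # ${dir%/} drops one trailing slash so the path has no double slash.
    printf '%s/%s-%s\n' "${dir%/}" "$name" "$stamp"
}

snapshot_dest /snapshots/ home 20240131-120000    # prints /snapshots/home-20240131-120000
```

Assuming /home is a subvolume and /snapshots is a directory on the same filesystem, a snapshot could then be taken with bcachefs subvolume snapshot /home "$(snapshot_dest /snapshots home)".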

A subvolume can also be deleted with a normal rmdir after deleting all the contents, as with rm -rf.

Features including recursive snapshot creation and a method for recursively listing subvolumes are still to be implemented.

Tips and tricks

Check the journal for more useful error messages.

Flag ordering

Some bcachefs format flags are set based upon their argument order and only affect drives that come after the flag is toggled. For example, if you want SSDs to have --durability=0 and enable --discard while HDDs use defaults, make sure arguments are passed in the following order:

# bcachefs format \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --durability=0 --discard \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd

Setting replicas after format

It is possible to set replica count after format using set-fs-option.

# bcachefs set-fs-option --metadata_replicas=2 --data_replicas=2 /dev/sdX

Afterwards you'll need to tell bcachefs to ensure that all files have a replica with:

# bcachefs data rereplicate /mnt

Troubleshooting

32-bit programs cannot see directory contents

Some 32-bit programs may fail to retrieve the contents of directories in Bcachefs, because the data returned by the filesystem for a readdir(3) call is incompatible with them. [2]

This can be worked around by temporarily using a different filesystem, such as tmpfs, for such a program to read and write from.

swapfile contains holes or other unsupported extents.

Bcachefs does not currently support swapfiles.

Multi-device fstab

A bug in systemd currently prevents it from mounting a multi-device bcachefs filesystem at boot when the devices are given as a colon-separated list in fstab. Such an entry works with mount -a, but is not mounted at boot. However, since bcachefs-tools version 1.7.0 it is possible to mount a multi-device array using one device node; this allows the use of the normal UUID specifier in /etc/fstab:

UUID=10176fc9-c4fa-4a30-9fd0-a756d861c4cd     /mnt   bcachefs defaults,nofail 0 0
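
A minimal sketch for generating such an entry from a UUID and a mount point follows. The helper name fstab_line is hypothetical, and the options defaults,nofail are taken from the example entry above.

```shell
# fstab_line is a hypothetical helper that prints a tab-separated fstab
# entry for a bcachefs filesystem identified by UUID.
# Usage: fstab_line UUID MOUNTPOINT
fstab_line() {
    printf 'UUID=%s\t%s\tbcachefs\tdefaults,nofail\t0 0\n' "$1" "$2"
}

fstab_line 10176fc9-c4fa-4a30-9fd0-a756d861c4cd /mnt
```

The printed line can be reviewed and then appended to /etc/fstab as root.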

The filesystem UUID / External UUID can be found using either of the following:

# bcachefs fs usage /mnt
# bcachefs show-super /dev/sdXY

Mounting an encrypted device errors

When mounting a device created with the --encrypted option fails after bcachefs unlock /dev/sdXY with

ERROR - bcachefs::commands::cmd_mount: Fatal error: Required key not available

it can be worked around by manually linking the keys to the session [3]:

# keyctl link @u @s
# mount /dev/sdXY /mnt
Enter passphrase:

The passphrase does not need to be entered again when mount prompts for it; pressing Enter suffices.

See also