Thursday, June 1, 2023

util-linux v2.39: Improved Mount Interface and Exciting Updates

This release comes with dramatic changes, and the most noticeable change is the support for a new kernel mount API in libmount.

The classic mount(2) system call has been with us since the early days of Linux. This interface is quite simple: you specify the source, target, filesystem options, mount flags, and ask the kernel to do the job. Unfortunately, the interface is too simplistic. These days, attaching a filesystem requires multiple steps, and it's important for userspace to assist in these steps and receive feedback from the kernel after each one.

We are all familiar with the infamous error message:

mount: wrong fs type, bad option, bad superblock on /mnt, missing
       codepage or helper program, or other error.

This error message can be quite cryptic. In this case, mount(8) tried to guess the likely cause of the EINVAL errno, but nothing seemed relevant, so it printed this generic message.

With the new interface, after the syscall fsopen("nonsense", 0), we can inform the user that "mount -t nonsense" is a bad idea.
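To illustrate, here is a minimal C sketch of that kind of feedback. It assumes kernel headers new enough to define SYS_fsopen (Linux 5.2 or later). The helper names fsopen_hint() and try_fsopen() are made up for this example and are not part of libmount, and the hint strings are only suggestions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: turn an fsopen() errno into a hint for the user. */
static const char *fsopen_hint(int err)
{
    switch (err) {
    case ENODEV: return "unknown filesystem type";
    case EPERM:  return "permission denied (CAP_SYS_ADMIN is required)";
    case ENOSYS: return "the kernel does not provide the new mount API";
    default:     return "unexpected error";
    }
}

/* Hypothetical helper: open a filesystem context for fstype.
 * Returns the file descriptor, or -1 after printing a hint to stderr. */
static int try_fsopen(const char *fstype)
{
    int fd = (int) syscall(SYS_fsopen, fstype, 0);

    if (fd < 0) {
        int saved = errno;   /* fprintf() may overwrite errno */
        fprintf(stderr, "mount: %s: %s\n", fstype, fsopen_hint(saved));
        errno = saved;
    }
    return fd;
}
```

Calling try_fsopen("nonsense") as root on a recent kernel should fail with ENODEV, which is far more specific than the generic EINVAL from mount(2).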

Please note that you can still see this generic error in mount(8) from v2.39, as the goal for this release was to adopt the new interface. Optimization will come in future releases, so please be patient.

The new kernel mount interface is a set of syscalls that use file descriptors as the glue between them. This kind of interface is open to new extensions (new syscalls), and userspace and filesystem developers don't have to try to explain the entire universe in a comma-separated mount options string.

File descriptors are a game changer. We have a file descriptor to configure the superblock (the filesystem itself) and another file descriptor to set VFS (Virtual File System) node attributes and attach the node to the VFS tree. The file descriptor remains usable even when you change a namespace, etc.

The important thing is that userspace applications (like the mount(8) command) can work with a filesystem that is not yet attached. It means that mount(8) configures the filesystem, sets VFS flags (such as noexec, ...), and then attaches everything to the VFS to make it visible to other processes in the same namespace.

This is elegant, for example, when you need to set VFS flags in multiple steps. You can now set only the node in one step and recursively set all submounts in another step using "ro,noexec=recursive" in the new mount(8).
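At the syscall level, the difference between the two steps is roughly one flag. The following C sketch shows the idea on an already attached mount point. It assumes kernel headers from Linux 5.12 or later (for SYS_mount_setattr and struct mount_attr); set_mount_attr() is a made-up helper name, and this is not necessarily how libmount implements the "=recursive" option.

```c
#define _GNU_SOURCE
#include <fcntl.h>         /* AT_FDCWD */
#include <linux/mount.h>   /* struct mount_attr, MOUNT_ATTR_* */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_RECURSIVE
# define AT_RECURSIVE 0x8000   /* apply to the entire subtree */
#endif

/* Hypothetical helper: set VFS attributes on the mount at path.
 * If recursive is non-zero, all submounts are changed as well.
 * Returns 0 on success, -1 on error (errno is set). */
static int set_mount_attr(const char *path, unsigned long long attrs,
                          int recursive)
{
    struct mount_attr ma = { .attr_set = attrs };
    unsigned int flags = recursive ? AT_RECURSIVE : 0;

    return (int) syscall(SYS_mount_setattr, AT_FDCWD, path, flags,
                         &ma, sizeof(ma));
}
```

With this helper, set_mount_attr("/mnt", MOUNT_ATTR_RDONLY, 0) changes only the node itself, while set_mount_attr("/mnt", MOUNT_ATTR_NOEXEC, 1) also walks all submounts.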

It also allows previously impossible operations to be mixed together. For example, "mount --move -oro /mnt/A /mnt/B".

Let's delve into the details, for example, the strace output for "mount -t ext4 -o ro /dev/sdc1 /mnt/test":

Classic mount(8):

mount("/dev/sdc1", "/mnt/test", "ext4", MS_RDONLY, NULL);

New interface:

fsopen("ext4", FSOPEN_CLOEXEC) = 3
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/sdc1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0}) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0

The steps in the interface are as follows:

  • Create a filesystem instance with fsopen()
  • Configure the filesystem with fsconfig()
  • Create a VFS node with fsmount()
  • Set VFS flags (e.g., read-only) with mount_setattr()
  • Attach the node to the tree with move_mount() (now it's visible to others)
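For readers who prefer code to strace output, the same sequence can be sketched in C with raw syscall() invocations. This is only an illustration of the steps above, not libmount's implementation. It assumes kernel headers from Linux 5.12 or later, and mount_ext4_ro() is a made-up function name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>         /* AT_FDCWD, AT_EMPTY_PATH */
#include <linux/mount.h>   /* FSOPEN_*, FSCONFIG_*, MOUNT_ATTR_*, ... */
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: mount an ext4 device read-only on target.
 * Returns 0 on success, -1 on error (errno is set). */
static int mount_ext4_ro(const char *source, const char *target)
{
    struct mount_attr attr = { .attr_set = MOUNT_ATTR_RDONLY };
    int fsfd = -1, mntfd = -1, rc = -1, saved;

    /* 1. Create a filesystem context. */
    fsfd = (int) syscall(SYS_fsopen, "ext4", FSOPEN_CLOEXEC);
    if (fsfd < 0)
        goto done;

    /* 2. Configure the superblock, then ask the kernel to create it. */
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", source, 0) < 0)
        goto done;
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) < 0)
        goto done;
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
        goto done;

    /* 3. Create a detached VFS node for the filesystem. */
    mntfd = (int) syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
    if (mntfd < 0)
        goto done;

    /* 4. Set VFS flags on the detached node. */
    if (syscall(SYS_mount_setattr, mntfd, "", AT_EMPTY_PATH,
                &attr, sizeof(attr)) < 0)
        goto done;

    /* 5. Attach the node to the tree; now others can see it. */
    if (syscall(SYS_move_mount, mntfd, "", AT_FDCWD, target,
                MOVE_MOUNT_F_EMPTY_PATH) < 0)
        goto done;

    rc = 0;
done:
    saved = errno;           /* close() must not clobber the error */
    if (mntfd >= 0)
        close(mntfd);
    if (fsfd >= 0)
        close(fsfd);
    errno = saved;
    return rc;
}
```

As root, mount_ext4_ro("/dev/sdc1", "/mnt/test") would correspond to the strace output above. Because each step returns its own error, the caller knows exactly which phase failed.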

Nothing is perfect. This change is so significant that it will take time to make it stable for all use cases. In some cases, the responsibility also lies with the kernel because filesystem drivers have to adopt the new interface as well. For example, btrfs does not work as expected if a SELinux context is specified among the mount options, so libmount falls back to the classic mount(2) in this case.

Another consideration (for libmount) is backward compatibility. Imagine you boot an old kernel without the new interface. You would still expect mount(8) to work, so libmount has to detect that the new syscalls are not available and switch back to the classic mount(2).
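One possible way to probe for the new syscalls is sketched below in C. It assumes kernel headers that define SYS_fsopen, and it relies on the convention that an unimplemented syscall fails with ENOSYS. The function name have_new_mount_api() is invented for this example, and libmount's real detection logic may well differ.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical probe: returns 1 if the kernel seems to implement
 * fsopen(), 0 if the syscall is missing (ENOSYS). */
static int have_new_mount_api(void)
{
    /* A NULL filesystem name can never succeed. A kernel that knows
     * fsopen() is expected to fail with EPERM or EFAULT; a kernel
     * without it fails with ENOSYS. Note that some sandboxes also
     * report ENOSYS for filtered syscalls. */
    int fd = (int) syscall(SYS_fsopen, NULL, 0);

    if (fd >= 0) {           /* not expected, but do not leak the fd */
        close(fd);
        return 1;
    }
    return errno != ENOSYS;
}
```

A caller would use the new syscalls when the probe returns 1 and fall back to the classic mount(2) otherwise.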

In the future, we could improve the mount(8) command-line interface to better reflect the mount process. Currently, we mix operation requests with mount options (e.g., -o remount,bind,ro). It would be nice to differentiate between VFS and filesystem operations. For example:

VFS operation (set /mnt and subdirectories to ro, exec):

   mount modify /mnt --recursive --set ro --unset noexec

FS operation (set superblock to ro, all instances will be ro):

   mount reconfigure /mnt -o ro

What do you think about the "mount <oper> [options]" command-line interface?

And here are some other mount/libmount-related changes:

  • The classic writeable /etc/mtab is dead and no longer supported. Rest in peace.
  • Thanks to Christian Brauner, X-mount.idmap= is now supported, allowing you to change the ownership of all files under the mount node in the user's namespace.
  • We often use "auto" as the filesystem type in fstab, right? :-) It means we rely on libblkid/udev. In some cases, this freedom is unwanted. The new option X-mount.auto-fstypes specifies allowed or forbidden filesystem types.
  • mount(8) should now be less invasive and smarter when used on systems with an automounter and unreachable network filesystems. If you're a developer, consider using statx(AT_STATX_DONT_SYNC|AT_NO_AUTOMOUNT) as a replacement for the classic stat() if you only need very basic information about a file or directory.
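As a rough illustration of the statx() suggestion, the C sketch below checks whether a path is a directory without forcing a network filesystem to sync and without triggering an automount. It assumes glibc 2.28 or later, which provides the statx() wrapper; is_directory_cheap() is a made-up name.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD, AT_STATX_DONT_SYNC, AT_NO_AUTOMOUNT */
#include <sys/stat.h>   /* statx(), struct statx, S_ISDIR() */

/* Hypothetical helper: returns 1 if path is a directory, 0 if it is
 * something else, and -1 on error (errno is set by statx()). */
static int is_directory_cheap(const char *path)
{
    struct statx stx;

    /* Ask only for the file type and accept cached attributes. */
    if (statx(AT_FDCWD, path, AT_STATX_DONT_SYNC | AT_NO_AUTOMOUNT,
              STATX_TYPE, &stx) < 0)
        return -1;

    return S_ISDIR(stx.stx_mode) ? 1 : 0;
}
```

For example, is_directory_cheap("/") returns 1, while a path that does not exist returns -1 with errno set to ENOENT.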

Util-linux v2.39 is definitely not only about mount(8)/libmount:

  • It's time to learn something new with new commands:
    • blkpr(8) is a new command to run persistent reservation ioctls on a device (typically a SCSI or NVMe disk).
    • waitpid(1) is a new command to wait for arbitrary processes.
    • pipesz(1) is a new command to set or examine pipe and FIFO buffer sizes.
  • lsfd(1), a modern Linux-only replacement for lsof, is one of the most actively developed parts of util-linux (thanks to Masatake YAMATO). It's now more user-friendly in the NAME and TYPE columns, supports pidfd, and improves network socket reporting. Try, for example, lsfd --inet -Q '(COMMAND == "systemd")'.
  • libblkid should be more robust, and now it verifies checksums for many RAIDs and filesystems to avoid automatically mounting obsolete or broken superblocks.

Significant changes have also been made to the util-linux test suite and our CI on GitHub. Thanks to Thomas Weißschuh for these and many other improvements.