Filesystems

"Happy filesystems are all alike, but every corrupted filesystem is unhappy in its own way." -- Jonathan Corbet (2011)
Every Unix system has at least one filesystem which contains the root of the tree of files. This filesystem, the root filesystem, is normally stored on a local block device, for example on an SSD. On top of that, several additional filesystems of different types are usually mounted. For example, most Linux distributions mount the proc and sys pseudo filesystems and several instances of tmpfs. Other filesystems may be mounted as needed.
The Linux kernel supports several dozen different filesystems, and new ones are added frequently. Some filesystems are only employed on special-purpose systems while others are in use on almost all systems. The /proc/filesystems pseudo file contains the list of supported filesystems. We don't aim to provide a full listing but classify filesystems as belonging to exactly one of the following categories.
Regardless of the category, most native Unix filesystems support the semantics prescribed by the POSIX.1-2008 standard. These include the open(2), read(2), write(2) and stat(2) system calls, and many more. In particular, a POSIX filesystem must store in each file's metadata the user and group ID of the owner and the usual three timestamps. This is not possible for "foreign" filesystems like Microsoft's FAT, which was designed for the single-user DOS operating system and thus has no concept of file ownership, but which is supported for compatibility reasons.
Early Unix systems already supported user and group quotas, while directory and project quotas are comparatively new concepts. User and group quotas impose an upper bound on the files owned by a specific user or Unix group, respectively. Directory quotas restrict the size of an accounted directory. Project quotas are similar to directory quotas but are not restricted to single directories. They are realized as an aggregation of unrelated inodes with a specific identifier, the project ID, which is stored in each accounted inode. It is possible for arbitrary files to have the same project ID and hence be accounted to the same project. The project ID is independent of the UID and the GID, hence project accounting is independent of user and group accounting.
For each quota type there are two configurable limits: the inode limit, which imposes a bound on the number of files and directories owned by a specific user/group/project ID, and the block limit, which bounds the space that the same set of files is permitted to occupy. Each limit is in fact a pair of limits: the soft limit and the hard limit. When utilization reaches the soft limit, a message is sent but no further action is taken. Only when the hard limit is reached do subsequent attempts to request more resources fail with the EDQUOT ("Disk quota exceeded") error.
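The following minimal sketch shows what hitting the hard block limit looks like from user space. The path is made up; the program simply writes until the filesystem reports EDQUOT.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/home/joe/fill", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	memset(buf, 0, sizeof(buf));
	for (;;) {
		if (write(fd, buf, sizeof(buf)) < 0) {
			if (errno == EDQUOT)	/* hard limit reached */
				printf("Disk quota exceeded\n");
			else
				perror("write");
			break;
		}
	}
	return close(fd);
}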
The superblock describes the geometry of the filesystem. Only one superblock is needed for normal operation, but most local filesystems store backup copies of the superblock. These extra superblocks are only accessed during recovery from filesystem corruption, for example if the main superblock was overwritten by accident. Although the details vary between filesystem types, the superblock typically stores global information such as the block size, the total and free numbers of blocks and inodes, the UUID and label, and flags which describe the filesystem state and the enabled features.
The blkid library, libblkid, contains a database of filesystems and of other software which stores its metadata in a specific superblock of a block device, including swap devices and physical volumes for software raid or the logical volume manager. The library enables applications to identify the contents of a block device and to extract additional information like the UUID. A number of tools are linked against this library to examine or modify superblocks: lsblk(8) to list block devices, blkid(8) to print block device properties, and wipefs(8) to overwrite (or merely print) all superblocks of a given block device. Also the mount(8) executable is usually linked against libblkid to support mounting by UUID instead of device name, because device names like /dev/sda might change across reboots.
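As a rough sketch of how an application might use libblkid (the device name below is an arbitrary example, and the program must be linked with -lblkid):

#include <blkid/blkid.h>
#include <stdio.h>

int main(void)
{
	const char *dev = "/dev/sda1";	/* arbitrary example device */
	blkid_cache cache;
	char *uuid, *type;

	if (blkid_get_cache(&cache, NULL) < 0)
		return 1;
	uuid = blkid_get_tag_value(cache, "UUID", dev);
	type = blkid_get_tag_value(cache, "TYPE", dev);
	printf("%s: UUID=%s TYPE=%s\n", dev, uuid ? uuid : "?", type ? type : "?");
	return 0;
}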
Most filesystems, including the ext4 and xfs filesystems described in dedicated sections, employ some B-tree variant to manage their data blocks. This is reason enough to take a closer look at the key features of this ubiquitous data structure. We won't go into detail, though.
B-trees were invented in the 1970s by Rudolf Bayer and Ed McCreight as a data structure for an algorithm that can quickly access a random block in a particular file stored on a rotating disk. The parameters can be tuned to minimize the number of disk operations performed, i.e., the height of the tree.
It is unclear what the "B" in "B-tree" actually means. It certainly does not mean "binary", because B-trees typically have a high fanout, so nodes have way more than two children (which, by definition, is the maximal number of child nodes for a binary tree). The typical fanout values for filesystems range from 100 to 1000.
B-trees impose a fixed lower bound on the number of child nodes, and an upper bound that is twice as large as the lower bound. The upper bound is called the order of the tree. These bounds imply an upper bound for the maximal height of the tree, given the number of leaf nodes. Hence a B-tree is always well balanced, lookup times are always optimal (i.e., logarithmic), and storage utilization is at least 50%. Unlike a hash table approach there is no decrease in performance when utilization approaches 100%.
Addition and removal of nodes is performed in a way that keeps the tree balanced. For example, if a node is being removed and this results in a violation of the lower bound on the number of child nodes of the parent node, nodes are moved between siblings, or siblings are merged.
A node with k children always has k - 1 keys. For filesystems, the keys can be block numbers, hashes, directory entries, or the size of block ranges. In any case, the keys act as separators which divide the child nodes. For example, if the node contains 3 child nodes and the two keys 23 and 42, then the left child node contains keys less than 23, the middle child node contains keys between 23 and 42, and the right child node contains keys greater than 42.
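A minimal sketch (the node layout and the names are made up for illustration, not taken from any real filesystem) of how the keys of a node act as separators during a lookup:

#include <stddef.h>

struct btree_node {
	int nr_keys;			/* a node with k children has k - 1 keys */
	unsigned long keys[255];	/* separators, sorted in ascending order */
	struct btree_node *child[256];	/* child pointers, NULL in leaf nodes */
};

/* Return the child that may contain "key": child i holds the keys between
 * keys[i - 1] and keys[i]. */
static struct btree_node *descend(const struct btree_node *node, unsigned long key)
{
	int i = 0;

	while (i < node->nr_keys && key >= node->keys[i])
		i++;
	return node->child[i];
}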
Many filesystems, including xfs, do not use the classic B-tree structure outlined above but its variant called B+tree. The main differences between a B-tree and a B+tree are (a) data records are stored only in leaf nodes, and (b) leaves are linked together so that all data records can be traversed in-order by following sibling pointers. This little detail has far-reaching consequences. Roughly speaking, the sibling pointers prohibit copy on write for metadata blocks.
Single filesystem operations often need to update multiple blocks. For example, adding a new file to a directory requires three block updates: the data block which contains the new file's contents, the metadata block which contains the directory entries, and the on-disk structures that manage the free blocks. Regardless of the order in which these updates are performed, an unclean shutdown due to a power outage or a system crash leads to a corrupt filesystem if the shutdown happens after the first but before the third update was performed. Journaling is a capability which avoids this situation by making multiple block updates atomic, thereby ensuring consistency after an unclean shutdown.
One of the first journaling filesystems was jfs, introduced in 1990 by IBM for the AIX operating system. The first journaling filesystem for Linux was reiserfs version 3, which was included in the Linux kernel in 2001. In the same year Linux gained support for three additional journaling filesystems: jfs and xfs were ported to Linux from AIX and IRIX, respectively, and the first stable version of ext3 was released.
Journaling filesystems retain consistency after an unclean shutdown by keeping a journal (also known as log) of operations being performed. The journal is either stored on a separate device or in a reserved area within the filesystem. The entries of the journal, the log records, describe transactions, which are filesystem operations that must be performed atomically. At the next mount after an unclean shutdown, the journal is replayed, that is, the recorded transactions are reapplied. The time required to replay the journal depends only on the size of the journal and the number of log records, but not on the size of the filesystem or the number of files. It usually takes only a couple of seconds, which is considerably faster than a filesystem check/repair run, which can take hours.
Although a journaling filesystem writes metadata twice, this can actually increase performance because metadata writes to the journal are sequential, and when the log entries are committed, writes can be combined and reordered, which is a win not only for rotating disks. Since data integrity is usually less important than filesystem integrity, only metadata (inodes, directory contents) is journaled by default while data blocks (file contents) are written directly.
Delayed allocation refers to the technique of deferring the decision of which blocks to allocate until the last possible moment, when blocks have to be written out due to memory pressure or an explicit sync request from user space. This technique is employed by xfs and ext4 but not by earlier versions of the ext* family.
Delayed allocation improves the performance of the filesystem because the allocator has better knowledge of the eventual file size, so files are more likely to be laid out in an optimal way: data blocks sequentially, and close to the metadata blocks that describe the file. Moreover, small temporary files can be fully buffered in memory and don't cause any block allocation at all if the in-memory data structures have already been removed when writeout triggers. Delayed allocation is more effective on large memory systems because with plenty of memory available, allocations can be deferred for longer.
- Run cat /proc/filesystems to see all filesystems supported on this system.
- Run mount | awk '{print $5}' | sort | uniq to examine the different filesystem types which are mounted. Explain the purpose of each and determine to which of the above classes the filesystem belongs.
- Compare the contents of /etc/mtab and /proc/filesystems. Check whether /etc/mtab is a symbolic link. Discuss whether it should be one.
- Explain what the sync(2), fsync(2) and fdatasync(2) system calls guarantee.
- Consider the command sequence touch foo; echo bar > foo; sync; mv foo bar; echo baz > foo; fsync foo. Assume the system crashed at some point in between. Describe the possible states of the two files after the journal has been replayed.

Although our focus is on topics related to journaling filesystems, we shall have a brief look at two different (but related) approaches which also guarantee consistency after a crash. Both approaches depart from the traditional way of updating data and metadata structures. To illustrate the difference, consider the situation where a line is appended to an existing text file on a traditional filesystem. The filesystem first updates the data blocks which contain the file contents. Since the file size and the modification time have changed, the inode of the file needs to be updated as well. Finally, the on-disk data structures which keep track of the used and unused blocks might also need to be updated if a new block had to be allocated for the additional line. All three updates are usually performed in-place by overwriting the existing blocks. The filesystems described in this section are different in that they avoid such in-place changes.
A log-structured filesystem only writes sequential log entries which describe the changes that have been made, essentially treating the entire space as the journal. The first log-structured filesystem was developed in the 1990s. Log-structured filesystems in use today include ubifs (the unsorted block image filesystem) and f2fs (the flash-friendly filesystem); an earlier example, logfs, has since been removed from the Linux kernel.
The layout of the data and metadata blocks of a log-structured filesystem has advantages and disadvantages. One advantage is that writes are always sequential, which is particularly good for rotating disks. However, reading a large file sequentially becomes slower because of data fragmentation. The purely sequential writes also reduce wear on the storage media, particularly flash memory, because without in-place writes, all blocks are written about the same number of times. Other advantages of log-structured filesystems are that crash recovery is conceptually simple and that snapshots come naturally.
The main disadvantage is the overhead incurred by garbage collection: getting rid of old log entries that have been superseded by later ones. This overhead is particularly large if free space is short. Another disadvantage is related to inode management: since inodes are scattered throughout the disk, and the location of an inode changes whenever the file is updated, some kind of inode map is required to locate inodes. Updates to the inode map can be written to the log sequentially, but the current location of the inode map must be stored in a special fixed location called the checkpoint region so that the filesystem can recover from a crash. Since every write to the checkpoint region causes a seek, log-structured filesystems update the inode map only once in a while, for example once per minute. This process is known as checkpointing.
In the context of filesystems, the term copy on write
(CoW) means to not overwrite existing data or metadata blocks as
files are modified. Instead, whenever the filesystem needs to modify
a data or metadata block, it writes the modified block to a different,
currently unused location, leaving the contents of the original
block intact. Next, the metadata blocks that need to be altered
are also written to a free location without overwriting existing
metadata blocks. For example, in the scenario outlined above where the
write(2)
system call extends an existing file, the three
updates for data, inode and free space information are performed as
writes to unused locations. Next, the filesystem switches to the new
metadata to commit the change. CoW filesystems are always consistent
if this last step can be performed atomically.
Two well-known open source CoW filesystems are zfs and btrfs (the B-tree filesystem). The first stable release of zfs for the Solaris operating system appeared in 2005, after four years of development by Sun Microsystems. Since then zfs has been ported to several other operating systems, including FreeBSD and Linux. However, the zfs code is licensed under the CDDL, which is regarded as incompatible with the GPL. For this reason, zfs is not included in the mainline Linux kernel but is available as a third-party kernel module. Btrfs is another CoW filesystem which was designed for Linux from the start. It was merged into the Linux kernel in 2009. The feature sets of zfs and btrfs are similar, but the implementation details vary.
CoW filesystems also have disadvantages. On a system where multiple processes append data to different files simultaneously, the data blocks of each file will be fragmented, which is bad for performance. For the same reason, metadata blocks are fragmented as well, so performance suffers if the filesystem contains many files. Another disadvantage is related to the fact that it is difficult to tell how much space is going to be needed for a given CoW operation, which has caused an endless list of bugs that occur when disk space gets tight. This is why CoW filesystems should never use more than a certain ratio of the available space. For zfs the recommended limit is as low as 70%.
Search the web for "blatant layering violation" and glance over the (sometimes heated) discussions on whether or not it is a good idea to combine volume management, raid and filesystems. Then form your own opinion about the topic.
The dm-crypt device mapper target, which was covered in the chapter on LVM, operates at the block level. It encrypts and decrypts one block at a time, regardless of whether the block is in use. This is in contrast to filesystem-level encryption, where encryption is performed by the filesystem on a per inode basis so that, for example, different files can be encrypted with different keys.
In this section we look at two filesystem-level encryption primitives for Linux: ecryptfs (the enterprise cryptographic filesystem) and fscrypt (filesystem encryption). The former was included in Linux in 2006 while the latter is much newer. It was originally part of the flash-friendly filesystem (f2fs) but was made generic in 2015. Besides f2fs, ext4 and ubifs also rely on fscrypt for encryption.
By definition, filesystem-level encryption means to encrypt the
contents of regular files. In addition to that, both ecryptfs and
fscrypt also encrypt file names. However, information stored in the
inode is left unencrypted. Therefore, without the encryption key it is
still possible to list directories as usual, but the list will contain
only encrypted filenames. Moreover, file size and timestamps can be
read as for unencrypted files with the standard stat(2)
system call.
ecryptfs is a so-called stacked filesystem. That is, it relies on an (arbitrary) mounted filesystem as backend storage. An ecryptfs mount is similar to a bind mount in that it makes the files stored at the source location (the mountpoint of the backend storage) visible at the target location. However, the source usually contains only encrypted files while files appear unencrypted at the target location. Each encrypted file is self-contained in the sense that it starts with a header which, together with the encryption key, is sufficient to decrypt the file (and the file name). Hence encrypted files can be copied between hosts, and encrypted files can be backed up without telling the backup software the encryption key.
fscrypt takes a different approach. It provides encryption
through a general library that can, in principle, be used by any
filesystem. With fscrypt it is possible to store both encrypted and
unencrypted files on the same filesystem. fscrypt has a lower memory
footprint than ecryptfs since it avoids caching filesystem contents
twice. Also, only half as many directory entries and inodes are
needed. Another advantage of fscrypt is that the fscrypt API can be
used by unprivileged users, with no need to mount a second filesystem.
The major drawback of fscrypt is that the open(2) system call fails without the key. Since backup software has to open regular files, it is not possible to back up encrypted files without the encryption key.
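A small sketch of this asymmetry, assuming a hypothetical path inside an fscrypt-protected directory whose key has not been added to the kernel: stat(2) succeeds and reports size and timestamps, while open(2) fails, typically with ENOKEY.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
	/* hypothetical encrypted file name as seen without the key */
	const char *path = "/mnt/encrypted/5C2dv0lqticUUVty1ewlHw";
	struct stat st;

	if (stat(path, &st) == 0)	/* works without the key */
		printf("size: %lld, mtime: %lld\n",
		       (long long)st.st_size, (long long)st.st_mtime);
	if (open(path, O_RDONLY) < 0)	/* fails without the key */
		printf("open: %s\n", strerror(errno));
	return 0;
}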
The main task of the vfs is to provide an abstraction for applications to access files in a uniform way, even if the files are stored on different filesystems. The vfs is responsible for parsing path names received from user space via system calls and for forwarding the requests it can not handle itself to the specific filesystem implementation, thereby associating paths with instances of mounted filesystems. This encourages a modular filesystem design where filesystems are opaque entities which provide a certain set of methods called filesystem operations, for example mounting the filesystem or opening a file. The modular design helps to avoid code duplication, which is important for operating systems like Linux which support many different filesystems.
The first vfs implementation was probably the one shipped in Sun's SunOS in 1985. Linux got its first vfs implementation together with the extended filesystem, the predecessor of ext2, ext3 and ext4. All modern operating systems have some sort of vfs, although implementation details differ. In what follows, we shall only discuss the Linux vfs.
Filesystems register themselves with the vfs at boot time or when the kernel module for the filesystem is loaded. The vfs keeps track of the available filesystem types and all mounts. To perform efficiently, the vfs maintains several data structures which describe the characteristics of the tree of files. We look at the most important data structures below but leave out the rather complicated details about how the various locking primitives (spinlocks, refcounts, RCU) are employed to deal with concurrency.
A dentry (short for "directory entry") is a data structure which represents a file or a directory. Dentries contain pointers to the corresponding inode and to the parent dentry. The vfs maintains the dentry cache, which is independent of the normal page cache that keeps copies of file contents in memory. Dentries are kept in hashed lists to make directory lookups fast. Dentries are also reference-counted. As long as there is a reference on a dentry, it can not be pruned from the dentry cache. Unreferenced dentries, however, can be evicted from the cache at any time due to memory pressure. Each dentry also has a "looked up" flag which lets the vfs evict dentries that have never been looked up before those that have.
On a busy system the dentry cache changes frequently. For example, file creation, removal and rename all trigger an update of the dentry cache. Clearly, some sort of coordination is needed to keep the dentry cache consistent in view of concurrent changes, like a file being deleted on one CPU and looked up on another. A global lock would scale very poorly, so a more sophisticated method called RCU-walk is employed. With RCU, lookups can be performed without taking locks, and read operations can proceed in parallel with concurrent writers.
The dentry cache also contains negative entries
which represent nonexistent paths which were recently looked up
unsuccessfully. When a user space program tries to access such a path
again, the ENOENT
error can be returned without involving
the filesystem. Since lookups of nonexistent files happen frequently,
failing such lookups quickly enhances performance. For example
import
statements from interpreted languages like Python
benefit from the negative entries of the dentry cache because the
requested files have to be looked up in several directories. Naturally,
negative dentries do not point to any inode.
Positive entries in the dentry cache point to inode
objects, which are in-memory copies of the on-disk inode
structures maintained by the filesystem. Different dentry cache entries
can map to the same inode object if the underlying filesystem supports
hard links, but entries which refer to directories are unique. The
stat(2)
system call can be served without calling into
the filesystem if the path argument of the system call corresponds
to an entry of the dentry cache.
When a file is opened, the vfs allocates a file object
(also known as file description and struct file),
and adds a reference to the file object to the calling process' table
of open files. The index to this table is returned as the file
descriptor from the open(2)
system call. The file
object contains a reference to the dentry and to the filesystem
specific methods for the usual operations like read(2)
and write(2)
. Also the file offset and the file status
flags (O_NONBLOCK, O_SYNC, etc.
) are recorded in the
file object.
Like dentries, file objects are reference-counted. Once the
counter hits zero, the file object is freed. System calls like
dup(2)
and fork(2)
increase the reference
count while close(2)
and exit(2)
decrease
it. However, not all file object references correspond to file descriptors, since the kernel itself can also hold references to a
file object. For example, the loop device driver and the ecryptfs
stacked filesystem increase the reference counter of the file objects
they work with. Passing a file descriptor from one process to another
via local sockets is another situation where the reference counters
of the affected file object need to be adjusted.
When an application calls dup(2)
to duplicate a
file descriptor, the two copies refer to the same file object and
therefore share the file offset and the status flags of the file
object. An open(2)
call, however, creates a new file
object even if the file is already open.
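A minimal sketch (the file name is made up) of the difference: descriptors created with dup(2) share one file object and hence one file offset, while a second open(2) of the same path gets its own.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd1 = open("/tmp/offset-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
	int fd2 = dup(fd1);				/* same file object */
	int fd3 = open("/tmp/offset-demo", O_RDONLY);	/* new file object */

	write(fd1, "hello", 5);
	printf("fd2 offset: %lld\n", (long long)lseek(fd2, 0, SEEK_CUR)); /* 5 */
	printf("fd3 offset: %lld\n", (long long)lseek(fd3, 0, SEEK_CUR)); /* 0 */
	return 0;
}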
Another job of the vfs is to keep track of the tree of all mounts and their properties. To do so, the vfs maintains a tree of mount structures. Whenever a filesystem is mounted, one such structure is allocated and linked into the tree. Among other information, the mount structure contains various pointers, including pointers to the parent mount, to the dentry of the mountpoint, and to the superblock of the mounted filesystem, as well as the per-mount flags (MS_RDONLY, MS_NOEXEC, MS_NOATIME, etc.). Since a filesystem can be bind-mounted, there can be several mount structures whose superblock pointers point to the same superblock. The superblock structure contains the UUID, quota information, granularity of timestamps, and much more.
- Run strace -e%file ls to see the (negative) dentry cache in action.
- Guess which system calls the df(1) command needs to perform to do its work. Confirm by running strace df.
- The mountpoint(1) command determines if a given directory is the mountpoint of a filesystem. Explain how the command is implemented.
- Suppose a process opens a file with open(2), then forks so that the child process inherits the file descriptor. Assume further that the child calls lseek(2) to reposition the file offset. Will the parent process see the modified offset?
- What happens if you run fsck(8) on a filesystem which has been lazily unmounted (by executing umount -l)?
- What are the benefits of mounting by UUID rather than by device name in /etc/fstab? Figure out how to print the UUID of an ext4 and an xfs filesystem.
- If a filesystem is copied to another block device with dd(8), the two UUIDs match. The same is true if the block device was snapshotted with LVM (regardless of whether thin or regular snapshots were used). Discuss the consequences of this fact.
- The umount -a command unmounts all filesystems which are not busy. Describe an algorithm which walks the mount tree to implement the command, taking into account that a mount is busy if it has at least one child mount.

When the first versions of Linux were released in 1991, the kernel relied on the minix filesystem, whose source code could be used freely in education. However, this filesystem was designed for education rather than for real use and had severe limitations. For example, it supported only file names up to 14 characters long, and a total size of 64M.
In 1992 the extended filesystem (ext) was released as the first filesystem that was created specifically for Linux. It already made use of the VFS API that was introduced at the same time. In 1993, ext was superseded by ext2, and later by ext3 (2001) and ext4 (2008). While ext2 is a separate filesystem, the ext3 and ext4 filesystems share the same implementation.
Over time, many new features were added to ext3 and later to ext4. For example, journaling was one of the main features that ext3 added on top of ext2 while delayed allocation and on-disk data structure checksumming came with ext4. In the remainder of this section we look at the way the ext* filesystems lay out their data and metadata blocks and describe some of the features of ext3 and ext4.
All ext* filesystems lay out the space of the underlying block device in a traditional way, inspired by the original Unix filesystem, ufs. The available blocks are partitioned into block groups. Each block group is typically 128M large and always contains its own set of tables and bitmaps that manage the blocks of the group. If feasible, the data blocks of each file are kept in the same block group as its inode and the containing directory to avoid unnecessary seeks. As of ext4, block groups can be combined to form larger, so-called flexible block groups to improve metadata locality and to have large files laid out sequentially.
Inodes are referenced in the inode table, and a bitmap keeps track of allocated and unallocated inodes. A copy of the filesystem superblock is stored in several other block groups since the superblock is critical for the integrity of the filesystem.
The directory entries of an ext or ext2 filesystem are laid out in the traditional way. That is, the directory names and associated information like file type and the inode number are listed in the data blocks that correspond to the directory. Directories can be imagined as a 3-column table where each line contains an inode number, a file type number, and the path component. Since searching a linear array performs poorly, ext3 implemented B-trees to speed up name lookups in large directories. The nodes of the tree are keyed by the hashes of the names of the directory entries.
If journaling is enabled, metadata updates are always journaled (i.e., a log record is written to the journal first). However, this is not always the case for data blocks. The data mount option for ext3 and ext4 specifies how writeout of data blocks works. The following three journaling modes offer different trade-offs between speed and data integrity.

- data=journal: both data and metadata blocks are written to the journal before they are written to their final location. This is the safest but also the slowest mode.
- data=ordered (the default): data blocks are written directly rather than to the journal (unlike data=journal), but the update of the corresponding metadata blocks is deferred until all data blocks have been written.
- data=writeback: data blocks may be written out before or after the corresponding metadata is committed to the journal, so after a crash a file can contain stale data.

The ext2 and ext3 filesystems employ the traditional indirect block scheme, which is basically a table of block numbers which map to the blocks that comprise the contents of the file. One feature of ext4 is extent trees, a more efficient data structure to describe this mapping, because file contents are often laid out sequentially on the underlying block device. In this case, if the file is large, it saves metadata space if the block numbers are not stored as a long list but as extents, that is, ranges of successive blocks. Extents have to be organized in a tree to quickly map a given file offset to the block number that contains the file contents at this offset. The first few extents of the extent tree, including its root, are stored in the inode itself. Only files which need more extents require extra metadata blocks for the nodes of the extent tree.
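The extent mapping of a file can be inspected from user space. The following sketch uses the generic FIEMAP ioctl (similar in spirit to what filefrag(8) and xfs_bmap(8) report); it requests at most 32 extents and prints their logical and physical positions.

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argc > 1 ? argv[1] : "/etc/fstab", O_RDONLY);
	struct fiemap *fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
	unsigned i;

	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = 32;	/* room for up to 32 extents */
	if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("fiemap");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++)
		printf("logical %llu -> physical %llu, %llu bytes\n",
		       (unsigned long long)fm->fm_extents[i].fe_logical,
		       (unsigned long long)fm->fm_extents[i].fe_physical,
		       (unsigned long long)fm->fm_extents[i].fe_length);
	return 0;
}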
Filesystems often need to be grown, but sometimes it is also handy to shrink a filesystem, for example to redistribute the storage between the logical volumes of a volume group of fixed size. A feature which distinguishes ext* from xfs is that ext2, ext3 and ext4 filesystems can be shrunk while xfs filesystems can only be grown. However, to shrink an ext* filesystem, the filesystem must be offline. That's in contrast to online growing, which is supported for both ext* and xfs.
To shrink a filesystem which is stored on a logical volume,
one needs to convey the new size to both the device mapper and the
filesystem. It is usually a good idea to run fsck(8)
after
the filesystem has been unmounted and before resize2fs
is run to shrink it. After the filesystem has been shrunk, the next
step is to shrink also the underlying LV. Of course the new size of
the LV must not be smaller than the size of the shrunk filesystem. To
prevent this from happening, for example due to rounding errors,
it's best to make the LV slightly larger than necessary and then
enlarge the filesystem to the maximal possible size by running
resize2fs(8)
without specifying the new size.
- Examine the superblock of an ext4 filesystem by running dumpe2fs -h <device>.
- Explore the tunables below /sys/fs/ext4.
- Explain how the contents of a file foo can be lost after a crash even if the filesystem is mounted with data=ordered (the default journaling mode). How does the auto_da_alloc mount option of ext4 try to mitigate the chance of data loss?
- Have a look at the xfs_repair(8) man page.
Project quotas used to be an xfs feature, but the functionality has been made generic and is therefore available to other filesystems as well. Besides xfs, also ext4 supports project quotas.
To limit the size of an arbitrary subtree, a special inode flag,
XFS_DIFLAG_PROJINHERIT
, is used. This flag indicates
that the directory and all inodes created in the directory inherit
the project ID of the directory. Hence the act of creating a file
in a XFS_DIFLAG_PROJINHERIT
marked directory associates
the new file with a specific project ID. New directories also get
marked with XFS_DIFLAG_PROJINHERIT
so the behaviour is
propagated down the directory tree.
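In practice this is usually configured with xfs_quota(8), but the underlying mechanism can be sketched in C via the generic fsxattr ioctls; the directory path and project ID below are made up, and FS_XFLAG_PROJINHERIT is the generic counterpart of the xfs flag mentioned above.

#include <fcntl.h>
#include <linux/fs.h>		/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_PROJINHERIT */
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/xfs/accounted-tree", O_RDONLY | O_DIRECTORY);
	struct fsxattr fsx;

	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return 1;
	fsx.fsx_projid = 42;			/* hypothetical project ID */
	fsx.fsx_xflags |= FS_XFLAG_PROJINHERIT;	/* new inodes inherit the ID */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		return 1;
	return close(fd);
}

The quota limits for the project are then set with xfs_quota(8), and the filesystem has to be mounted with project quota accounting enabled (the prjquota mount option).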
Project quota is accounted for when moving into an accounted directory tree, but not when moving out of a directory tree into an unaccounted location. Moreover, one can create hard links to an accounted file in an uncontrolled destination (as the inode is still accounted). But it is not allowed to link from an accounted directory into a destination with a different project ID.
Project IDs may be mapped to names through the
/etc/projid
and /etc/projects
configuration
files.
As files are being written, xfs allocates extra blocks beyond the current end of file, anticipating that further writes will arrive to extend the file. The preallocation size is dynamic and depends mainly on the size of the file. When the file is closed, the unneeded extra blocks are reclaimed.
The speculatively preallocated post-EOF blocks help to minimize file fragmentation, but they can cause confusion because they are accounted identically to other blocks, making files appear to use more data blocks than expected.
If the system crashes while preallocated post-EOF blocks exist, the space is recovered by the normal reclaim mechanism the next time the affected file is opened and then closed again.
The reverse mapping feature was implemented in 2018. It adds yet another B-tree to the xfs on-disk data structures: the reverse mapping tree, which allows the filesystem to look up the owner of a given block (if any). For example, if the underlying storage device reports that a certain block went bad, and that block happens to contain contents of a regular file, the reverse mapping tree yields the corresponding inode and the file offset.
Another use of the reverse mapping tree is filesystem scrubbing, a data integrity technique where a kernel thread runs in the background to check the on-disk data structures for consistency while the filesystem is mounted.
Since reverse mapping imposes some performance overhead, the feature is disabled by default.
The number of allocation groups (AGs) of an xfs filesystem is determined when the filesystem is created with mkfs(8). mkfs.xfs(8) tries to figure out the characteristics of the block device and pick a suitable number of AGs. However, in some cases, for example if the filesystem is stored on a hardware raid array or on a bcache device, it can not detect the geometry of the underlying storage. Explicitly specifying the number of AGs is probably a good idea in this case. As a rule of thumb, 32 AGs should be enough for most cases.
The number of AGs can not be changed after the filesystem has been created. However, when an xfs filesystem is grown, new AGs are added, and care should be taken to align the device size to a multiple of the AG size. Generally, one should not increase the filesystem many times in small steps.
mkfs.xfs(8) tries hard to choose reasonable defaults, but for hardware raid arrays, the command has no way to tell the raid level and the number of data disks in the array. In this case it is beneficial to specify the sunit and swidth options to mkfs.xfs(8) explicitly.
- Compare the output of df -h $P and df -hi $P, where P is a path on which project quotas are enforced.
- Create hard links between accounted and unaccounted directories with the ln command and discuss the result.
- Run xfs_bmap -v file for a large file on an xfs filesystem to see the extents of the file.
- Run xfs_logprint -t to dump the log records of a busy xfs filesystem.
- Run xfs_info(8) on an xfs mountpoint and examine the values shown in the data section.
- Run xfs_info(8) on an existing xfs filesystem and determine whether the device size is an integer multiple of the AG size. Discuss the relevance of this property.
- Create an xfs filesystem on a logical volume so that the device size is an integer multiple of the AG size. Figure out suitable lvcreate and mkfs.xfs(8) commands to achieve this. Verify the result with xfs_info(8) and df(1).
The nfs protocols are built on top of a concept called remote
procedure call (rpc), which is based on an encoding format known
as external data representation (xdr). The rpcs which are
provided by the nfs server are closely related to filesystem-specific
system calls like read(2), write(2), link(2), rename(2),
mkdir(2)
etc. Therefore an introduction to nfs naturally starts
with rpc and xdr.
The functionality of a network service can often be divided into
the low-level part and the application-level part. The low-level
part talks to the kernel to establish the connection and to send
and receive data, using system calls like socket(2), bind(2),
listen(2), connect(2), recv(2), send(2)
, etc. This part is
independent of the application layer which is only concerned with
the network protocol of the service. For a service like nfs which
combines more than one network protocol, it makes sense to abstract
out the common low-level part. The rpc framework was designed in 1976
to provide such an abstraction. It supports a variety of transports
including tcp and udp. With rpc, a program running on one computer can
execute a function on a different computer. The functions that can
be called in this manner, the rpc services, are identified
by a program number and the version number. Originally developed by
Sun Microsystems in the 1980s, rpc is still in use today, sometimes
still under the old "sunrpc" name.
In general, the called procedure runs on a different system than the calling procedure, so the client and server processes don't share the same address space, and no memory references can be passed. Instead, data structures must be serialized first, i.e., converted to a certain transfer format that can be stored in a single memory buffer, the xdr buffer, which is then sent to the server. The received xdr buffer is de-serialized (decoded) by the server, possibly in a different way. For example, the server might store the bytes which describe an integer value in a different order than the client to meet the requirements of its CPU (little/big endian). The xdr API offers routines to convert many predefined data types (int, string, etc.) to an xdr buffer or vice versa. This relieves the programmer of such details as much as possible.
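A minimal sketch of xdr encoding with the classic sunrpc API (recent glibc versions have dropped the built-in sunrpc code, so this may require libtirpc, e.g. compiling against its headers and linking with -ltirpc):

#include <rpc/rpc.h>
#include <stdio.h>

int main(void)
{
	char buf[128];
	XDR xdrs;
	int answer = 42;
	char name[] = "nfs4";
	char *np = name;

	/* encode an int and a string into the xdr buffer */
	xdrmem_create(&xdrs, buf, sizeof(buf), XDR_ENCODE);
	if (!xdr_int(&xdrs, &answer) || !xdr_string(&xdrs, &np, sizeof(name)))
		return 1;
	/* buf now holds the serialized (big-endian, padded) representation */
	printf("encoded %u bytes\n", xdr_getpos(&xdrs));
	return 0;
}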
To activate rpc on a system, the rpcbind(8)
daemon
(formerly known as portmapper) must be running. This daemon keeps track of the rpc services registered on the system, such as the mount protocol, the lock manager, the quota daemon, and the nfs service itself. It communicates with
rpc clients by listening on a well-known port. Clients first send a
get_port
request to rpcbind(8)
in order to
find out the port number which corresponds to the procedure they are
interested in. For example, an nfs client which intends to mount an
nfs-exported directory requests the port number of the mount procedure
from rpcbind(8)
. A second request is then made to actually
mount the filesystem. The exercises of this section ask the reader to
run the rpcinfo(8)
tool to show the available procedures
and their port numbers on the specified server.
The input format for rpc is the rpc language (rpcl),
which is similar to C. This format fully describes the application
protocol, including all procedures and data types. RFC 7531 contains
the full xdr description of nfs4 in rpcl. The rpcgen(1)
protocol compiler generates C code from rpcl input. The C code is
then compiled to generate application code which implements the
protocol described by the input. rpcgen(1)
offers
multiple application interfaces which provide different degrees of
control over the rpc internals.
The nfs protocol versions 2 and 3 are stateless, which means that by design the server does not keep track of what clients do. For example, the server does not remember which files are currently open. Instead, the client tracks open files and the current offset of each open file, translating application requests into suitable protocol messages. While statelessness simplifies crash recovery for both the client and the server, it also has downsides. For example, file locking requires the server to maintain the existing locks and the list of clients which are holding them. Since this can not be done with a stateless protocol, another rpc service, the lock daemon (lockd), was added. To recover the state of locks after a server reboot, yet another rpc service, the status daemon (statd), had to be introduced. This design added complexity for no real gain, which is why nfs4 departed from the previous versions by introducing state. With a stateful protocol it became possible to combine all related rpc services into a single service which uses a single TCP port. This simplifies the implementation and also allows for compound operations where the client sends more than one request in a single rpc call.
File handles describe the file or directory a particular operation is going to operate upon. For the nfs clients, file handles are opaque blobs that can only be tested for equality, but which can not be interpreted in any way. However, for the nfs server a file handle identifies the corresponding file or directory. Most protocol requests include a file handle. For example, the LOOKUP and MKDIR operations both return a file handle to the nfs client.
A file handle consists of three identifiers: a filesystem ID,
an inode number and the so-called generation number. The
filesystem ID and the inode number also exist for local files. They
are derived from the statvfs
structure that describes
the exported filesystem and the stat
structure of the
inode, respectively. The generation number, however, is only needed
for network file systems. Roughly speaking, the generation number
counts how many times the inode has been re-used. This is necessary to
prevent clients from accessing a file through an existing file handle
if the file was deleted on the server and its inode number has been
re-used subsequently. File handles are based on leases:
The client periodically talks to the server to update its leases.
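On Linux, user space can obtain a handle of the same opaque flavor for a local path with the name_to_handle_at(2) system call; the path below is just an example, and re-opening by handle later requires CAP_DAC_READ_SEARCH.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	int mount_id, i;

	fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(AT_FDCWD, "/etc/fstab", fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}
	printf("type %d, %u bytes:", fh->handle_type, fh->handle_bytes);
	for (i = 0; i < (int)fh->handle_bytes; i++)
		printf(" %02x", fh->f_handle[i]);	/* opaque handle contents */
	printf("\n");
	return 0;
}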
There is a deep interaction between file handles and the dentry cache of the vfs. Without nfs, a filesystem can rely on the following "closure" property: for any positive dentry, all its parent directories are also positive dentries. This is no longer true if a filesystem is exported. Therefore the filesystem maps file handles sent to nfs clients to disconnected dentries. For local files, any process whose cwd is on the filesystem contributes to the reference counter of the dentry that corresponds to the directory, and thus prevents the filesystem from being unmounted. For nfs, this is not possible. More generally, remote applications need a way to refer to a particular dentry that is stable across renames, truncations, and server reboots.
Several rpcs return file attributes, i.e., the inode
information which is available for local filesystems through the
stat(2)
system call. For example, the LOOKUP
rpc returns a new file handle and the attributes that are associated
with the given path, and the GETATTR
rpc returns the
current attributes of the file which corresponds to an existing
file handle. By default, nfs clients cache these attributes. However,
since metadata can change at any time due to file operations from other
clients, the cached information can become stale. Therefore attributes
are cached for only a few seconds, which is thus the duration of the
time window during which metadata modifications caused on a different
nfs client remain undetected. Reducing this time window can result in
flooding the server with GETATTR
requests while extending
it increases the chance of returning stale cached data or metadata
to the application. With the noac
mount option, the
client asks the server every time it needs to access file metadata.
However, the option also prohibits data caching, just like
the sync
option. This severely impacts performance.
Changes to directories are handled similarly. To detect when directory entries have been added or removed on the server, the client watches the directory mtime (nfsv2 and nfsv3) or change attribute (nfsv4). When the client detects a change, it drops all cached attributes for that directory. Since the directory's mtime and the change attributes are cached attributes, it may take some time before a client notices directory changes.
nfs clients are usually allowed to cache write operations because the write caches increase client performance significantly and reduce the load of the server at the same time, allowing it to support more clients. However, one side effect of write caching is that other clients which access the same file at the same time will not see the changes immediately. The consistency guarantees of a network file system describe the semantics of such concurrent file operations.
The nfs versions 2 and 3 provide weak cache consistency which notifies clients about changes made by other clients before and after an rpc. This concept turned out to be problematic, so nfsv4 replaced weak cache consistency by close-to-open cache consistency, which means that an nfs client is only guaranteed to see the effects of another client's write operation if it opens the file after the client that wrote to the file has closed it.
To illustrate close-to-open cache consistency, consider a scenario where three nfs clients access the same file. The first client opens the file and writes to it while the other two clients only perform read operations. With close-to-open cache consistency, a client which opens the file after the writer has closed it is guaranteed to see the write operation, while a client whose open happened before the writer closed the file has no such guarantee.
nfs4 introduced a per-file state management feature called file delegation. Once a file has been delegated to a client, the server blocks write access to the file for other nfs clients and for local processes. Therefore the client may assume that the file does not change unexpectedly. This cache-coherency guarantee can improve performance because the client may cache all write operations for this file, and only contact the server when memory pressure forces the client to free memory by writing back file contents.
A drawback of file delegations is that they delay conflicting open
requests by other clients because existing delegations must be recalled
before the open request completes. This is particularly important if
an nfs client which is holding a delegation gets disconnected from
the network. To detect this condition, clients report to the server
that they are still alive by periodically sending a RENEW
request. If no such request has arrived for the lease time
(typically 90 seconds), the server may recall any delegations it
has granted to the disconnected client. This allows accesses from
other clients that would normally be prevented because of the
delegation.
However, the server is not obliged to recall uncontested delegations for clients whose lease period has expired. In fact, newer Linux NFS server implementations retain the uncontested state of unresponsive clients for up to 24 hours. This so-called courteous server feature was introduced in Linux-5.19 (released in 2022).
Let us finally remark that the delegations as discussed above work only for regular files. NFS versions up to and including 4.0 do not grant delegations for directories. With nfs4.1 an nfs client may ask the server to be notified whenever changes are made to the directory by another client. Among other benefits, this feature allows for strong directory cache coherency. However, as of 2022, directory delegations are not yet implemented by Linux.
Many applications employ the following old trick to store temporary
data without leaving a stale temp file behind in case the process
crashes or is killed with SIGKILL
. They create and open
a temporary file, then call unlink(2)
to disassociate
the path from the filesystem tree while retaining the file descriptor
for subsequent I/O operations.
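A minimal sketch of this trick (the path is made up and error handling is omitted); on an nfs mount, the unlink(2) below is what triggers the silly rename described next.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[16] = "";
	int fd = open("/tmp/scratch.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

	unlink("/tmp/scratch.tmp");	/* path is gone, data blocks are not */
	write(fd, "temporary data", 14);
	pread(fd, buf, sizeof(buf) - 1, 0);
	printf("read back: %s\n", buf);	/* still works on the unlinked file */
	return 0;			/* blocks are freed on close/exit */
}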
With NFS this does not work because the file descriptor exists
only on the client, and the server doesn't know about it. Consequently
the normal unlink(2)
call on the server would delete
the file and free its data blocks. This is why the nfs client just
renames the file to something like .nfs12345
if an application calls unlink(2)
to remove it while it
is still open. Only after the last file descriptor that refers to the silly-renamed file has been closed does the client remove the
file by issuing an appropriate rpc.
This approach is not perfect. For one, if the client crashes, a
stale .nfs12345
file remains on the server. Second, since
silly renames are only known to the nfs client, bad things happen if a
different client removes the file. Finally, if an application running
on a client removes the last regular file in a directory, and this
file got silly-renamed because it was still held open, a subsequent
rmdir
will fail unexpectedly with Directory not
empty
. Version 4.1 of the NFS protocol finally got rid of
silly renames: an NFS4.1 server knows when it is safe to unlink a
file and communicates this information to the client.
The file handle which an nfs client received through some earlier
rpc can become invalid at any time due to operations on different
hosts. This happens, for example, if the file was deleted on the server
or on a different nfs client, or when the directory that contains
the file is no longer exported by the server due to a configuration
change. Subsequent attempts to use this file handle for rpcs then
fail with the ESTALE
error.
The exercises below ask the reader to provoke silly-renamed files and stale file handles.
The nfs client supports many mount options; see nfs(5) for details. We only cover a couple of the more interesting ones with respect to performance.

- soft/hard. hard: nfs requests are retried indefinitely. soft: requests are failed eventually.
- sync/async. async: the nfs client delays sending application writes to the server. sync: writes cause data to be flushed to the server before the system call returns.

- Run df -h and mount -t nfs -t nfs4 to see the nfs filesystems which are mounted.
- Run /usr/sbin/rpcinfo -b 100003 3 | awk '{print $2}' | sort | uniq to list all NFS servers.
- Discuss the difference between the sync mount option for nfs and the sync export option of nfsd.
- Run /sbin/showmount -e and rpcinfo against an nfs server.
- Run collectl -s F -i 5 and discuss the output.
- On an nfs mount, run cat > foo &. Note that the cat process automatically receives the STOP signal. Run rm foo; ls -ltra. Read section D2 of the nfs HOWTO for the explanation.
- On an nfs mount, run { while :; do echo; sleep 1; done; } > baz &. What happens if you remove the file on a different nfs client?
- Discuss how an unreachable nfs server affects the load average (/proc/loadavg). Figure out how to read the values of /proc/loadavg as a number of suitable type.
- Explain the purpose of the nfsd and the rpc_pipefs pseudo filesystems. Hint: See Documentation/filesystems/nfs/nfs.txt of the Linux source code.
fd = open("foo.new", ...); write(fd, "new content", ...); close(fd); rename("foo.new", "foo");
See also Documentation/filesystems/vfs.txt of the Linux kernel source.