Filesystems

"Happy filesystems are all alike, but every corrupted filesystem is unhappy in its own way." -- Jonathan Corbet (2011)
Every Unix system has at least one filesystem which contains the root of the tree of files. This filesystem, the root filesystem, is normally stored on a local block device, for example on an SSD. On top of that, several additional filesystems of different types are usually mounted. For example, most Linux distributions mount the proc and sys pseudo filesystems and several instances of tmpfs. Other filesystems may be mounted as needed.
The Linux kernel supports several dozen different filesystems, and new ones are added frequently. Some filesystems are only employed on special-purpose systems while others are in use on almost all systems. The /proc/filesystems pseudo file contains the list of supported filesystems. We don't aim to provide a full listing but classify filesystems as belonging to exactly one of the following categories.
Regardless of the category, most native Unix filesystems support the semantics prescribed by the POSIX.1-2008 standard. These include the open(2), read(2), write(2) and stat(2) system calls, and many more. In particular, a POSIX filesystem must store in each file's metadata the user and group ID of the owner and the usual three timestamps. This is not possible for "foreign" filesystems like Microsoft's FAT, which was designed for the single-user DOS operating system and thus has no concept of file ownership, but which is supported for compatibility reasons.
Early Unix systems already supported user and group quotas, while directory and project quotas are comparatively new concepts. User and group quotas impose an upper bound on the files owned by a specific user or Unix group, respectively. Directory quotas restrict the size of an accounted directory. Project quotas are similar to directory quotas but are not restricted to single directories. They are realized as an aggregation of unrelated inodes with a specific identifier, the project ID, which is stored in each accounted inode. It is possible for arbitrary files to have the same project ID and hence be accounted to the same project. The project ID is independent of the UID and the GID, hence project accounting is independent of user and group accounting.
For each quota type there are two configurable limits: the inode limit, which imposes a bound on the number of files and directories owned by a specific user/group/project ID, and the block limit, which bounds the space that the same set of files is permitted to occupy. Each limit is in fact a pair of limits: the soft limit and the hard limit. When utilization reaches the soft limit, a message is sent but no further action is taken. Only when the hard limit is reached do subsequent attempts to request more resources fail with the EDQUOT ("Disk quota exceeded") error.
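The following minimal sketch shows what hitting the hard block limit looks like from user space. The path is made up; the program simply writes until the filesystem reports EDQUOT.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/home/joe/fill", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	memset(buf, 0, sizeof(buf));
	for (;;) {
		if (write(fd, buf, sizeof(buf)) < 0) {
			if (errno == EDQUOT)	/* hard limit reached */
				printf("Disk quota exceeded\n");
			else
				perror("write");
			break;
		}
	}
	return close(fd);
}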
The superblock describes the geometry of the filesystem. Only one superblock is needed for normal operation, but most local filesystems store backup copies of the superblock. These extra superblocks are only accessed during recovery from filesystem corruption, for example if the main superblock was overwritten by accident. Although the details vary between filesystem types, the superblock typically stores global information such as the block size, the total and free numbers of blocks and inodes, the UUID and label, and flags which describe the filesystem state and the enabled features.
The blkid library, libblkid, contains a database of filesystems and of other software which stores its metadata in a specific superblock of a block device, including swap devices and physical volumes for software raid or the logical volume manager. The library enables applications to identify the contents of a block device and to extract additional information like the UUID. A number of tools are linked against this library to examine or modify superblocks: lsblk(8) to list block devices, blkid(8) to print block device properties, and wipefs(8) to overwrite (or merely print) all superblocks of a given block device. Also the mount(8) executable is usually linked against libblkid to support mounting by UUID instead of device name, because device names like /dev/sda might change across reboots.
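As a rough sketch of how an application might use libblkid (the device name below is an arbitrary example, and the program must be linked with -lblkid):

#include <blkid/blkid.h>
#include <stdio.h>

int main(void)
{
	const char *dev = "/dev/sda1";	/* arbitrary example device */
	blkid_cache cache;
	char *uuid, *type;

	if (blkid_get_cache(&cache, NULL) < 0)
		return 1;
	uuid = blkid_get_tag_value(cache, "UUID", dev);
	type = blkid_get_tag_value(cache, "TYPE", dev);
	printf("%s: UUID=%s TYPE=%s\n", dev, uuid ? uuid : "?", type ? type : "?");
	return 0;
}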
Most filesystems, including the ext4 and xfs filesystems described in dedicated sections, employ some B-tree variant to manage their data blocks. This is reason enough to take a closer look at the key features of this ubiquitous data structure. We won't go into detail, though.
B-trees were invented in the 1970s by Rudolf Bayer and Ed McCreight as a data structure for an algorithm that can quickly access a random block in a particular file stored on a rotating disk. The parameters can be tuned to minimize the number of disk operations performed, i.e., the height of the tree.
It is unclear what the "B" in "B-tree" actually means. It certainly does not mean "binary", because B-trees typically have a high fanout, so nodes have way more than two children (which, by definition, is the maximal number of child nodes for a binary tree). The typical fanout values for filesystems range from 100 to 1000.
B-trees impose a fixed lower bound on the number of child nodes, and an upper bound that is twice as large as the lower bound. The upper bound is called the order of the tree. These bounds imply an upper bound for the maximal height of the tree, given the number of leaf nodes. Hence a B-tree is always well balanced, lookup times are always optimal (i.e., logarithmic), and storage utilization is at least 50%. Unlike a hash table approach there is no decrease in performance when utilization approaches 100%.
Addition and removal of nodes is performed in a way that keeps the tree balanced. For example, if a node is being removed and this results in a violation of the lower bound on the number of child nodes of the parent node, nodes are moved between siblings, or siblings are merged.
A node with k children always has k - 1 keys. For filesystems, the keys can be block numbers, hashes, directory entries, or the size of block ranges. In any case, the keys act as separators which divide the child nodes. For example, if the node contains 3 child nodes and the two keys 23 and 42, then the left child node contains keys less than 23, the middle child node contains keys between 23 and 42, and the right child node contains keys greater than 42.
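A minimal sketch (the node layout and the names are made up for illustration, not taken from any real filesystem) of how the keys of a node act as separators during a lookup:

#include <stddef.h>

struct btree_node {
	int nr_keys;			/* a node with k children has k - 1 keys */
	unsigned long keys[255];	/* separators, sorted in ascending order */
	struct btree_node *child[256];	/* child pointers, NULL in leaf nodes */
};

/* Return the child that may contain "key": child i holds the keys between
 * keys[i - 1] and keys[i]. */
static struct btree_node *descend(const struct btree_node *node, unsigned long key)
{
	int i = 0;

	while (i < node->nr_keys && key >= node->keys[i])
		i++;
	return node->child[i];
}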
Many filesystems, including xfs, do not use the classic B-tree structure outlined above but its variant called B+tree. The main differences between a B-tree and a B+tree are (a) data records are stored only in leaf nodes, and (b) leaves are linked together so that all data records can be traversed in-order by following sibling pointers. This little detail has far-reaching consequences. Roughly speaking, the sibling pointers prohibit copy on write for metadata blocks.
Single filesystem operations often need to update multiple blocks. For example, adding a new file to a directory requires three block updates: the data block which contains the new file's contents, the metadata block which contains the directory entries, and the on-disk structures that manage the free blocks. Regardless of the order in which these updates are performed, an unclean shutdown due to a power outage or a system crash leads to a corrupt filesystem if the shutdown happens after the first but before the third update was performed. Journaling is a capability which avoids this situation by making multiple block updates atomic, thereby ensuring consistency after an unclean shutdown.
One of the first journaling filesystems was jfs, introduced in 1990 by IBM for the AIX operating system. The first journaling filesystem for Linux was reiserfs version 3, which was included in the Linux kernel in 2001. In the same year Linux gained support for three additional journaling filesystems: jfs and xfs were ported to Linux from AIX and IRIX, respectively, and the first stable version of ext3 was released.
Journaling filesystems retain consistency after an unclean shutdown by keeping a journal (also known as log) of operations being performed. The journal is either stored on a separate device or in a reserved area within the filesystem. The entries of the journal, the log records, describe transactions, which are filesystem operations that must be performed atomically. At the next mount after an unclean shutdown, the journal is replayed, that is, the recorded transactions are reapplied. The time required to replay the journal depends only on the size of the journal and the number of log records, but not on the size of the filesystem or the number of files. It usually takes only a couple of seconds, which is considerably faster than a filesystem check/repair run, which can take hours.
Although a journaling filesystem writes metadata twice, this can actually increase performance because metadata writes to the journal are sequential, and when the log entries are committed, writes can be combined and reordered, which is a win not only for rotating disks. Since data integrity is usually less important than filesystem integrity, only metadata (inodes, directory contents) is journaled by default while data blocks (file contents) are written directly.
Delayed allocation refers to the technique of deferring the decision of which blocks to allocate until the last possible moment, when blocks have to be written out due to memory pressure or an explicit sync request from user space. This technique is employed by xfs and ext4 but not by earlier versions of the ext* family.
Delayed allocation improves the performance of the filesystem because the allocator has better knowledge of the eventual file size, so files are more likely to be laid out in an optimal way: data blocks sequentially, and close to the metadata blocks that describe the file. Moreover, small temporary files can be fully buffered in memory and don't cause any block allocation at all if the in-memory data structures have already been removed when writeout triggers. Delayed allocation is more effective on large memory systems because with plenty of memory available, allocations can be deferred for longer.
- Run cat /proc/filesystems to see all filesystems supported on this system.
- Run mount | awk '{print $5}' | sort | uniq to examine the different filesystem types which are mounted. Explain the purpose of each and determine to which of the above classes the filesystem belongs.
- Compare the contents of /etc/mtab and /proc/filesystems. Check whether /etc/mtab is a symbolic link. Discuss whether it should be one.
- Explain what the sync(2), fsync(2) and fdatasync(2) system calls guarantee.
- Consider the command sequence touch foo; echo bar > foo; sync; mv foo bar; echo baz > foo; fsync foo. Assume the system crashed at some point in between. Describe the possible states of the two files after the journal has been replayed.

Although our focus is on topics related to journaling filesystems, we shall have a brief look at two different (but related) approaches which also guarantee consistency after a crash. Both approaches depart from the traditional way of updating data and metadata structures. To illustrate the difference, consider the situation where a line is appended to an existing text file on a traditional filesystem. The filesystem first updates the data blocks which contain the file contents. Since the file size and the modification time have changed, the inode of the file needs to be updated as well. Finally, the on-disk data structures which keep track of the used and unused blocks might also need to be updated if a new block had to be allocated for the additional line. All three updates are usually performed in-place by overwriting the existing blocks. The filesystems described in this section are different in that they avoid such in-place changes.
A log-structured filesystem only writes sequential log entries which describe the changes that have been made, essentially treating the entire space as the journal. The first log-structured filesystem was developed in the 1990s. Log-structured filesystems in use today include ubifs (the unsorted block image filesystem) and f2fs (the flash-friendly filesystem); an earlier example, logfs, has since been removed from the Linux kernel.
The layout of the data and metadata blocks of a log-structured filesystem has advantages and disadvantages. One advantage is that writes are always sequential, which is particularly good for rotating disks. However, reading a large file sequentially becomes slower because of data fragmentation. The purely sequential writes also reduce wear on the storage media, particularly flash memory, because without in-place writes, all blocks are written about the same number of times. Other advantages of log-structured filesystems are that crash recovery is conceptually simple and that snapshots come naturally.
The main disadvantage is the overhead incurred by garbage collection: getting rid of old log entries that have been superseded by later ones. This overhead is particularly large if free space is short. Another disadvantage is related to inode management: since inodes are scattered throughout the disk, and the location of an inode changes whenever the file is updated, some kind of inode map is required to locate inodes. Updates to the inode map can be written to the log sequentially, but the current location of the inode map must be stored in a special fixed location called the checkpoint region so that the filesystem can recover from a crash. Since every write to the checkpoint region causes a seek, log-structured filesystems update the inode map only once in a while, for example once per minute. This process is known as checkpointing.
In the context of filesystems, the term copy on write
(CoW) means to not overwrite existing data or metadata blocks as
files are modified. Instead, whenever the filesystem needs to modify
a data or metadata block, it writes the modified block to a different,
currently unused location, leaving the contents of the original
block intact. Next, the metadata blocks that need to be altered
are also written to a free location without overwriting existing
metadata blocks. For example, in the scenario outlined above where the
write(2)
system call extends an existing file, the three
updates for data, inode and free space information are performed as
writes to unused locations. Next, the filesystem switches to the new
metadata to commit the change. CoW filesystems are always consistent
if this last step can be performed atomically.
Two well-known open source CoW filesystems are zfs and btrfs (the B-tree filesystem). The first stable release of zfs for the Solaris operating system appeared in 2005, after four years of development by Sun Microsystems. Since then zfs has been ported to several other operating systems, including FreeBSD and Linux. However, the zfs code is licensed under the CDDL, which is regarded as incompatible with the GPL. For this reason, zfs is not included in the mainline Linux kernel but is available as a third-party kernel module. Btrfs is another CoW filesystem which was designed for Linux from the start. It was merged into the Linux kernel in 2009. The feature sets of zfs and btrfs are similar, but the implementation details vary.
CoW filesystems also have disadvantages. On a system where multiple processes append data to different files simultaneously, the data blocks of each file will be fragmented, which is bad for performance. For the same reason, metadata blocks are fragmented as well, so performance suffers if the filesystem contains many files. Another disadvantage is related to the fact that it is difficult to tell how much space is going to be needed for a given CoW operation, which has caused an endless list of bugs that occur when disk space gets tight. This is why CoW filesystems should never use more than a certain ratio of the available space. For zfs the recommended limit is as low as 70%.
Search the web for "blatant layering violation" and glance over the (sometimes heated) discussions on whether or not it is a good idea to combine volume management, raid and filesystems. Then form your own opinion about the topic.
The dm-crypt device mapper target, which was covered in the chapter on LVM, operates at the block level. It encrypts and decrypts one block at a time, regardless of whether the block is in use. This is in contrast to filesystem-level encryption, where encryption is performed by the filesystem on a per inode basis so that, for example, different files can be encrypted with different keys.
In this section we look at two filesystem-level encryption primitives for Linux: ecryptfs (the enterprise cryptographic filesystem) and fscrypt (filesystem encryption). The former was included in Linux in 2006 while the latter is much newer. It was originally part of the flash-friendly filesystem (f2fs) but was made generic in 2015. Besides f2fs, ext4 and ubifs also rely on fscrypt for encryption.
By definition, filesystem-level encryption means to encrypt the
contents of regular files. In addition to that, both ecryptfs and
fscrypt also encrypt file names. However, information stored in the
inode is left unencrypted. Therefore, without the encryption key it is
still possible to list directories as usual, but the list will contain
only encrypted filenames. Moreover, file size and timestamps can be
read as for unencrypted files with the standard stat(2)
system call.
ecryptfs is a so-called stacked filesystem. That is, it relies on an (arbitrary) mounted filesystem as backend storage. An ecryptfs mount is similar to a bind mount in that it makes the files stored at the source location (the mountpoint of the backend storage) visible at the target location. However, the source usually contains only encrypted files while files appear unencrypted at the target location. Each encrypted file is self-contained in the sense that it starts with a header which, together with the encryption key, is sufficient to decrypt the file (and the file name). Hence encrypted files can be copied between hosts, and encrypted files can be backed up without telling the backup software the encryption key.
fscrypt takes a different approach. It provides encryption
through a general library that can, in principle, be used by any
filesystem. With fscrypt it is possible to store both encrypted and
unencrypted files on the same filesystem. fscrypt has a lower memory
footprint than ecryptfs since it avoids caching filesystem contents
twice. Also, only half as many directory entries and inodes are
needed. Another advantage of fscrypt is that the fscrypt API can be
used by unprivileged users, with no need to mount a second filesystem.
The major drawback of fscrypt is that the open(2) system call fails without the key. Since backup software has to open regular files, it is not possible to back up encrypted files without the encryption key.
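A small sketch of this asymmetry, assuming a hypothetical path inside an fscrypt-protected directory whose key has not been added to the kernel: stat(2) succeeds and reports size and timestamps, while open(2) fails, typically with ENOKEY.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
	/* hypothetical encrypted file name as seen without the key */
	const char *path = "/mnt/encrypted/5C2dv0lqticUUVty1ewlHw";
	struct stat st;

	if (stat(path, &st) == 0)	/* works without the key */
		printf("size: %lld, mtime: %lld\n",
		       (long long)st.st_size, (long long)st.st_mtime);
	if (open(path, O_RDONLY) < 0)	/* fails without the key */
		printf("open: %s\n", strerror(errno));
	return 0;
}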
The main task of the vfs is to provide an abstraction for applications to access files in a uniform way, even if the files are stored on different filesystems. The vfs is responsible for parsing path names received from user space via system calls and for forwarding the requests it can not handle itself to the specific filesystem implementation, thereby associating paths with instances of mounted filesystems. This encourages a modular filesystem design where filesystems are opaque entities which provide a certain set of methods called filesystem operations, for example mounting the filesystem or opening a file. The modular design helps to avoid code duplication, which is important for operating systems like Linux which support many different filesystems.
The first vfs implementation was probably the one shipped in Sun's SunOS in 1985. Linux got its first vfs implementation together with the extended filesystem, the predecessor of ext2, ext3 and ext4. All modern operating systems have some sort of vfs, although implementation details differ. In what follows, we shall only discuss the Linux vfs.
Filesystems register themselves with the vfs at boot time or when the kernel module for the filesystem is loaded. The vfs keeps track of the available filesystem types and all mounts. To perform efficiently, the vfs maintains several data structures which describe the characteristics of the tree of files. We look at the most important data structures below but leave out the rather complicated details about how the various locking primitives (spinlocks, refcounts, RCU) are employed to deal with concurrency.
A dentry (short for "directory entry") is a data structure which represents a file or a directory. Dentries contain pointers to the corresponding inode and to the parent dentry. The vfs maintains the dentry cache, which is independent of the normal page cache that keeps copies of file contents in memory. Dentries are kept in hashed lists to make directory lookups fast. Dentries are also reference-counted. As long as there is a reference on a dentry, it can not be pruned from the dentry cache. Unreferenced dentries, however, can be evicted from the cache at any time due to memory pressure. Each dentry also has a "looked up" flag which lets the vfs evict dentries that have never been looked up before those that have.
On a busy system the dentry cache changes frequently. For example, file creation, removal and rename all trigger an update of the dentry cache. Clearly, some sort of coordination is needed to keep the dentry cache consistent in view of concurrent changes, like a file being deleted on one CPU and looked up on another. A global lock would scale very poorly, so a more sophisticated method called RCU-walk is employed. With RCU, lookups can be performed without taking locks, and read operations can proceed in parallel with concurrent writers.
The dentry cache also contains negative entries
which represent nonexistent paths which were recently looked up
unsuccessfully. When a user space program tries to access such a path
again, the ENOENT
error can be returned without involving
the filesystem. Since lookups of nonexistent files happen frequently,
failing such lookups quickly enhances performance. For example
import
statements from interpreted languages like Python
benefit from the negative entries of the dentry cache because the
requested files have to be looked up in several directories. Naturally,
negative dentries do not point to any inode.
Positive entries in the dentry cache point to inode
objects, which are in-memory copies of the on-disk inode
structures maintained by the filesystem. Different dentry cache entries
can map to the same inode object if the underlying filesystem supports
hard links, but entries which refer to directories are unique. The
stat(2)
system call can be served without calling into
the filesystem if the path argument of the system call corresponds
to an entry of the dentry cache.
When a file is opened, the vfs allocates a file object
(also known as file description and struct file),
and adds a reference to the file object to the calling process' table
of open files. The index to this table is returned as the file
descriptor from the open(2)
system call. The file
object contains a reference to the dentry and to the filesystem
specific methods for the usual operations like read(2)
and write(2)
. Also the file offset and the file status
flags (O_NONBLOCK, O_SYNC, etc.
) are recorded in the
file object.
Like dentries, file objects are reference-counted. Once the
counter hits zero, the file object is freed. System calls like
dup(2)
and fork(2)
increase the reference
count while close(2)
and exit(2)
decrease
it. However, not all file object references correspond to file descriptors, since the kernel itself can also hold references to a
file object. For example, the loop device driver and the ecryptfs
stacked filesystem increase the reference counter of the file objects
they work with. Passing a file descriptor from one process to another
via local sockets is another situation where the reference counters
of the affected file object need to be adjusted.
When an application calls dup(2)
to duplicate a
file descriptor, the two copies refer to the same file object and
therefore share the file offset and the status flags of the file
object. An open(2)
call, however, creates a new file
object even if the file is already open.
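A minimal sketch (the file name is made up) of the difference: descriptors created with dup(2) share one file object and hence one file offset, while a second open(2) of the same path gets its own.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd1 = open("/tmp/offset-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
	int fd2 = dup(fd1);				/* same file object */
	int fd3 = open("/tmp/offset-demo", O_RDONLY);	/* new file object */

	write(fd1, "hello", 5);
	printf("fd2 offset: %lld\n", (long long)lseek(fd2, 0, SEEK_CUR)); /* 5 */
	printf("fd3 offset: %lld\n", (long long)lseek(fd3, 0, SEEK_CUR)); /* 0 */
	return 0;
}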
Another job of the vfs is to keep track of the tree of all mounts and their properties. To do so, the vfs maintains a tree of mount structures. Whenever a filesystem is mounted, one such structure is allocated and linked into the tree. Among other information, the mount structure contains various pointers, including pointers to the parent mount, to the dentry of the mountpoint, and to the superblock of the mounted filesystem, as well as the per-mount flags (MS_RDONLY, MS_NOEXEC, MS_NOATIME, etc.). Since a filesystem can be bind-mounted, there can be several mount structures whose superblock pointers point to the same superblock. The superblock structure contains the UUID, quota information, granularity of timestamps, and much more.
- Run strace -e%file ls to see the (negative) dentry cache in action.
- Guess which system calls the df(1) command needs to perform to do its work. Confirm by running strace df.
- The mountpoint(1) command determines if a given directory is the mountpoint of a filesystem. Explain how the command is implemented.
- Suppose a process opens a file with open(2), then forks so that the child process inherits the file descriptor. Assume further that the child calls lseek(2) to reposition the file offset. Will the parent process see the modified offset?
- What happens if you run fsck(8) on a filesystem which has been lazily unmounted (by executing umount -l)?
- What are the benefits of mounting by UUID rather than by device name in /etc/fstab? Figure out how to print the UUID of an ext4 and an xfs filesystem.
- If a filesystem is copied to another block device with dd(8), the two UUIDs match. The same is true if the block device was snapshotted with LVM (regardless of whether thin or regular snapshots were used). Discuss the consequences of this fact.
- The umount -a command unmounts all filesystems which are not busy. Describe an algorithm which walks the mount tree to implement the command, taking into account that a mount is busy if it has at least one child mount.

When the first versions of Linux were released in 1991, the kernel relied on the minix filesystem, whose source code could be used freely in education. However, this filesystem was designed for education rather than for real use and had severe limitations. For example, it supported only file names up to 14 characters long, and a total size of 64M.
In 1992 the extended filesystem (ext) was released as the first filesystem that was created specifically for Linux. It already made use of the VFS API that was introduced at the same time. In 1993, ext was superseded by ext2, and later by ext3 (2001) and ext4 (2008). While ext2 is a separate filesystem, the ext3 and ext4 filesystems share the same implementation.
Over time, many new features were added to ext3 and later to ext4. For example, journaling was one of the main features that ext3 added on top of ext2 while delayed allocation and on-disk data structure checksumming came with ext4. In the remainder of this section we look at the way the ext* filesystems lay out their data and metadata blocks and describe some of the features of ext3 and ext4.
All ext* filesystems lay out the space of the underlying block device in a traditional way, inspired by the original Unix filesystem, ufs. The available blocks are partitioned into block groups. Each block group is typically 128M large and always contains its own set of tables and bitmaps that manage the blocks of the group. If feasible, the data blocks of each file are kept in the same block group as its inode and the containing directory to avoid unnecessary seeks. As of ext4, block groups can be combined to form larger, so-called flexible block groups to improve metadata locality and to have large files laid out sequentially.
Inodes are referenced in the inode table, and a bitmap keeps track of allocated and unallocated inodes. A copy of the filesystem superblock is stored in several other block groups since the superblock is critical for the integrity of the filesystem.
The directory entries of an ext or ext2 filesystem are laid out in the traditional way. That is, the directory names and associated information like file type and the inode number are listed in the data blocks that correspond to the directory. Directories can be imagined as a 3-column table where each line contains an inode number, a file type number, and the path component. Since searching a linear array performs poorly, ext3 implemented B-trees to speed up name lookups in large directories. The nodes of the tree are keyed by the hashes of the names of the directory entries.
If journaling is enabled, metadata updates are always journaled (i.e., a log record is written to the journal first). However, this is not always the case for data blocks. The data mount option for ext3 and ext4 specifies how writeout of data blocks works. The following three journaling modes offer different trade-offs between speed and data integrity.

- data=journal: both data and metadata blocks are written to the journal before they are written to their final location. This is the safest but also the slowest mode.
- data=ordered (the default): data blocks are written directly rather than to the journal (unlike data=journal), but the update of the corresponding metadata blocks is deferred until all data blocks have been written.
- data=writeback: data blocks may be written out before or after the corresponding metadata is committed to the journal, so after a crash a file can contain stale data.

The ext2 and ext3 filesystems employ the traditional indirect block scheme, which is basically a table of block numbers which map to the blocks that comprise the contents of the file. One feature of ext4 is extent trees, a more efficient data structure to describe this mapping, because file contents are often laid out sequentially on the underlying block device. In this case, if the file is large, it saves metadata space if the block numbers are not stored as a long list but as extents, that is, ranges of successive blocks. Extents have to be organized in a tree to quickly map a given file offset to the block number that contains the file contents at this offset. The first few extents of the extent tree, including its root, are stored in the inode itself. Only files which need more extents require extra metadata blocks for the nodes of the extent tree.
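The extent mapping of a file can be inspected from user space. The following sketch uses the generic FIEMAP ioctl (similar in spirit to what filefrag(8) and xfs_bmap(8) report); it requests at most 32 extents and prints their logical and physical positions.

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argc > 1 ? argv[1] : "/etc/fstab", O_RDONLY);
	struct fiemap *fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
	unsigned i;

	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = 32;	/* room for up to 32 extents */
	if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("fiemap");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++)
		printf("logical %llu -> physical %llu, %llu bytes\n",
		       (unsigned long long)fm->fm_extents[i].fe_logical,
		       (unsigned long long)fm->fm_extents[i].fe_physical,
		       (unsigned long long)fm->fm_extents[i].fe_length);
	return 0;
}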
Filesystems often need to be grown, but sometimes it is also handy to shrink a filesystem, for example to redistribute the storage between the logical volumes of a volume group of fixed size. A feature which distinguishes ext* from xfs is that ext2, ext3 and ext4 filesystems can be shrunk while xfs filesystems can only be grown. However, to shrink an ext* filesystem, the filesystem must be offline. That's in contrast to online growing, which is supported for both ext* and xfs.
To shrink a filesystem which is stored on a logical volume,
one needs to convey the new size to both the device mapper and the
filesystem. It is usually a good idea to run fsck(8)
after
the filesystem has been unmounted and before resize2fs
is run to shrink it. After the filesystem has been shrunk, the next
step is to shrink also the underlying LV. Of course the new size of
the LV must not be smaller than the size of the shrunk filesystem. To
prevent this from happening, for example due to rounding errors,
it's best to make the LV slightly larger than necessary and then
enlarge the filesystem to the maximal possible size by running
resize2fs(8)
without specifying the new size.
- Examine the superblock of an ext4 filesystem by running dumpe2fs -h <device>.
- Explore the tunables below /sys/fs/ext4.
- Explain how the contents of a file foo can be lost after a crash even if the filesystem is mounted with data=ordered (the default journaling mode). How does the auto_da_alloc mount option of ext4 try to mitigate the chance of data loss?
- Have a look at the xfs_repair(8) man page.
Project quotas used to be an xfs feature, but the functionality has been made generic and is therefore available to other filesystems as well. Besides xfs, also ext4 supports project quotas.
To limit the size of an arbitrary subtree, a special inode flag,
XFS_DIFLAG_PROJINHERIT
, is used. This flag indicates
that the directory and all inodes created in the directory inherit
the project ID of the directory. Hence the act of creating a file
in a XFS_DIFLAG_PROJINHERIT
marked directory associates
the new file with a specific project ID. New directories also get
marked with XFS_DIFLAG_PROJINHERIT
so the behaviour is
propagated down the directory tree.
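In practice this is usually configured with xfs_quota(8), but the underlying mechanism can be sketched in C via the generic fsxattr ioctls; the directory path and project ID below are made up, and FS_XFLAG_PROJINHERIT is the generic counterpart of the xfs flag mentioned above.

#include <fcntl.h>
#include <linux/fs.h>		/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_PROJINHERIT */
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/xfs/accounted-tree", O_RDONLY | O_DIRECTORY);
	struct fsxattr fsx;

	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return 1;
	fsx.fsx_projid = 42;			/* hypothetical project ID */
	fsx.fsx_xflags |= FS_XFLAG_PROJINHERIT;	/* new inodes inherit the ID */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		return 1;
	return close(fd);
}

The quota limits for the project are then set with xfs_quota(8), and the filesystem has to be mounted with project quota accounting enabled (the prjquota mount option).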
Project quota is accounted for when moving into an accounted directory tree, but not when moving out of a directory tree into an unaccounted location. Moreover, one can create hard links to an accounted file in an uncontrolled destination (as the inode is still accounted). But it is not allowed to link from an accounted directory into a destination with a different project ID.
Project IDs may be mapped to names through the
/etc/projid
and /etc/projects
configuration
files.
As files are being written, xfs allocates extra blocks beyond the current end of file, anticipating that further writes will arrive to extend the file. The preallocation size is dynamic and depends mainly on the size of the file. When the file is closed, the unneeded extra blocks are reclaimed.
The speculatively preallocated post-EOF blocks help to minimize file fragmentation, but they can cause confusion because they are accounted identically to other blocks, making files appear to use more data blocks than expected.
If the system crashes while preallocated post-EOF blocks exist, the space is recovered by the normal reclaim mechanism the next time the affected file is opened and then closed again.
The reverse mapping feature was implemented in 2018. It adds yet another B-tree to the xfs on-disk data structures: the reverse mapping tree, which allows the filesystem to look up the owner of a given block (if any). For example, if the underlying storage device reports that a certain block went bad, and that block happens to contain contents of a regular file, the reverse mapping tree yields the corresponding inode and the file offset.
Another use of the reverse mapping tree is filesystem scrubbing, a data integrity technique where a kernel thread runs in the background to check the on-disk data structures for consistency while the filesystem is mounted.
Since reverse mapping imposes some performance overhead, the feature is disabled by default.
The number of allocation groups (AGs) of an xfs filesystem is determined when the filesystem is created with mkfs(8). mkfs.xfs(8) tries to figure out the characteristics of the block device and pick a suitable number of AGs. However, in some cases, for example if the filesystem is stored on a hardware raid array or on a bcache device, it can not detect the geometry of the underlying storage. Explicitly specifying the number of AGs is probably a good idea in this case. As a rule of thumb, 32 AGs should be enough for most cases.
The number of AGs can not be changed after the filesystem has been created. However, when an xfs filesystem is grown, new AGs are added, and care should be taken to align the device size to a multiple of the AG size. Generally, one should not increase the filesystem many times in small steps.
mkfs.xfs(8) tries hard to choose reasonable defaults, but for hardware raid arrays, the command has no way to tell the raid level and the number of data disks in the array. In this case it is beneficial to specify the sunit and swidth options to mkfs.xfs(8) explicitly.
- Compare the output of df -h $P and df -hi $P, where P is a path on which project quotas are enforced.
- Create hard links between accounted and unaccounted directories with the ln command and discuss the result.
- Run xfs_bmap -v file for a large file on an xfs filesystem to see the extents of the file.
- Run xfs_logprint -t to dump the log records of a busy xfs filesystem.
- Run xfs_info(8) on an xfs mountpoint and examine the values shown in the data section.
- Run xfs_info(8) on an existing xfs filesystem and determine whether the device size is an integer multiple of the AG size. Discuss the relevance of this property.
- Create an xfs filesystem on a logical volume so that the device size is an integer multiple of the AG size. Figure out suitable lvcreate and mkfs.xfs(8) commands to achieve this. Verify the result with xfs_info(8) and df(1).
The nfs protocols are built on top of a concept called remote
procedure call (rpc), which is based on an encoding format known
as external data representation (xdr). The rpcs which are
provided by the nfs server are closely related to filesystem-specific
system calls like read(2), write(2), link(2), rename(2),
mkdir(2)
etc. Therefore an introduction to nfs naturally starts
with rpc and xdr.
The functionality of a network service can often be divided into
the low-level part and the application-level part. The low-level
part talks to the kernel to establish the connection and to send
and receive data, using system calls like socket(2), bind(2),
listen(2), connect(2), recv(2), send(2)
, etc. This part is
independent of the application layer which is only concerned with
the network protocol of the service. For a service like nfs which
combines more than one network protocol, it makes sense to abstract
out the common low-level part. The rpc framework was designed in 1976
to provide such an abstraction. It supports a variety of transports
including tcp and udp. With rpc, a program running on one computer can
execute a function on a different computer. The functions that can
be called in this manner, the rpc services, are identified
by a program number and the version number. Originally developed by
Sun Microsystems in the 1980s, rpc is still in use today, sometimes
still under the old "sunrpc" name.
In general, the called procedure runs on a different system than the calling procedure, so the client and server processes don't share the same address space, and no memory references can be passed. Instead, data structures must be serialized first, i.e., converted to a certain transfer format that can be stored in a single memory buffer, the xdr buffer, which is then sent to the server. The received xdr buffer is de-serialized (decoded) by the server, possibly in a different way. For example, the server might store the bytes which describe an integer value in a different order than the client to meet the requirements of its CPU (little/big endian). The xdr API offers routines to convert many predefined data types (int, string, etc.) to an xdr buffer or vice versa. This relieves the programmer of such details as much as possible.
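A minimal sketch of xdr encoding with the classic sunrpc API (recent glibc versions have dropped the built-in sunrpc code, so this may require libtirpc, e.g. compiling against its headers and linking with -ltirpc):

#include <rpc/rpc.h>
#include <stdio.h>

int main(void)
{
	char buf[128];
	XDR xdrs;
	int answer = 42;
	char name[] = "nfs4";
	char *np = name;

	/* encode an int and a string into the xdr buffer */
	xdrmem_create(&xdrs, buf, sizeof(buf), XDR_ENCODE);
	if (!xdr_int(&xdrs, &answer) || !xdr_string(&xdrs, &np, sizeof(name)))
		return 1;
	/* buf now holds the serialized (big-endian, padded) representation */
	printf("encoded %u bytes\n", xdr_getpos(&xdrs));
	return 0;
}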
To activate rpc on a system, the rpcbind(8)
daemon
(formerly known as portmapper) must be running. This daemon keeps track of the rpc services registered on the system, such as the mount protocol, the lock manager, the quota daemon, and the nfs service itself. It communicates with
rpc clients by listening on a well-known port. Clients first send a
get_port
request to rpcbind(8)
in order to
find out the port number which corresponds to the procedure they are
interested in. For example, an nfs client which intends to mount an
nfs-exported directory requests the port number of the mount procedure
from rpcbind(8)
. A second request is then made to actually
mount the filesystem. The exercises of this section ask the reader to
run the rpcinfo(8)
tool to show the available procedures
and their port numbers on the specified server.
The input format for rpc is the rpc language (rpcl),
which is similar to C. This format fully describes the application
protocol, including all procedures and data types. RFC 7531 contains
the full xdr description of nfs4 in rpcl. The rpcgen(1)
protocol compiler generates C code from rpcl input. The C code is
then compiled to generate application code which implements the
protocol described by the input. rpcgen(1)
offers
multiple application interfaces which provide different degrees of
control over the rpc internals.
The nfs protocol versions 2 and 3 are stateless, which means that by design the server does not keep track of what clients do. For example, the server does not remember which files are currently open. Instead, the client tracks open files and the current offset of each open file, translating application requests into suitable protocol messages. While statelessness simplifies crash recovery for both the client and the server, it also has downsides. For example, file locking requires the server to maintain the existing locks and the list of clients which are holding them. Since this can not be done with a stateless protocol, another rpc service, the lock daemon (lockd), was added. To recover the state of locks after a server reboot, yet another rpc service, the status daemon (statd), had to be introduced. This design added complexity for no real gain, which is why nfs4 departed from the previous versions by introducing state. With a stateful protocol it became possible to combine all related rpc services into a single service which uses a single TCP port. This simplifies the implementation and also allows for compound operations where the client sends more than one request in a single rpc call.
File handles describe the file or directory a particular operation is going to operate upon. For the nfs clients, file handles are opaque blobs that can only be tested for equality, but which can not be interpreted in any way. However, for the nfs server a file handle identifies the corresponding file or directory. Most protocol requests include a file handle. For example, the LOOKUP and MKDIR operations both return a file handle to the nfs client.
A file handle consists of three identifiers: a filesystem ID,
an inode number and the so-called generation number. The
filesystem ID and the inode number also exist for local files. They
are derived from the statvfs
structure that describes
the exported filesystem and the stat
structure of the
inode, respectively. The generation number, however, is only needed
for network file systems. Roughly speaking, the generation number
counts how many times the inode has been re-used. This is necessary to
prevent clients from accessing a file through an existing file handle
if the file was deleted on the server and its inode number has been
re-used subsequently. File handles are based on leases:
The client periodically talks to the server to update its leases.
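On Linux, user space can obtain a handle of the same opaque flavor for a local path with the name_to_handle_at(2) system call; the path below is just an example, and re-opening by handle later requires CAP_DAC_READ_SEARCH.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	int mount_id, i;

	fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(AT_FDCWD, "/etc/fstab", fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}
	printf("type %d, %u bytes:", fh->handle_type, fh->handle_bytes);
	for (i = 0; i < (int)fh->handle_bytes; i++)
		printf(" %02x", fh->f_handle[i]);	/* opaque handle contents */
	printf("\n");
	return 0;
}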
There is a deep interaction between file handles and the dentry cache of the vfs. Without nfs, a filesystem can rely on the following "closure" property: for any positive dentry, all its parent directories are also positive dentries. This is no longer true if a filesystem is exported. Therefore the filesystem maps file handles sent to nfs clients to disconnected dentries. For local files, any process whose cwd is on the filesystem contributes to the reference counter of the dentry that corresponds to the directory, and thus prevents the filesystem from being unmounted. For nfs, this is not possible. More generally, remote applications need a way to refer to a particular dentry that is stable across renames, truncations, and server reboots.
Several rpcs return file attributes, i.e., the inode
information which is available for local filesystems through the
stat(2)
system call. For example, the LOOKUP
rpc returns a new file handle and the attributes that are associated
with the given path, and the GETATTR
rpc returns the
current attributes of the file which corresponds to an existing
file handle. By default, nfs clients cache these attributes. However,
since metadata can change at any time due to file operations from other
clients, the cached information can become stale. Therefore attributes
are cached for only a few seconds, which is thus the duration of the
time window during which metadata modifications caused on a different
nfs client remain undetected. Reducing this time window can result in
flooding the server with GETATTR
requests while extending
it increases the chance of returning stale cached data or metadata
to the application. With the noac
mount option, the
client asks the server every time it needs to access file metadata.
However, the option also prohibits data caching, just like
the sync
option. This severely impacts performance.
Changes to directories are handled similarly. To detect when directory entries have been added or removed on the server, the client watches the directory mtime (nfsv2 and nfsv3) or change attribute (nfsv4). When the client detects a change, it drops all cached attributes for that directory. Since the directory's mtime and the change attributes are cached attributes, it may take some time before a client notices directory changes.
nfs clients are usually allowed to cache write operations because the write caches increase client performance significantly and reduce the load of the server at the same time, allowing it to support more clients. However, one side effect of write caching is that other clients which access the same file at the same time will not see the changes immediately. The consistency guarantees of a network file system describe the semantics of such concurrent file operations.
The nfs versions 2 and 3 provide weak cache consistency which notifies clients about changes made by other clients before and after an rpc. This concept turned out to be problematic, so nfsv4 replaced weak cache consistency by close-to-open cache consistency, which means that an nfs client is only guaranteed to see the effects of another client's write operation if it opens the file after the client that wrote to the file has closed it.
To illustrate close-to-open cache consistency, consider a scenario where three nfs clients access the same file. The first client opens the file and writes to it while the other two clients only perform read operations. With close-to-open cache consistency, a client which opens the file after the writer has closed it is guaranteed to see the write operation, while a client whose open happened before the writer closed the file has no such guarantee.
nfs4 introduced a per-file state management feature called file delegation. Once a file has been delegated to a client, the server blocks write access to the file for other nfs clients and for local processes. Therefore the client may assume that the file does not change unexpectedly. This cache-coherency guarantee can improve performance because the client may cache all write operations for this file, and only contact the server when memory pressure forces the client to free memory by writing back file contents.
A drawback of file delegations is that they delay conflicting open
requests by other clients because existing delegations must be recalled
before the open request completes. This is particularly important if
an nfs client which is holding a delegation gets disconnected from
the network. To detect this condition, clients report to the server
that they are still alive by periodically sending a RENEW
request. If no such request has arrived for the lease time
(typically 90 seconds), the server may recall any delegations it
has granted to the disconnected client. This allows accesses from
other clients that would normally be prevented because of the
delegation.
However, the server is not obliged to recall uncontested delegations for clients whose lease period has expired. In fact, newer Linux NFS server implementations retain the uncontested state of unresponsive clients for up to 24 hours. This so-called courteous server feature was introduced in Linux-5.19 (released in 2022).
Let us finally remark that the delegations as discussed above work only for regular files. NFS versions up to and including 4.0 do not grant delegations for directories. With nfs4.1 an nfs client may ask the server to be notified whenever changes are made to the directory by another client. Among other benefits, this feature allows for strong directory cache coherency. However, as of 2022, directory delegations are not yet implemented by Linux.
Many applications employ the following old trick to store temporary
data without leaving a stale temp file behind in case the process
crashes or is killed with SIGKILL
. They create and open
a temporary file, then call unlink(2)
to disassociate
the path from the filesystem tree while retaining the file descriptor
for subsequent I/O operations.
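A minimal sketch of this trick (the path is made up and error handling is omitted); on an nfs mount, the unlink(2) below is what triggers the silly rename described next.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[16] = "";
	int fd = open("/tmp/scratch.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

	unlink("/tmp/scratch.tmp");	/* path is gone, data blocks are not */
	write(fd, "temporary data", 14);
	pread(fd, buf, sizeof(buf) - 1, 0);
	printf("read back: %s\n", buf);	/* still works on the unlinked file */
	return 0;			/* blocks are freed on close/exit */
}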
With NFS this does not work because the file descriptor exists
only on the client, and the server doesn't know about it. Consequently
the normal unlink(2)
call on the server would delete
the file and free its data blocks. This is why the nfs client just
renames the file to something like .nfs12345
if an application calls unlink(2)
to remove it while it
is still open. Only after the last file descriptor that refers to the silly-renamed file has been closed does the client remove the
file by issuing an appropriate rpc.
This approach is not perfect. For one, if the client crashes, a
stale .nfs12345
file remains on the server. Second, since
silly renames are only known to the nfs client, bad things happen if a
different client removes the file. Finally, if an application running
on a client removes the last regular file in a directory, and this
file got silly-renamed because it was still held open, a subsequent
rmdir
will fail unexpectedly with Directory not
empty
. Version 4.1 of the NFS protocol finally got rid of
silly renames: an NFS4.1 server knows when it is safe to unlink a
file and communicates this information to the client.
The file handle which an nfs client received through some earlier
rpc can become invalid at any time due to operations on different
hosts. This happens, for example, if the file was deleted on the server
or on a different nfs client, or when the directory that contains
the file is no longer exported by the server due to a configuration
change. Subsequent attempts to use this file handle for rpcs then
fail with the ESTALE
error.
The exercises below ask the reader to provoke silly-renamed files and stale file handles.
The nfs client supports many mount options; see nfs(5) for details. We only cover a couple of the more interesting ones with respect to performance.

- soft/hard. hard: nfs requests are retried indefinitely. soft: requests are failed eventually.
- sync/async. async: the nfs client delays sending application writes to the server. sync: writes cause data to be flushed to the server before the system call returns.

- Run df -h and mount -t nfs -t nfs4 to see the nfs filesystems which are mounted.
- Run /usr/sbin/rpcinfo -b 100003 3 | awk '{print $2}' | sort | uniq to list all NFS servers.
- Discuss the difference between the sync mount option for nfs and the sync export option of nfsd.
- Run /sbin/showmount -e and rpcinfo against an nfs server.
- Run collectl -s F -i 5 and discuss the output.
- On an nfs mount, run cat > foo &. Note that the cat process automatically receives the STOP signal. Run rm foo; ls -ltra. Read section D2 of the nfs HOWTO for the explanation.
- On an nfs mount, run { while :; do echo; sleep 1; done; } > baz &. What happens if you remove the file on a different nfs client?
- Discuss how an unreachable nfs server affects the load average (/proc/loadavg). Figure out how to read the values of /proc/loadavg as a number of suitable type.
- Explain the purpose of the nfsd and the rpc_pipefs pseudo filesystems. Hint: See Documentation/filesystems/nfs/nfs.txt of the Linux source code.
fd = open("foo.new", ...); write(fd, "new content", ...); close(fd); rename("foo.new", "foo");
See also Documentation/filesystems/vfs.txt of the Linux kernel source.