LVM

"Who the heck is General Failure, and why is he reading my disk?" -- Unknown
The main task of LVM is the management of block devices, so it is
natural to start an introduction to LVM with a section on the Linux
block layer, which is the central component in the Linux kernel
for the handling of persistent storage devices. The mission of the
block layer is to provide a uniform interface to different types
of storage devices. The obvious in-kernel users of this interface
are the file systems and the swap subsystem. Stacking
device drivers like LVM, Bcache and MD also access block devices
through this interface to create virtual block devices from other block
devices. Some user space programs (fdisk, dd, mkfs, ...)
also need to access block devices. The block layer allows them to
perform their task in a well-defined and uniform manner through
block-special device files.
The userspace programs and the in-kernel users interact with the block layer by sending read or write requests. A bio is the central data structure that carries such requests within the kernel. Bios may contain an arbitrary amount of data. They are given to the block layer to be queued for subsequent handling. Often a bio has to travel through a stack of block device drivers where each driver modifies the bio and sends it on to the next driver. Typically, only the last driver in the stack corresponds to a hardware device.
Besides requests to read or write data blocks, there are various other bio requests that carry SCSI commands like FLUSH, FUA (Force Unit Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits stable storage. FLUSH asks the device to write out the contents of its volatile write cache while a FUA request carries data that should be written directly to the device, bypassing all caches. UNMAP/TRIM is a SCSI/ATA command which is only relevant to SSDs. It is a promise of the OS not to read the given range of blocks any more, so the device is free to discard the contents and return arbitrary data on the next read. This helps the device to level out the number of times the flash storage cells are overwritten (wear-leveling), which improves the durability of the device.
The first task of the block layer is to split incoming bios if necessary to make them conform to the size limit or the alignment requirements of the target device, and to batch and merge bios so that they can be submitted as a unit for performance reasons. The bios processed in this way then form an I/O request which is handed to an I/O scheduler (also known as elevator).
Traditionally, the schedulers were designed for rotating disks. They implemented a single request queue and reordered the queued I/O requests with the aim of minimizing disk seek times. The newer multi-queue schedulers mq-deadline, kyber, and bfq (budget fair queueing) aim to max out even the fastest devices. As implied by the name "multi-queue", they implement several request queues, the number of which depends on the hardware in use. This has become necessary because modern storage hardware allows multiple requests to be submitted in parallel from different CPUs. Moreover, with many CPUs the locking overhead required to put a request into a queue increases. Per-CPU queues allow for per-CPU locks, which decreases queue lock contention.
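The scheduler in use can be inspected and changed at runtime through sysfs. A minimal sketch (device names depend on the system; switching the scheduler requires root):

```shell
# Print the scheduler sysfs file for each block device; the scheduler
# which is currently active is shown in brackets.
for f in /sys/block/*/queue/scheduler; do
	[ -e "$f" ] || continue		# no block devices at all
	dev=${f%/queue/scheduler}
	printf '%s: %s\n' "${dev##*/}" "$(cat "$f")"
done
# As root, the scheduler can be switched on the fly, for example:
#	echo mq-deadline > /sys/block/sda/queue/scheduler
```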
We will take a look at some aspects of the Linux block layer and at the various I/O schedulers. An exercise on loop devices enables the reader to create block devices for testing. These will come in handy in the subsequent sections on LVM-specific topics.
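A throw-away block device can be created from a plain file. A sketch (the file name and mount point are examples; losetup, mkfs.xfs and mount require root and loop device support):

```shell
# Create a 1G sparse backing file, then (as root) turn it into a loop
# device carrying an XFS file system.
truncate -s 1G disk.img
if [ "$(id -u)" -eq 0 ] && dev=$(losetup --find --show disk.img); then
	mkfs.xfs -q "$dev"		# $dev is e.g. /dev/loop0
	mkdir -p /mnt/test
	mount "$dev" /mnt/test
else
	echo "skipping losetup/mkfs/mount (needs root and loop support)"
fi
```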
- Run find /dev -type b to get the list of all block devices on your system. Explain which is which.
- Examine the files in /sys/block/sda, in particular /sys/block/sda/stat. Search the web for Documentation/block/stat.txt for the meaning of the numbers shown. Then run iostat -xdh sda 1.
- Examine the files in /sys/block/sda/queue.
- Run lsblk and discuss the output. Too easy? Run lsblk -o KNAME,PHY-SEC,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,RQ-SIZE,ROTA,SCHED.
- Create a file and use the losetup(8) command to create a loop device from the file. Create an XFS file system on the loop device and mount it.
- Given a block device node in /dev, how can one tell that it is a loop device?
- Explain the difference between a loop device created with losetup(8) and the loopback device used for network connections from the machine to itself.

Getting started with the Logical Volume Manager (LVM) requires getting used to a minimal set of vocabulary. This section introduces the words named in the title of the section, and a couple more. The basic concepts of LVM are then described in terms of these words.
A Physical Volume (PV, grey) is an arbitrary block device which contains a certain metadata header (also known as superblock) at the start. PVs can be partitions on a local hard disk or an SSD, a soft- or hardware raid, or a loop device. LVM does not care. The storage space on a physical volume is managed in units called Physical Extents (PEs, yellow). The default PE size is 4M.
A Volume Group (VG, green) is a non-empty set of PVs with a name and a unique ID assigned to it. A PV can but does not need to be assigned to a VG. If it is, the ID of the associated VG is stored in the metadata header of the PV.
A Logical Volume (LV, blue) is a named block device which is provided by LVM. LVs are always associated with a VG and are stored on that VG's PVs. Since LVs are normal block devices, file systems of any type can be created on them, they can be used as swap storage, etc. The chunks of an LV are managed as Logical Extents (LEs, orange). Often the LE size equals the PE size. For each LV there is a mapping between the LEs of the LV and the PEs of the underlying PVs, and the PEs can be spread across multiple PVs.
VGs can be extended by adding additional PVs, or reduced by removing unused devices, i.e., those with no PEs allocated on them. PEs may be moved from one PV to another while the LVs are active. LVs may be grown or shrunk. To grow an LV, there must be enough space left in the VG. Growing an LV does not magically grow the file system stored on it, however. To make use of the additional space, a second, file system specific step is needed to tell the file system that its underlying block device (the LV) has grown.
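The two-step grow could look like this. A sketch, assuming an LV tvg/tlv1 carrying an XFS file system mounted on /mnt/1 (both steps need root):

```shell
lvextend --size +1G /dev/tvg/tlv1	# step 1: grow the LV by 1G
xfs_growfs /mnt/1			# step 2: tell XFS about the new space
# For ext4 the second step would be "resize2fs /dev/tvg/tlv1" instead.
# lvextend can perform both steps at once:
#	lvextend --resizefs --size +1G /dev/tvg/tlv1
```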
The exercises of this section illustrate the basic LVM concepts and the essential LVM commands. They ask the reader to create a VG whose PVs are loop devices. This VG is used as a starting point in subsequent chapters.
- Create two loop devices /dev/loop1 and /dev/loop2. Make them PVs by running pvcreate.
- Create a VG tvg (test volume group) from the two loop devices and two 3G large LVs named tlv1 and tlv2 on it. Run the pvcreate, vgcreate, and lvcreate commands with -v to activate verbose output and try to understand each output line.
- Run pvs, vgs, lvs, lvdisplay, pvdisplay and examine the output.
- Run lvdisplay -m to examine the mapping of logical extents to PVs and physical extents.
- Run pvs --segments -o+lv_name,seg_start_pe,segtype to see the map between physical extents and logical extents.
- Remove the LVs with lvremove. Recreate them, but this time use the --stripes 2 option to lvcreate. Explain what this option does and confirm with a suitable command.
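One possible solution sketch for the first two steps (run as root; sizes and names follow the exercise text):

```shell
pvcreate -v /dev/loop1 /dev/loop2	# write an LVM metadata header to each device
vgcreate -v tvg /dev/loop1 /dev/loop2	# combine the two PVs into the VG
lvcreate -v --size 3G --name tlv1 tvg
lvcreate -v --size 3G --name tlv2 tvg
pvs --segments -o+lv_name,seg_start_pe,segtype	# where the LEs ended up
```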
The kernel part of the Logical Volume Manager (LVM) is called
device mapper (DM), which is a generic framework to map
one block device to another. Applications talk to the Device Mapper
via the libdevmapper library, which issues requests
to the /dev/mapper/control
character device using the
ioctl(2)
system call. The device mapper is also accessible
from scripts via the dmsetup(8)
tool.
A DM target represents one particular mapping type for ranges
of LEs. Several DM targets exist, each of which creates and
maintains block devices with certain characteristics. In this section
we take a look at the dmsetup
tool and the relatively
simple mirror target. Subsequent sections cover other targets
in more detail.
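To get a feel for dmsetup, here is a minimal sketch (root required) which sets up a linear target mapping a new device onto the whole of /dev/loop1:

```shell
sectors=$(blockdev --getsz /dev/loop1)	# device size in 512-byte sectors
echo "0 $sectors linear /dev/loop1 0" | dmsetup create lin
dmsetup table lin	# show the mapping of /dev/mapper/lin
dmsetup remove lin	# tear the mapping down again
```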
- Run dmsetup targets to list all targets supported by the currently running kernel. Explain their purpose and typical use cases.
- Starting with the tvg VG, remove tlv2. Convince yourself by running vgs that tvg is 10G large, with 3G being in use. Run pvmove /dev/loop1 to move the used PEs of /dev/loop1 to /dev/loop2. After the command completes, run pvs again to see that /dev/loop1 has no more PEs in use.
- Create another loop device /dev/loop3, make it a PV and extend the VG with vgextend tvg /dev/loop3. Remove tlv1. Now the LEs of tlv2 fit on any of the three PVs. Come up with a command which moves them to /dev/loop3.
- Remove the unused PVs from the VG with vgreduce -a. Why are they still listed in the pvs output? What can be done about that?
- Describe a use case for the pvmove(8) command to move all PEs of one PV to different PVs in the same VG.
- Guess how pvmove is implemented on top of dm-mirror, then verify your guess by reading the "NOTES" section of the pvmove(8) man page.

LVM snapshots are based on the CoW optimization strategy described earlier in the chapter on Unix Concepts. Creating a snapshot means to create a CoW table of the given size. Just before an LE of a snapshotted LV is about to be written to, its contents are copied to a free slot in the CoW table. This preserves an old version of the LV, the snapshot, which can later be reconstructed by overlaying the CoW table atop the LV.
Snapshots can be taken from an LV which contains a mounted file system, while applications are actively modifying files. Without coordination between the file system and LVM, the file system most likely has memory buffers scheduled for writeback. These outstanding writes did not make it to the snapshot, so one cannot expect the snapshot to contain a consistent file system image. Instead, it is in a state similar to that of a regular device after an unclean shutdown. This is not a problem for XFS and EXT4, as both are journalling file systems, which were designed with crash recovery in mind. At the next mount after a crash, journalling file systems replay their journal, which results in a consistent state. Note that this implies that even a read-only mount of the snapshot device has to write to the device.
- Create a snapshot snap_tlv1 of the tlv1 LV by using the -s option to lvcreate(8). Predict how much free space is left in the VG. Confirm with vgs tvg.
- Create an EXT4 file system on tlv1 by running mkfs.ext4 /dev/tvg/tlv1. Guess how much of the snapshot space has been allocated by this operation. Check with lvs tvg/snap_tlv1.
- Remove the snapshot with lvremove and recreate it. Repeat the previous step, but this time run mkfs.xfs to create an XFS file system. Run lvs tvg/snap_tlv1 again and compare the used snapshot space to the EXT4 case. Explain the difference.
- At this point both tlv1 and snap_tlv1 contain a valid XFS file system. Mount the file systems on /mnt/1 and /mnt/2.
- Run dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 * 1024)) to create a 100M large file on tlv1. Check that /mnt/2 is still empty. Estimate how much of the snapshot space is used and check again.
- Repeat the dd command 5 times and run lvs again. Explain why the used snapshot space did not increase.
- Unmount /mnt/1 and run lvremove tvg/tlv1, confirming the prompt.
- There is an lvconvert command which swaps the roles of an LV and its snapshot. Explain why this solves the "bad upgrade" problem outlined above.

The term "thin provisioning" is just a modern buzzword for over-subscription. Both terms mean to give the appearance of having more resources than are actually available. This is achieved by on-demand allocation. On Linux, thin provisioning is implemented as a DM target called dm-thin. This code first made its appearance in 2011 and was declared stable two years later. These days it should be safe for production use.
The general problem with thin provisioning is of course that bad things happen when the resources are exhausted because the demand has increased before new resources were added. For dm-thin this can happen when users write to their allotted space, causing dm-thin to attempt allocating a data block from a volume which is already full. This usually leads to severe data corruption because file systems are not really prepared to handle this error case and treat it as if the underlying block device had failed. dm-thin does nothing to prevent this, but one can configure a low watermark. When the number of free data blocks drops below the watermark, a so-called dm-event will be generated to notify the administrator.
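With lvm2, the dmeventd daemon can go one step further and auto-extend the pool instead of merely reporting the event. A sketch of the relevant settings in the activation section of /etc/lvm/lvm.conf (the threshold and percentage values below are examples):

```
activation {
	# Auto-extend a thin pool when it becomes more than 70% full,
	# by 20% of its current size each time.
	thin_pool_autoextend_threshold = 70
	thin_pool_autoextend_percent = 20
}
```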
One highlight of dm-thin is its efficient support for an arbitrary depth of recursive snapshots, called dm-thin snapshots in this document. With the traditional snapshot implementation, recursive snapshots quickly become a performance issue as the depth increases. With dm-thin one can have an arbitrary subset of all snapshots active at any point in time, and there is no ordering requirement on activating or removing them.
The block devices created by dm-thin always belong to a thin pool which ties together two LVs called the metadata LV and the data LV. The combined LV is called the thin pool LV. Setting up a VG for thin provisioning is done in two steps: First the standard LVs for data and metadata are created. Second, the two LVs are combined into a thin pool LV. The second step hides the two underlying LVs so that only the combined thin pool LV is visible afterwards. Thin provisioned LVs and dm-thin snapshots can then be created from the thin pool LV with a single command.
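The two steps might look as follows. A sketch, run as root; the 5G data size and the 20G virtual size are examples, while the other names and sizes follow the exercises below:

```shell
lvcreate --size 5G --name tdlv tvg	# the data LV
lvcreate --size 500M --name tmdlv tvg	# the metadata LV
lvconvert --type thin-pool --poolmetadata tvg/tmdlv tvg/tdlv
# Only the combined thin pool LV tvg/tdlv is visible now. Create an
# over-subscribed thin LV with a 20G virtual size from it:
lvcreate --virtualsize 20G --thin --name oslv tvg/tdlv
```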
Another nice feature of dm-thin is external snapshots. An external snapshot is one where the origin for a thinly provisioned device is not a device of the pool. Arbitrary read-only block devices can be turned into writable devices by creating an external snapshot. Reads to an unprovisioned area of the snapshot will be passed through to the origin. Writes trigger the allocation of new blocks as usual with CoW. One use case for this is VM hosts which run their VMs on thinly-provisioned volumes but have the base image on some "master" device which is read-only and can hence be shared between all VMs.
Starting with the tvg VG, create and test a thin pool LV by performing the following steps. The "Thin Usage" section of lvmthin(7) will be helpful.

- Remove the tlv1 and tlv2 LVs.
- Create an LV named tdlv (thin data LV) and a 500M LV named tmdlv (thin metadata LV).
- Combine the two LVs into a thin pool LV with lvconvert. Run lvs -a and explain the flags listed below Attr.
- Create a thin LV named oslv (over-subscribed LV).
- Create a file system on oslv and mount it on /mnt.
- Write a loop of the form for ((i = 0; i < 50; i++)); do ... ; done so that each iteration creates a 50M file named file-$i and a snapshot named snap_oslv-$i of oslv.
- Activate one of the snapshots with lvchange -K and try to mount it. Explain what the error message means. Then read the "XFS on snapshots" section of lvmthin(7).
- Run lvs -a. Mount one snapshot (specifying -o nouuid) and run lvs -a again. Why did the free space decrease although no new files were written?
- Predict what lvs -a and df -h /mnt report. Then run the commands to confirm. Guess what happens if you try to create another 3G file? Confirm your guess, then read the section on "Data space exhaustion" of lvmthin(7).
All three implementations named in the title of this chapter are Linux block layer caches. They combine two different block devices to form a hybrid block device which dynamically caches and migrates data between the two devices with the aim of improving performance. One device, the backing device, is expected to be large and slow while the other one, the cache device, is expected to be small and fast.
The simplest setup consists of a single rotating disk and one SSD. The setup shown in the diagram at the left is realistic for a large server with redundant storage. In this setup the hybrid device (yellow) combines a raid6 array (green) consisting of many rotating disks (grey) with a two-disk raid1 array (orange) stored on fast NVMe devices (blue). In the simple setup it is always a win when I/O is performed from/to the SSD instead of the rotating disk. In the server setup, however, it depends on the workload which device is faster. Given enough rotating disks and a streaming I/O workload, the raid6 outperforms the raid1 because all disks can read or write at full speed.
Since block layer caches hook into the Linux block API described earlier, the hybrid block devices they provide can be used like any other block device. In particular, the hybrid devices are file system agnostic, meaning that any file system can be created on them. In what follows we briefly describe the differences between the three block layer caches and conclude with the pros and cons of each.
Bcache is a stand-alone stacking device driver which was included in the Linux kernel in 2013. According to the bcache home page, it is "done and stable". dm-cache and dm-writecache are device mapper targets included in 2013 and 2018, respectively, which are both marked as experimental. In contrast to dm-cache, dm-writecache only caches writes while reads are supposed to be cached in RAM. It has been designed for programs like databases which need low commit latency. Both bcache and dm-cache can operate in writeback or writethrough mode while dm-writecache always operates in writeback mode.
The DM-based caches are designed to leave the decision as to what data to migrate (and when) to user space while bcache has this policy built-in. However, at this point only the Stochastic Multiqueue (smq) policy for dm-cache exists, plus a second policy which is only useful for decommissioning the cache device. There are no tunables for dm-cache while all the bells and whistles of bcache can be configured through sysfs files. Another difference is that bcache detects sequential I/O and separates it from random I/O so that large streaming reads and writes bypass the cache and don't push cached randomly accessed data out of the cache.
bcache is the clear winner of this comparison because it is stable, configurable, and performs better at least on the server setup described above because it separates random and sequential I/O. The only advantage of dm-cache is its flexibility, since cache policies can be switched. But even this remains a theoretical advantage as long as only a single policy for dm-cache exists.
- Recall the TRIM command discussed earlier in this chapter. Most mkfs commands send this command to discard all blocks of the device. Discuss the implications when mkfs is run on a device provided by bcache or dm-cache.
This device mapper target provides encryption of arbitrary block devices by employing the primitives of the crypto API of the Linux kernel. This API provides a uniform interface to a large number of cipher algorithms which have been implemented with performance and security in mind.
The cipher algorithm of choice for the encryption of block devices is the Advanced Encryption Standard (AES), also known as Rijndael, named after the two Belgian cryptographers Rijmen and Daemen who proposed the algorithm in 1998. AES is a symmetric block cipher, that is, a transformation which operates on fixed-length blocks and which is determined by a single key for both encryption and decryption. The underlying algorithm is fairly simple, which makes AES perform well in both hardware and software. The key setup time and the memory requirements are excellent as well. Modern processors of all manufacturers include instructions to perform AES operations in hardware, improving speed and security.
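The symmetry is easy to demonstrate with the openssl(1) command line tool: the same passphrase-derived key both encrypts and decrypts. A sketch (the passphrase, plaintext and file name are examples; PBKDF2 derives the key from the passphrase):

```shell
printf 'attack at dawn' |
	openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -out /tmp/ct.bin
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:secret -in /tmp/ct.bin
# prints: attack at dawn
```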
According to the Snowden documents, the NSA has been doing research on breaking AES for a long time without being able to come up with a practical attack for 256 bit keys. Successful attacks invariably target the key management software instead, which is often implemented poorly, trading security for user-friendliness, for example by storing passwords weakly encrypted, or by providing a "feature" which can decrypt the device without knowing the password.
The exercises of this section ask the reader to encrypt a loop device with AES without relying on any third party key management software.
- Create a file of suitable size and fill it with random data. Could cat /dev/urandom do the same?
- Create a loop device /dev/loop0 from the file.
- A table for the dmsetup(8) command is a single line of the form start_sector num_sectors target_type target_args. Determine the correct values for the first three arguments to encrypt /dev/loop0.
- The target_args for the dm-crypt target are of the form cipher key iv_offset device offset. To encrypt /dev/loop0 with AES-256, cipher is aes, device is /dev/loop0 and both offsets are zero. Come up with an idea to create a 256 bit key from a passphrase.
- The create subcommand of dmsetup(8) creates a device from the given table. Run a command of the form echo "$table" | dmsetup create cryptdev to create the encrypted device /dev/mapper/cryptdev from the loop device.
- Create a file system on /dev/mapper/cryptdev, mount it and create the file passphrase containing the string "super-secret" on this file system.
- Unmount the cryptdev device and run dmsetup remove cryptdev. Run strings on the loop device and on the underlying file to see if either contains the string "super-secret" or the file name passphrase.
- Recreate the cryptdev device, but this time use a different (hence invalid) key. Guess what happens and confirm.
- Write a script which turns off terminal echo (stty -echo), reads a passphrase from stdin and combines the above steps to create and mount an encrypted device.
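One way to approach the key and table exercises: hash the passphrase with SHA-256, which yields exactly 256 bits in hex form, and compute the sector count from the size of the backing file. A sketch (file name, file size and passphrase are examples; only the final dmsetup step, shown as a comment, needs root):

```shell
truncate -s 100M disk.img		# example backing file for the loop device
passphrase=super-secret
key=$(printf '%s' "$passphrase" | sha256sum | awk '{print $1}')	# 64 hex digits = 256 bits
num_sectors=$(( $(stat -c %s disk.img) / 512 ))	# dm tables count 512-byte sectors
table="0 $num_sectors crypt aes $key 0 /dev/loop0 0"
echo "$table"
# As root: losetup /dev/loop0 disk.img && echo "$table" | dmsetup create cryptdev
```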
/*
 * Write an endless stream of cryptographically secure random bytes to
 * stdout. Link with -lcrypto.
 */
#include <openssl/rand.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned char buf[1024 * 1024];

	for (;;) {
		/* RAND_bytes() returns 1 on success. */
		int ret = RAND_bytes(buf, sizeof(buf));
		if (ret <= 0) {
			fprintf(stderr, "RAND_bytes() error\n");
			exit(EXIT_FAILURE);
		}
		ret = write(STDOUT_FILENO, buf, sizeof(buf));
		if (ret < 0) {
			perror("write");
			exit(EXIT_FAILURE);
		}
	}
	return 0;
}