LVM

"Who the heck is General Failure, and why is he reading my disk?" -- Unknown
The main task of LVM is the management of block devices, so it is
natural to start an introduction to LVM with a section on the Linux
block layer, which is the central component in the Linux kernel
for the handling of persistent storage devices. The mission of the
block layer is to provide a uniform interface to different types
of storage devices. The obvious in-kernel users of this interface
are the file systems and the swap subsystem. Stacking
device drivers like LVM, Bcache and MD also access block devices
through this interface to create virtual block devices from other block
devices. Some user space programs (fdisk, dd, mkfs, ...)
also need to access block devices. The block layer allows them to
perform their task in a well-defined and uniform manner through
block-special device files.
The userspace programs and the in-kernel users interact with the block layer by sending read or write requests. A bio is the central data structure that carries such requests within the kernel. Bios may contain an arbitrary amount of data. They are given to the block layer to be queued for subsequent handling. Often a bio has to travel through a stack of block device drivers where each driver modifies the bio and sends it on to the next driver. Typically, only the last driver in the stack corresponds to a hardware device.
Besides requests to read or write data blocks, there are various other bio requests that carry SCSI commands like FLUSH, FUA (Force Unit Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits stable storage. FLUSH asks the device to write out the contents of its volatile write cache while a FUA request carries data that should be written directly to the device, bypassing all caches. UNMAP/TRIM is a SCSI/ATA command which is only relevant to SSDs. It is a promise of the OS not to read the given range of blocks any more, so the device is free to discard the contents and return arbitrary data on the next read. This helps the device to level out the number of times the flash storage cells are overwritten (wear-leveling), which improves the durability of the device.
The first task of the block layer is to split incoming bios if necessary to make them conform to the size limit or the alignment requirements of the target device, and to batch and merge bios so that they can be submitted as a unit for performance reasons. The bios processed in this way then form an I/O request which is handed to an I/O scheduler (also known as elevator).
Traditionally, the schedulers were designed for rotating disks. They implemented a single request queue and reordered the queued I/O requests with the aim of minimizing disk seek times. The newer multi-queue schedulers mq-deadline, kyber, and bfq (budget fair queueing) aim to max out even the fastest devices. As implied by the name "multi-queue", they implement several request queues, the number of which depends on the hardware in use. This has become necessary because modern storage hardware allows multiple requests to be submitted in parallel from different CPUs. Moreover, with many CPUs the locking overhead required to put a request into a queue increases. Per-CPU queues allow for per-CPU locks, which decreases queue lock contention.
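The scheduler in use can be inspected and changed at runtime through sysfs. A minimal sketch (device names depend on the system; switching the scheduler requires root):

```shell
# Print the scheduler sysfs file for each block device; the scheduler
# which is currently active is shown in brackets.
for f in /sys/block/*/queue/scheduler; do
	[ -e "$f" ] || continue		# no block devices at all
	dev=${f%/queue/scheduler}
	printf '%s: %s\n' "${dev##*/}" "$(cat "$f")"
done
# As root, the scheduler can be switched on the fly, for example:
#	echo mq-deadline > /sys/block/sda/queue/scheduler
```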
We will take a look at some aspects of the Linux block layer and at the various I/O schedulers. An exercise on loop devices enables the reader to create block devices for testing. These will come in handy in the subsequent sections on LVM-specific topics.
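A throw-away block device can be created from a plain file. A sketch (the file name and mount point are examples; losetup, mkfs.xfs and mount require root and loop device support):

```shell
# Create a 1G sparse backing file, then (as root) turn it into a loop
# device carrying an XFS file system.
truncate -s 1G disk.img
if [ "$(id -u)" -eq 0 ] && dev=$(losetup --find --show disk.img); then
	mkfs.xfs -q "$dev"		# $dev is e.g. /dev/loop0
	mkdir -p /mnt/test
	mount "$dev" /mnt/test
else
	echo "skipping losetup/mkfs/mount (needs root and loop support)"
fi
```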
- Run find /dev -type b to get the list of all block devices on your system. Explain which is which.
- Examine the files in /sys/block/sda, in particular /sys/block/sda/stat. Search the web for Documentation/block/stat.txt for the meaning of the numbers shown. Then run iostat -xdh sda 1.
- Examine the files in /sys/block/sda/queue.
- Run lsblk and discuss the output. Too easy? Run lsblk -o KNAME,PHY-SEC,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,RQ-SIZE,ROTA,SCHED.
- Create a file and use the losetup(8) command to create a loop device from the file. Create an XFS file system on the loop device and mount it.
- Given a block device node in /dev, how can one tell that it is a loop device?
- Explain the difference between a loop device created with losetup(8) and the loopback device used for network connections from the machine to itself.

Getting started with the Logical Volume Manager (LVM) requires getting used to a minimal set of vocabulary. This section introduces the words named in the title of the section, and a couple more. The basic concepts of LVM are then described in terms of these words.
A Physical Volume (PV, grey) is an arbitrary block device which contains a certain metadata header (also known as superblock) at the start. PVs can be partitions on a local hard disk or an SSD, a soft- or hardware raid, or a loop device. LVM does not care. The storage space on a physical volume is managed in units called Physical Extents (PEs, yellow). The default PE size is 4M.
A Volume Group (VG, green) is a non-empty set of PVs with a name and a unique ID assigned to it. A PV can but does not need to be assigned to a VG. If it is, the ID of the associated VG is stored in the metadata header of the PV.
A Logical Volume (LV, blue) is a named block device which is provided by LVM. LVs are always associated with a VG and are stored on that VG's PVs. Since LVs are normal block devices, file systems of any type can be created on them, they can be used as swap storage, etc. The chunks of an LV are managed as Logical Extents (LEs, orange). Often the LE size equals the PE size. For each LV there is a mapping between the LEs of the LV and the PEs of the underlying PVs, and the PEs can be spread across multiple PVs.
VGs can be extended by adding additional PVs, or reduced by removing unused devices, i.e., those with no PEs allocated on them. PEs may be moved from one PV to another while the LVs are active. LVs may be grown or shrunk. To grow an LV, there must be enough space left in the VG. Growing an LV does not magically grow the file system stored on it, however. To make use of the additional space, a second, file system specific step is needed to tell the file system that its underlying block device (the LV) has grown.
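The two-step grow could look like this. A sketch, assuming an LV tvg/tlv1 carrying an XFS file system mounted on /mnt/1 (both steps need root):

```shell
lvextend --size +1G /dev/tvg/tlv1	# step 1: grow the LV by 1G
xfs_growfs /mnt/1			# step 2: tell XFS about the new space
# For ext4 the second step would be "resize2fs /dev/tvg/tlv1" instead.
# lvextend can perform both steps at once:
#	lvextend --resizefs --size +1G /dev/tvg/tlv1
```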
The exercises of this section illustrate the basic LVM concepts and the essential LVM commands. They ask the reader to create a VG whose PVs are loop devices. This VG is used as a starting point in subsequent chapters.
- Create two loop devices /dev/loop1 and /dev/loop2. Make them PVs by running pvcreate.
- Create a VG tvg (test volume group) from the two loop devices and two 3G large LVs named tlv1 and tlv2 on it. Run the pvcreate, vgcreate, and lvcreate commands with -v to activate verbose output and try to understand each output line.
- Run pvs, vgs, lvs, lvdisplay, pvdisplay and examine the output.
- Run lvdisplay -m to examine the mapping of logical extents to PVs and physical extents.
- Run pvs --segments -o+lv_name,seg_start_pe,segtype to see the map between physical extents and logical extents.
- Remove the LVs with lvremove. Recreate them, but this time use the --stripes 2 option to lvcreate. Explain what this option does and confirm with a suitable command.
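One possible solution sketch for the first two steps (run as root; sizes and names follow the exercise text):

```shell
pvcreate -v /dev/loop1 /dev/loop2	# write an LVM metadata header to each device
vgcreate -v tvg /dev/loop1 /dev/loop2	# combine the two PVs into the VG
lvcreate -v --size 3G --name tlv1 tvg
lvcreate -v --size 3G --name tlv2 tvg
pvs --segments -o+lv_name,seg_start_pe,segtype	# where the LEs ended up
```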
The kernel part of the Logical Volume Manager (LVM) is called
device mapper (DM), which is a generic framework to map
one block device to another. Applications talk to the Device Mapper
via the libdevmapper library, which issues requests
to the /dev/mapper/control
character device using the
ioctl(2)
system call. The device mapper is also accessible
from scripts via the dmsetup(8)
tool.
A DM target represents one particular mapping type for ranges
of LEs. Several DM targets exist, each of which creates and
maintains block devices with certain characteristics. In this section
we take a look at the dmsetup
tool and the relatively
simple mirror target. Subsequent sections cover other targets
in more detail.
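To get a feel for dmsetup, here is a minimal sketch (root required) which sets up a linear target mapping a new device onto the whole of /dev/loop1:

```shell
sectors=$(blockdev --getsz /dev/loop1)	# device size in 512-byte sectors
echo "0 $sectors linear /dev/loop1 0" | dmsetup create lin
dmsetup table lin	# show the mapping of /dev/mapper/lin
dmsetup remove lin	# tear the mapping down again
```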
- Run dmsetup targets to list all targets supported by the currently running kernel. Explain their purpose and typical use cases.
- Starting with the tvg VG, remove tlv2. Convince yourself by running vgs that tvg is 10G large, with 3G being in use. Run pvmove /dev/loop1 to move the used PEs of /dev/loop1 to /dev/loop2. After the command completes, run pvs again to see that /dev/loop1 has no more PEs in use.
- Create another loop device /dev/loop3, make it a PV and extend the VG with vgextend tvg /dev/loop3. Remove tlv1. Now the LEs of tlv2 fit on any of the three PVs. Come up with a command which moves them to /dev/loop3.
- Remove the unused PVs from the VG with vgreduce -a. Why are they still listed in the pvs output? What can be done about that?
- Describe a use case for the pvmove(8) command to move all PEs of one PV to different PVs in the same VG.
- Guess how pvmove is implemented on top of dm-mirror, then verify your guess by reading the "NOTES" section of the pvmove(8) man page.

LVM snapshots are based on the CoW optimization strategy described earlier in the chapter on Unix Concepts. Creating a snapshot means to create a CoW table of the given size. Just before an LE of a snapshotted LV is about to be written to, its contents are copied to a free slot in the CoW table. This preserves an old version of the LV, the snapshot, which can later be reconstructed by overlaying the CoW table atop the LV.
Snapshots can be taken from an LV which contains a mounted file system, while applications are actively modifying files. Without coordination between the file system and LVM, the file system most likely has memory buffers scheduled for writeback. These outstanding writes did not make it to the snapshot, so one cannot expect the snapshot to contain a consistent file system image. Instead, it is in a state similar to that of a regular device after an unclean shutdown. This is not a problem for XFS and EXT4, as both are journalling file systems, which were designed with crash recovery in mind. At the next mount after a crash, journalling file systems replay their journal, which results in a consistent state. Note that this implies that even a read-only mount of the snapshot device has to write to the device.
- Create a snapshot snap_tlv1 of the tlv1 LV by using the -s option to lvcreate(8). Predict how much free space is left in the VG. Confirm with vgs tvg.
- Create an EXT4 file system on tlv1 by running mkfs.ext4 /dev/tvg/tlv1. Guess how much of the snapshot space has been allocated by this operation. Check with lvs tvg/snap_tlv1.
- Remove the snapshot with lvremove and recreate it. Repeat the previous step, but this time run mkfs.xfs to create an XFS file system. Run lvs tvg/snap_tlv1 again and compare the used snapshot space to the EXT4 case. Explain the difference.
- At this point both tlv1 and snap_tlv1 contain a valid XFS file system. Mount the file systems on /mnt/1 and /mnt/2.
- Run dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 * 1024)) to create a 100M large file on tlv1. Check that /mnt/2 is still empty. Estimate how much of the snapshot space is used and check again.
- Repeat the dd command 5 times and run lvs again. Explain why the used snapshot space did not increase.
- Unmount /mnt/1 and run lvremove tvg/tlv1, confirming the prompt.
- There is an lvconvert command which swaps the roles of an LV and its snapshot. Explain why this solves the "bad upgrade" problem outlined above.

The term "thin provisioning" is just a modern buzzword for over-subscription. Both terms mean to give the appearance of having more resources than are actually available. This is achieved by on-demand allocation. On Linux, thin provisioning is implemented as a DM target called dm-thin. This code first made its appearance in 2011 and was declared stable two years later. These days it should be safe for production use.
The general problem with thin provisioning is of course that bad things happen when the resources are exhausted because the demand has increased before new resources were added. For dm-thin this can happen when users write to their allotted space, causing dm-thin to attempt allocating a data block from a volume which is already full. This usually leads to severe data corruption because file systems are not really prepared to handle this error case and treat it as if the underlying block device had failed. dm-thin does nothing to prevent this, but one can configure a low watermark. When the number of free data blocks drops below the watermark, a so-called dm-event will be generated to notify the administrator.
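With lvm2, the dmeventd daemon can go one step further and auto-extend the pool instead of merely reporting the event. A sketch of the relevant settings in the activation section of /etc/lvm/lvm.conf (the threshold and percentage values below are examples):

```
activation {
	# Auto-extend a thin pool when it becomes more than 70% full,
	# by 20% of its current size each time.
	thin_pool_autoextend_threshold = 70
	thin_pool_autoextend_percent = 20
}
```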
One highlight of dm-thin is its efficient support for an arbitrary depth of recursive snapshots, called dm-thin snapshots in this document. With the traditional snapshot implementation, recursive snapshots quickly become a performance issue as the depth increases. With dm-thin one can have an arbitrary subset of all snapshots active at any point in time, and there is no ordering requirement on activating or removing them.
The block devices created by dm-thin always belong to a thin pool which ties together two LVs called the metadata LV and the data LV. The combined LV is called the thin pool LV. Setting up a VG for thin provisioning is done in two steps: First the standard LVs for data and metadata are created. Second, the two LVs are combined into a thin pool LV. The second step hides the two underlying LVs so that only the combined thin pool LV is visible afterwards. Thin provisioned LVs and dm-thin snapshots can then be created from the thin pool LV with a single command.
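The two steps might look as follows. A sketch, run as root; the 5G data size and the 20G virtual size are examples, while the other names and sizes follow the exercises below:

```shell
lvcreate --size 5G --name tdlv tvg	# the data LV
lvcreate --size 500M --name tmdlv tvg	# the metadata LV
lvconvert --type thin-pool --poolmetadata tvg/tmdlv tvg/tdlv
# Only the combined thin pool LV tvg/tdlv is visible now. Create an
# over-subscribed thin LV with a 20G virtual size from it:
lvcreate --virtualsize 20G --thin --name oslv tvg/tdlv
```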
Another nice feature of dm-thin is external snapshots. An external snapshot is one where the origin for a thinly provisioned device is not a device of the pool. Arbitrary read-only block devices can be turned into writable devices by creating an external snapshot. Reads to an unprovisioned area of the snapshot will be passed through to the origin. Writes trigger the allocation of new blocks as usual with CoW. One use case for this is VM hosts which run their VMs on thinly-provisioned volumes but have the base image on some "master" device which is read-only and can hence be shared between all VMs.
Starting with the tvg VG, create and test a thin pool LV by performing the following steps. The "Thin Usage" section of lvmthin(7) will be helpful.

- Remove the tlv1 and tlv2 LVs.
- Create an LV named tdlv (thin data LV) and a 500M LV named tmdlv (thin metadata LV).
- Combine the two LVs into a thin pool LV with lvconvert. Run lvs -a and explain the flags listed below Attr.
- Create a thin LV named oslv (over-subscribed LV).
- Create a file system on oslv and mount it on /mnt.
- Write a loop of the form for ((i = 0; i < 50; i++)); do ... ; done so that each iteration creates a 50M file named file-$i and a snapshot named snap_oslv-$i of oslv.
- Activate one of the snapshots with lvchange -K and try to mount it. Explain what the error message means. Then read the "XFS on snapshots" section of lvmthin(7).
- Run lvs -a. Mount one snapshot (specifying -o nouuid) and run lvs -a again. Why did the free space decrease although no new files were written?
- Predict what lvs -a and df -h /mnt report. Then run the commands to confirm. Guess what happens if you try to create another 3G file? Confirm your guess, then read the section on "Data space exhaustion" of lvmthin(7).
All three implementations named in the title of this chapter are Linux block layer caches. They combine two different block devices to form a hybrid block device which dynamically caches and migrates data between the two devices with the aim of improving performance. One device, the backing device, is expected to be large and slow while the other one, the cache device, is expected to be small and fast.
The simplest setup consists of a single rotating disk and one SSD. The setup shown in the diagram at the left is realistic for a large server with redundant storage. In this setup the hybrid device (yellow) combines a raid6 array (green) consisting of many rotating disks (grey) with a two-disk raid1 array (orange) stored on fast NVMe devices (blue). In the simple setup it is always a win when I/O is performed from/to the SSD instead of the rotating disk. In the server setup, however, it depends on the workload which device is faster. Given enough rotating disks and a streaming I/O workload, the raid6 outperforms the raid1 because all disks can read or write at full speed.
Since block layer caches hook into the Linux block API described earlier, the hybrid block devices they provide can be used like any other block device. In particular, the hybrid devices are file system agnostic, meaning that any file system can be created on them. In what follows we briefly describe the differences between the three block layer caches and conclude with the pros and cons of each.
Bcache is a stand-alone stacking device driver which was included in the Linux kernel in 2013. According to the bcache home page, it is "done and stable". dm-cache and dm-writecache are device mapper targets included in 2013 and 2018, respectively, which are both marked as experimental. In contrast to dm-cache, dm-writecache only caches writes while reads are supposed to be cached in RAM. It has been designed for programs like databases which need low commit latency. Both bcache and dm-cache can operate in writeback or writethrough mode while dm-writecache always operates in writeback mode.
The DM-based caches are designed to leave the decision as to what data to migrate (and when) to user space while bcache has this policy built-in. However, at this point only the Stochastic Multiqueue (smq) policy for dm-cache exists, plus a second policy which is only useful for decommissioning the cache device. There are no tunables for dm-cache while all the bells and whistles of bcache can be configured through sysfs files. Another difference is that bcache detects sequential I/O and separates it from random I/O so that large streaming reads and writes bypass the cache and don't push cached randomly accessed data out of the cache.
bcache is the clear winner of this comparison because it is stable, configurable, and performs better at least on the server setup described above because it separates random and sequential I/O. The only advantage of dm-cache is its flexibility, since cache policies can be switched. But even this remains a theoretical advantage as long as only a single policy for dm-cache exists.
- Recall the TRIM command discussed earlier in this chapter. Most mkfs commands send this command to discard all blocks of the device. Discuss the implications when mkfs is run on a device provided by bcache or dm-cache.
This device mapper target provides encryption of arbitrary block devices by employing the primitives of the crypto API of the Linux kernel. This API provides a uniform interface to a large number of cipher algorithms which have been implemented with performance and security in mind.
The cipher algorithm of choice for the encryption of block devices is the Advanced Encryption Standard (AES), also known as Rijndael, named after the two Belgian cryptographers Rijmen and Daemen who proposed the algorithm in 1998. AES is a symmetric block cipher, that is, a transformation which operates on fixed-length blocks and which is determined by a single key for both encryption and decryption. The underlying algorithm is fairly simple, which makes AES perform well in both hardware and software. The key setup time and the memory requirements are excellent as well. Modern processors of all manufacturers include instructions to perform AES operations in hardware, improving speed and security.
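The symmetry is easy to demonstrate with the openssl(1) command line tool: the same passphrase-derived key both encrypts and decrypts. A sketch (the passphrase, plaintext and file name are examples; PBKDF2 derives the key from the passphrase):

```shell
printf 'attack at dawn' |
	openssl enc -aes-256-cbc -pbkdf2 -pass pass:secret -out /tmp/ct.bin
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:secret -in /tmp/ct.bin
# prints: attack at dawn
```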
According to the Snowden documents, the NSA has been doing research on breaking AES for a long time without being able to come up with a practical attack for 256 bit keys. Successful attacks invariably target the key management software instead, which is often implemented poorly, trading security for user-friendliness, for example by storing passwords weakly encrypted, or by providing a "feature" which can decrypt the device without knowing the password.
The exercises of this section ask the reader to encrypt a loop device with AES without relying on any third party key management software.
- Create a file of suitable size and fill it with random data. Could cat /dev/urandom do the same?
- Create a loop device /dev/loop0 from the file.
- A table for the dmsetup(8) command is a single line of the form start_sector num_sectors target_type target_args. Determine the correct values for the first three arguments to encrypt /dev/loop0.
- The target_args for the dm-crypt target are of the form cipher key iv_offset device offset. To encrypt /dev/loop0 with AES-256, cipher is aes, device is /dev/loop0 and both offsets are zero. Come up with an idea to create a 256 bit key from a passphrase.
- The create subcommand of dmsetup(8) creates a device from the given table. Run a command of the form echo "$table" | dmsetup create cryptdev to create the encrypted device /dev/mapper/cryptdev from the loop device.
- Create a file system on /dev/mapper/cryptdev, mount it and create the file passphrase containing the string "super-secret" on this file system.
- Unmount the cryptdev device and run dmsetup remove cryptdev. Run strings on the loop device and on the underlying file to see if either contains the string "super-secret" or the file name passphrase.
- Recreate the cryptdev device, but this time use a different (hence invalid) key. Guess what happens and confirm.
- Write a script which turns off terminal echo (stty -echo), reads a passphrase from stdin and combines the above steps to create and mount an encrypted device.
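One way to approach the key and table exercises: hash the passphrase with SHA-256, which yields exactly 256 bits in hex form, and compute the sector count from the size of the backing file. A sketch (file name, file size and passphrase are examples; only the final dmsetup step, shown as a comment, needs root):

```shell
truncate -s 100M disk.img		# example backing file for the loop device
passphrase=super-secret
key=$(printf '%s' "$passphrase" | sha256sum | awk '{print $1}')	# 64 hex digits = 256 bits
num_sectors=$(( $(stat -c %s disk.img) / 512 ))	# dm tables count 512-byte sectors
table="0 $num_sectors crypt aes $key 0 /dev/loop0 0"
echo "$table"
# As root: losetup /dev/loop0 disk.img && echo "$table" | dmsetup create cryptdev
```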
/*
 * Write an endless stream of cryptographically secure random bytes to
 * stdout. Link with -lcrypto.
 */
#include <openssl/rand.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned char buf[1024 * 1024];

	for (;;) {
		/* RAND_bytes() returns 1 on success. */
		int ret = RAND_bytes(buf, sizeof(buf));
		if (ret <= 0) {
			fprintf(stderr, "RAND_bytes() error\n");
			exit(EXIT_FAILURE);
		}
		ret = write(STDOUT_FILENO, buf, sizeof(buf));
		if (ret < 0) {
			perror("write");
			exit(EXIT_FAILURE);
		}
	}
	return 0;
}