OS-Level Virtualization
Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it. -- Perlis's Programming Proverb #58 (1982)
cat /proc/cpuinfo
. Hint: svm and vmx.
Initially, there is only a single namespace of each type called the
root namespace. All processes belong to this namespace. The
clone(2)
system call is a generalization of the classic
fork(2)
which allows privileged users to create new
namespaces by passing one or more of the six CLONE_NEW*
flags. The child process is made a member of the new namespace. Calling
plain fork(2)
or clone(2)
with no
CLONE_NEW*
flag lets the newly created process inherit the
namespaces from its parent. There are two additional system calls,
setns(2)
and unshare(2)
which both
change the namespace(s) of the calling process without creating a
new process. For the latter, there is a user command, also called
unshare(1)
which makes the namespace API available to
scripts.
The /proc/$PID
directory of each process contains a
ns
subdirectory which contains one file per namespace
type. The inode number of this file is the namespace ID.
Hence, by running stat(1)
one can tell whether
two different processes belong to the same namespace. Normally a
namespace ceases to exist when the last process in the namespace
terminates. However, by opening /proc/$PID/ns/$TYPE
one can prevent the namespace from disappearing.
uname(2)
system
call which fills out the fields of a struct utsname
.
On return the nodename
field of this structure
contains the hostname which was set by a previous call to
sethostname(2)
. Similarly, the domainname
field
contains the string that was set with setdomainname(2)
.
UTS namespaces provide isolation of these two system identifiers. That
is, processes in different UTS namespaces might see different host- and
domain names. Changing the host- or domainname affects only processes
which belong to the same UTS namespace as the process which called
sethostname(2)
or setdomainname(2)
.
chroot(2)
system call which was introduced in 1979. Mount namespaces isolate
the mount points seen by processes so that processes in different
mount namespaces can have different views of the file system hierarchy.
Like for other namespace types, new mount namespaces are created by
calling clone(2)
or unshare(2)
. The
new mount namespace starts out with a copy of the caller's mount
point list. However, with more than one mount namespace the
mount(2)
and umount(2)
system calls no longer
operate on a global set of mount points. Whether or not a mount
or unmount operation has an effect on processes in different mount
namespaces than the caller's is determined by the configurable
mount propagation rules. By default, modifications to the list
of mount points only affect the processes which are in the same
mount namespace as the process which initiated the modification. This
setting is controlled by the propagation type of the
mount point. Besides the obvious private and shared types, there is
also the MS_SLAVE
propagation type which lets mount
and unmount events propagate from a "master" to its "slaves"
but not the other way round.
ip link set iface netns PID
. Processes only see interfaces
whose network namespace matches the one they belong to. This lets
processes in different network namespaces have different ideas about
which network devices exist. Each network namespace has its own IP
stack, IP routing table and TCP and UDP ports. This makes it possible
to start, for example, many sshd(8)
processes which
all listen on "their own" TCP port 22.
An OS-level virtualization framework typically leaves physical
interfaces in the root network namespace but creates a dedicated
network namespace and a virtual interface pair for each container. One
end of the pair is left in the root namespace while the other end is
configured to belong to the dedicated namespace, which contains all
processes of the container.
CLONE_NEWPID
flag to clone(2)
, the
child process gets some unused PID in the original PID namespace
but PID 1 in the new namespace.
As a consequence, processes in different PID namespaces can have the
same PID. In particular, there can be arbitrarily many "init" processes,
which all have PID 1. The usual rules for PID 1 apply within each PID
namespace. That is, orphaned processes are reparented to the init
process, and it is a fatal error if the init process terminates,
causing all processes in the namespace to terminate as well. PID
namespaces can be nested, but under normal circumstances they are
not. So we won't discuss nesting.
Since each process in a non-root PID namespace also has a PID in the
root PID namespace, processes in the root PID namespace can "see" all
processes but not vice versa. Hence a process in the root namespace can
send signals to all processes while processes in the child namespace
can only send signals to processes in their own namespace.
Processes can be moved from the root PID namespace into a child
PID namespace but not the other way round. Moreover, a process can
instruct the kernel to create subsequent child processes in a different
PID namespace.
unshare(2)
or clone(2)
.
The UID and GID of a process can be different in different
namespaces. In particular, an unprivileged process may have UID
0 inside a user namespace. When a process is created in a new
namespace or a process joins an existing user namespace, it gains full
privileges in this namespace. However, the process has no additional
privileges in the parent/previous namespace. Moreover, a certain flag
is set for the process which prevents the process from entering yet
another namespace with elevated privileges. In particular it does not
keep its privileges when it returns to its original namespace. User
namespaces can be nested, but we don't discuss nesting here.
Each user namespace has an owner, which is the effective user
ID (EUID) of the process which created the namespace. Any process
in the root user namespace whose EUID matches the owner ID has all
capabilities in the child namespace.
If CLONE_NEWUSER
is specified together with other
CLONE_NEW*
flags in a single clone(2)
or unshare(2)
call, the user namespace is guaranteed
to be created first, giving the child/caller privileges over the
remaining namespaces created by the call.
It is possible to map UIDs and GIDs between namespaces. The
/proc/$PID/uid_map
and /proc/$PID/gid_map
files
are used to get and set the mappings. We will only talk about UID
mappings in the sequel because the mechanism for the GID mappings is
analogous. When the /proc/$PID/uid_map
(pseudo-)file is
read, the contents are computed on the fly and depend on both the user
namespace to which process $PID
belongs and the user
namespace of the calling process. Each line contains three numbers
which specify the mapping for a range of UIDs. The numbers have
to be interpreted in one of two ways, depending on whether the two
processes belong to the same user namespace or not. All system calls
which deal with UIDs transparently translate UIDs by consulting these
maps. A map for a newly created namespace is established by writing
UID-triples once to one uid_map
file. Subsequent writes will fail.
/proc/$$/mounts
,
/proc/$$/mountinfo
, and /proc/$$/mountstats
.
/mnt
before the container is
started.
utc-ns.c
, a minimal C
program which illustrates how to create a new UTS namespace. Explain
each line of the source code.
ls -l /proc/$$/ns
to see the namespaces of
the shell. Run stat -L /proc/$$/ns/uts
and confirm
that the inode number coincides with the number shown in the target
of the link of the ls
output.
/proc/1/stat
to confirm.
pid-ns.c
program. Will the
two numbers printed as PID
and child PID
be the same? What will be the PPID number? Compile and run the program
to see if your guess was correct.
ip link show
. Start a second shell in a
different network namespace and confirm by running the same command
that no network interfaces exist in this namespace. In the original
namespace, set the namespace of one end of the pair to the process ID
of the second shell and confirm that the interface "moved" from one
namespace to the other. Configure (different) IP addresses on both ends
of the pair and transfer data through the ethernet tunnel between the
two shell processes which reside in different network namespaces.
ethtool -k iface
to find out which devices are network namespace local.
uid_map
file has
not been written, system calls like setuid(2)
which
change process UIDs fail. Why?
shmctl(2)
system call performs operations on a System V
shared memory segment. It operates on a shmid_ds
structure
which contains in the shm_lpid
field the PID of the process
which last attached or detached the segment. Describe the implications this API
detail has on the interaction between IPC and PID namespaces.
mkdir(2)
system call creates a new cgroup. To add a process to a cgroup
one must write its PID to one of the files in the pseudo file system.
We will cover both cgroup versions because as of 2018-11 many
applications still rely on cgroup-v1 and cgroup-v2 still lacks some
of the functionality of cgroup-v1. However, we will not look at
all controllers.
/dev/cpuset
. This file system is only kept for backwards
compatibility and is otherwise equivalent to the corresponding part of
the cgroup pseudo file system. The cpuset controller links subsets
of CPUs to cgroups so that the processes in a cgroup are confined to
run only on the CPUs of "their" subset.
The CPU controller of cgroup-v2, which is simply called "cpu", works
differently. Instead of specifying the set of admissible CPUs for a
cgroup, one defines the ratio of CPU cycles for the cgroup. Work to
support CPU partitioning like the cpuset controller of cgroup-v1 is in
progress and expected to be ready in 2019.
open(2)
and
mknod(2)
system calls and enforces the restrictions
defined in the device access whitelist of the cgroup the
calling process belongs to.
Processes in the root cgroup have full permissions. Other cgroups
inherit the device permissions from their parent. A child cgroup
never has more permissions than its parent.
Cgroup-v2 takes a completely different approach to device access
control. It is implemented on top of BPF, the Berkeley packet
filter. Hence this controller is not listed in the cgroup-v2
pseudo file system.
SIGSTOP/SIGCONT
to all processes, but avoids some problems
with corner cases. The v2 version was added in 2019-07. It is available
from Linux-5.2 onwards.
limit_in_bytes
.
The cgroup-v2 version of the memory controller is rather more complex
because it attempts to limit direct and indirect memory usage of
the processes in a cgroup in a bullet-proof way. It is designed to
restrain even malicious processes which try to slow down or crash
the system by indirectly allocating memory. For example, a process
could try to create many threads or file descriptors which all cause a
(small) memory allocation in the kernel. Besides several tunables and
statistics, the memory controller provides the memory.events
file whose contents change whenever a state transition
for the cgroup occurs, for example when processes start to get
throttled because the high memory boundary was exceeded. This file
could be monitored by a management agent to take appropriate
actions. The main mechanism to control the memory usage is the
memory.high
file.
mount -t cgroup none /var/cgroup
and
mount -t cgroup2 none /var/cgroup2
to mount both cgroup pseudo
file systems and explore the files they provide.
echo 0 > cpuset.mems && echo 0 > cpuset.cpus
. For v2: First activate controllers for the cgroup
in the parent directory.
stress -c 2
.
echo 1000000 1000000 > cpu.max
.
while :; do date; sleep 1; done
. Freeze
and unfreeze the cgroup by writing the string FROZEN
to a suitable freezer.state
file in the cgroup-v1 file
system. Then unfreeze the cgroup by writing THAWED
to the same file. Find out how one can tell whether a given cgroup
is frozen.
ddrescue /dev/sdX /dev/null
. Enforce a read bandwidth rate of 1M/s for the
device by writing a string of the form "$MAJOR:$MINOR $((1024 *
1024))"
to a file named blkio.throttle.read_bps_device
in the cgroup-v1 pseudo file system. Check that the bandwidth
was indeed throttled by running the above ddrescue
command again.
"$MAJOR:$MINOR rbps=$((1024 * 1024))"
to a file named
io.max
.
bash
, start a second
bash
process and print its PID with echo $$
.
Guess what happens if you run kill -STOP $PID; kill -CONT
$PID
from a second terminal, where $PID
is the PID that was printed in the first terminal. Try it out,
explain the observed behaviour and discuss its impact on the freezer
controller. Repeat the experiment but this time use the freezer
controller to stop and restart the bash process.
Containers provide resource management through control groups and
resource isolation through namespaces. A container platform
is thus a software layer implemented on top of these features. Given a
directory containing a Linux root file system, starting the container
is a simple matter: First clone(2)
is called with the
proper CLONE_NEW*
flags to create a new process in a suitable
set of namespaces. The child process then creates a cgroup for the
container and puts itself into it. The final step is to let the child
process hand over control to the container's /sbin/init
by calling execve(2)
. When the last process in the newly
created namespaces exits, the namespaces disappear and the parent
process removes the cgroup. The details are a bit more complicated,
but the above covers the essence of what the container startup command
has to do.
Many container platforms offer additional features not to be discussed here, like downloading and unpacking a file system image from the internet, or supplying the root file system for the container by other means, for example by creating an LVM snapshot of a master image. In this section we look at micoforia, a minimalistic container platform to boot a container from an existing root file system as described above.
The containers known to micoforia are defined in the single
~/.micoforiarc
configuration file whose format is
described in micoforia(8)
. The micoforia
command supports various subcommands to maintain containers. For
example, containers are started with a command such as micoforia
start c1
where c1
is the name of the
container. One can execute a shell running within the container
with micoforia enter c1
, log in to a local pseudo
terminal with micoforia attach c1
, or connect via ssh
with ssh c1
. Of course the latter command only works
if the network interface and the DNS record get configured during
container startup and the sshd package is installed. The container can
be stopped by executing halt
from within the container,
or by running micoforia stop c1
on the host system. The
commands micoforia ls
and micoforia ps
print information about containers and their processes.
The exercises ask the reader to install the micoforia package from source, and to set up a minimal container running Ubuntu Linux.
git://git.tuebingen.mpg.de/micoforia
and compile the
source code with ./configure && make
. Install with
make install
.
debootstrap --download-only --include isc-dhcp-client bionic /var/lib/micoforia/c1/ http://de.archive.ubuntu.com/ubuntu
.
micoforia(8)
. Consult the Link Layer section of the
chapter on networking if you would like to understand what you are
doing.
/etc/fstab
to configure
the cgroup filesystems, create corresponding directories and run
mount -a
to mount them.
none /var/cgroup cgroup devices 0 0
none /var/cgroup2 cgroup2 defaults 0 0
c1
with echo
container c1 > ~/.micoforiarc
. This container will have a
single network device and neither CPU nor memory isolation will be
enforced for the processes of the container.
micoforia -- start -F c1
.
micoforia stop c1
to stop the container,
edit ~/.micoforiarc
and add the following two lines to
configure memory and CPU limits.
cpu-cores c1:1
memory-limit c1:1
Start the container and run suitable commands which show that the newly configured limits are in effect.
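/* utc-ns.c: create a child process in a new UTS namespace */
#define _GNU_SOURCE
#include <sys/utsname.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the hostname as seen by the calling process, then exit. */
static void print_hostname_and_exit(const char *pfx)
{
    struct utsname uts;

    uname(&uts);
    printf("%s: %s\n", pfx, uts.nodename);
    exit(EXIT_SUCCESS);
}

/* Runs in the new UTS namespace. */
static int child(void *arg)
{
    sethostname("jesus", 5);
    print_hostname_and_exit("child");
    return 0; /* not reached */
}

#define STACK_SIZE (64 * 1024)
static char child_stack[STACK_SIZE];

int main(int argc, char *argv[])
{
    /* The stack grows downwards, so pass the end of the buffer. */
    if (clone(child, child_stack + STACK_SIZE, CLONE_NEWUTS, NULL) < 0) {
        perror("clone"); /* CLONE_NEWUTS requires CAP_SYS_ADMIN */
        exit(EXIT_FAILURE);
    }
    print_hostname_and_exit("parent");
}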
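/* pid-ns.c: create a child process in a new PID namespace */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* Runs in the new PID namespace. */
static int child(void *arg)
{
    printf("PID: %d, PPID: %d\n", (int)getpid(), (int)getppid());
    return 0; /* becomes the exit status of the child */
}

#define STACK_SIZE (64 * 1024)
static char child_stack[STACK_SIZE];

int main(int argc, char *argv[])
{
    pid_t pid = clone(child, child_stack + STACK_SIZE, CLONE_NEWPID, NULL);

    if (pid < 0) {
        perror("clone"); /* CLONE_NEWPID requires CAP_SYS_ADMIN */
        exit(EXIT_FAILURE);
    }
    printf("child PID: %d\n", (int)pid);
    exit(EXIT_SUCCESS);
}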