History and Philosophy

Early Unix History

Unix was created in 1969 as a successor of Multics, the MULTiplexed Information and Computing Service, which had been in use since the mid-1960s as the successor of CTSS, the Compatible Time-Sharing System of the early 1960s. Multics aimed to get CTSS right, but failed in this regard and was eventually discontinued because of its complexity. The Unix approach was very different: it was brainstormed by only three people and then implemented by Ken Thompson at Bell Laboratories in two days. Unlike its predecessors, it focused on elegance and simplicity. The name was originally spelt UNICS (UNiplexed Information and Computing Service) to emphasize the contrast to Multics.

The original Unix implementation was written in assembly language for the 18-bit processor of the PDP-7 "minicomputer", a device the size of a wardrobe which was considered small by the standards of the day. Like all computers of this era, the PDP-7 was not connected to a video screen. Instead, input had to be typed in on the console, a device which looked much like an electric typewriter. Output from the computer was printed on rolls of paper. Since the assembly instructions could not easily be ported to different hardware, Dennis Ritchie invented the C programming language in 1971. By 1973 the complete Unix implementation had been rewritten in C. The C language was another cornerstone in the history of Unix and turned out to be very successful. While other programming languages of that time have long been abandoned or now play merely a niche role, C is still one of the most widely used programming languages today. The first Unix application was roff, a typesetting program which is still ubiquitous today because the manual pages which ship with every Unix system are formatted with roff.

From the beginning, Thompson and his early collaborators encouraged close communication between programmers, creating an early form of community. Up to the present day, this "hacker culture" has been a stimulus for countless improvements. Copies of Unix were distributed on tapes, hand-signed with "Love, Ken". Over time many universities contributed to Unix. By the end of the 1970s, Unix had accumulated a rich set of utilities that made it a full-featured operating system which was also free of any copyright claims.

Despite the primitive hardware of the time, the early Unix was remarkably similar to modern Linux systems. For example, the task scheduler, the hierarchical filesystem tree and the shell already existed back then.

Networking

The Advanced Research Projects Agency (ARPA) was a military research unit that was part of the United States Department of Defense. It was established in the early 1960s with the mandate to create systems that could survive a nuclear war. The agency created the arpanet, the predecessor of today's internet, which was designed to stay operational even after parts of the network were lost. By the late 1960s and early 1970s, the fundamental networking protocols had been established: telnet for remote login was standardized in 1969, email (SMTP) in 1971, and the file transfer protocol (FTP) in 1973.

By the end of the 1970s many Unix installations existed in all parts of the world. However, the arpanet was mostly powered by commercial Multics systems because Unix only had rudimentary network support (UUCP, the Unix to Unix copy), which could copy files over telephone lines via modems but not much more. This changed in 1983 when TCP/IP networking was developed under ARPA to replace the original arpanet protocols. Unix support for TCP/IP was developed at the University of California, Berkeley, which had become the "Mecca" of Unix development and had already started in 1977 to release its own Unix system named BSD, the Berkeley Software Distribution.

Commercialization, POSIX and GNU

With excellent networking support and no licensing issues, it was only a matter of time until companies became interested in Unix in order to make money. Several companies started to commercialize Unix by adding features to the common code base but keeping their improvements closed, effectively stopping the source code from being freely distributable. At the same time Microsoft began to sell their DOS operating system, targeting small businesses and the home market. DOS lacked many features that Unix had already offered for a decade, like multi-tasking and multi-user support, but it did run on the cheap Intel 286 processors that were too weak for Unix.

By 1985 the commercialization of Unix and the success of Microsoft had damaged the Unix community badly. However, the various companies that sold their own proprietary Unix variants also realized that too many incompatible Unix implementations would only hurt their business. This is where the Computer Society of the Institute of Electrical and Electronics Engineers (IEEE) became involved in Unix. The Computer Society, whose roots date back to 1946, aims to advance the theory, practice, and application of computer technology. This organization created POSIX, the Portable Operating System Interface for Unix, which is a family of specifications for maintaining compatibility between operating systems. The first version of POSIX was published in 1988. It covered several command line utilities including vi(1) and awk(1), the shell scripting language, application programmer interfaces (APIs) for I/O (input/output) and networking, and more. Up to the present day POSIX is maintained by the IEEE and new revisions of the POSIX standard are published regularly.

In 1983 Richard Stallman launched the GNU project and the Free Software Foundation as a reaction to the ongoing commercialization of Unix. GNU, which is a recursive acronym for "GNU's not Unix", aimed to keep the Unix source code free, or to replace non-free parts with open source equivalents. To this end the GNU project created the GNU General Public License (GPL), which requires not only the source code to stay free, but also that all subsequent modifications to the code base remain free. By the end of the 1980s, the GNU toolset had become a complete developer software stack licensed under the GPL. This set of software packages was complemented by the X window system, which was also released under a free license and enabled programmers to build graphical applications for desktop systems. Moreover, the first open source scripting language, perl, was released in 1987.

Linux

In 1985 Intel announced the 386 processor which, unlike its 286 predecessor, was powerful enough to run Unix. There were efforts to port the Unix operating system kernel to this hardware, but these efforts were impaired by pending lawsuits about who owned the copyright on the BSD source code. Due to the unclear legal situation of the BSD code, the major missing piece in the GNU toolset was a free operating system kernel. This hole was filled in 1991 when Linus Torvalds, a student from Helsinki in Finland, announced the first version of his Linux kernel.

Linux did not repeat the licensing problems of the original Unix because the Linux source code was written from scratch and licensed under the GPL. Due to this difference many developers moved from Unix to Linux, so Linux grew quickly and soon started to outperform the commercial Unix kernels in almost every benchmark. The cheap 386 hardware, the Linux kernel, the GNU toolset and the graphical user interface based on the X window system made it possible to build cheap workstations which ran a complete open source software stack.

The success of Linux, or GNU/Linux as some prefer to call it for reasons that should now be clear, steadily increased over time. In 2003 the SCO Group, a company which sold a proprietary Unix system, was unhappy about this progress and sued IBM, which offered various Linux products. SCO claimed to be the owner of Unix, and that Linux contained "millions of lines" of code copied from Unix. SCO's lawyers argued that the success of Linux originated from this theft of intellectual property and asked for $5 billion as compensation for the resulting losses. The company also tried to collect licensing fees from other Linux users. Microsoft funded SCO in these efforts.

In the end SCO lost the lawsuit since it became evident that the allegedly copied code never existed. In fact, the court ruled that SCO did not even own the Unix copyrights to begin with. Another fun fact is that the large number of bugs in the early Linux code actually helped to prove that Linux was original work. The long-term effect of this lawsuit, an improved position for Linux and its ecosystem, persists to the present day. Commercial Unix systems have become irrelevant as Linux runs on a wide variety of machines ranging from supercomputers to workstations, smart phones and IoT (internet of things) devices with very limited resources.

While SCO went bankrupt eventually, some of the companies which almost killed Unix by maximizing their own profit still exist, and make money with Linux today. However, they had to adjust their business model in order to comply with the GPL. Rather than selling proprietary software, they bundle open source software and sell support to paying customers. Some companies also sell hardware with Linux pre-installed.

Linux Distributions

A Linux Distribution is a conglomeration of free software, including the Linux kernel, the GNU toolset and the X window system, plus possibly other, proprietary software on top of that. Usually a distribution also includes an installer and a package manager to make it easy to install and update packages according to the users' needs.

There are hundreds of Linux distributions, and new distributions are created all the time while others are discontinued. Many distributions are backed by companies which target specific classes of users or hardware, but there are also non-commercial Linux distributions which are solely driven by a community of volunteers.

One of the most popular company-backed Linux distributions is Ubuntu, which has been led since 2004 by the UK-based Canonical Ltd. It targets inexperienced desktop users who would like to switch away from Microsoft Windows. One reason for the popularity of Ubuntu is that it is very easy to install on standard desktop and laptop hardware. A distinguishing feature of Ubuntu is its strict release cycle: new versions are released in April and October of each year, and every fourth release is a long-term support (LTS) release which will be supported for at least five years. Ubuntu also features a variant for server hardware which contains a different Linux kernel and ships without most of the desktop packages.

The main community-driven Linux distribution is Debian. The Debian project was founded in 1993 and the first stable version was released in 1996. Debian is used as the basis for many other distributions. In fact, Ubuntu is based on Debian. The development of Debian closely follows the Unix culture in that it is developed openly and distributed freely. A team of about 1000 core developers work together with countless package maintainers according to the Debian Social Contract, the Debian Constitution, and the Debian Free Software Guidelines.

Exercises

Characteristics of a Unix system

After having briefly reviewed the history of Unix, we now look closer at the various components which comprise a Unix system and which distinguish Unix from other operating systems. We focus on general design patterns that have existed since the early Unix days and are still present on recent Linux systems.

Single Hierarchy of Files

The most striking difference between Unix and Windows is perhaps that on Unix the files of all devices are combined to form a single hierarchy with no concept of drive letters. When the system boots, there is only one device, the root device, which contains the root directory. To make the files of other devices visible, the mount operation must be employed. This operation attaches the file hierarchy of the given device to the existing hierarchy at a given location which is then called the mountpoint of the device. Mountpoints are thus the locations in the hierarchy where the underlying storage device changes.

The root directory contains a couple of well-known subdirectories, each of which is supposed to contain files of a certain type or for a certain purpose. The following table lists a subset:

/bin   Essential commands
/boot  Files required to boot the system
/dev   Device nodes
/etc   System-wide configuration files
/home  User home directories
/lib   Essential shared libraries
/tmp   Temporary files
/usr   Most installed software and its documentation
/var   Variable data such as logs and spool directories

The Filesystem Hierarchy Standard describes the various subdirectories in more detail. The exercises ask the reader to become acquainted with this directory structure.

POSIX Commands and Shell

The Filesystem Hierarchy Standard lists /bin and /sbin and several other directories for executable files. The POSIX standard defines which executables must exist in one of these directories for the system to be POSIX-compliant. Well over 100 POSIX commands are listed in the XCU volume of this standard. Besides the names of the commands, the general behaviour of each and the set of command line options and their semantics are described. POSIX versions are designed with backwards compatibility in mind. For example, a new POSIX version might require a command to support additional command line options, but existing options are never dropped and never change semantics in incompatible ways. The target audience of the POSIX document are programmers who implement and maintain the POSIX commands and users who want to keep their software portable across different Unix flavors.

One of the POSIX commands is the shell, /bin/sh, an interpreter that reads input expressed in the shell command language, which is also part of POSIX. The shell transforms the input in various ways to produce commands and then executes these commands. The user may enter shell code (i.e., code written in the shell command language) interactively at the command prompt, or supply the input for the shell as a shell script, a text file which contains shell code. Shell scripts which only contain POSIX commands and use only POSIX options are portable between different shell implementations and between different Unix flavors. They should therefore never cease to work after an upgrade. Among the many available POSIX shell implementations, GNU bash is one of the more popular choices. Bash is fully POSIX compatible and offers many more features on top of what is required by POSIX.

Several implementations of the POSIX commands exist. On Linux the GNU implementation is typically installed while FreeBSD, NetBSD and MacOS contain the BSD versions. Although all implementations are POSIX-compliant, they differ considerably because different implementations support different sets of additional features and options which are not required by POSIX. These extensions are not portable, and should thus be avoided in shell scripts that must work on different Unix flavors.

In addition to the POSIX commands, a typical Unix system might well contain thousands of other commands. This illustrates another aspect that is characteristic for Unix: Tools should do only one specific task, and do it well. The operating system provides mechanisms to combine the simple commands in order to form more powerful programs. For example, commands can be chained together so that the output of one command becomes the input for the next command in the chain. This is the idea behind pipes, a Unix concept which dates back to 1973 and which is also covered by POSIX. We shall come back to pipes and related concepts in a later section.

Multi-User, Multi-Tasking, Isolation

From the very beginning Unix was designed to be a multi-user and a multi-tasking operating system. That is, it could run multiple programs on behalf of different users independently of each other and isolated from each other. This design was chosen to improve hardware utilization and robustness. In contrast, DOS and early versions of Windows were designed for personal computing (PC) and had no notion of user accounts, access permissions or isolation. This resulted in an unstable system because a single misbehaving program was enough to take down the whole system. Therefore these features had to be retrofitted later.

While multi-tasking makes all tasks appear to run simultaneously even if there are more tasks than CPUs, isolation refers to memory protection, a mechanism which prevents applications from interfering with each other and with the internals of the operating system. A running Unix system maintains two sets of running tasks: besides the application tasks there is also a set of kernel tasks. Unlike the application tasks, the kernel tasks are privileged in that they can access the memory of the application tasks while application tasks can only access their own memory. Isolation is achieved by a hardware concept called protection domains, which existed already in Multics and thus predates Unix. In the simplest case, there are only two protection domains: a privileged domain called ring 0 for the kernel tasks, and an unprivileged domain for application tasks, also called user processes in this context. The CPU is always aware of the current protection domain as this information is stored in a special CPU register.

System Calls and the C POSIX Library

Only when the CPU is running in the privileged ring 0 domain (also known as kernel mode, as opposed to user mode for application tasks) can it interact directly with hardware and memory. If an application wants to access hardware, for example read a data block from a storage device, it cannot do so by itself. Instead, it has to ask the operating system to perform the read operation on behalf of the application. This is done by issuing a system call. Like function calls, system calls interrupt the current program, continue execution at a different address and eventually return to the instruction right after the call. However, in addition to this, they also cause the CPU to enter kernel mode so that it can perform the privileged operation. When the system call has done its work and is about to return to the application, the protection domain is changed again to let the CPU re-enter user mode.

The system calls thus define the interface between applications and the operating system. For backwards compatibility it is of utmost importance that system calls never change semantics in an incompatible way. Moreover, system calls must never be removed because this would again break existing applications. The syntax and the semantics of many system calls are specified in POSIX, although POSIX does not distinguish between functions and system calls and refers to both as system functions. This is because system calls are typically not performed by the application directly. Instead, if an application calls, for example, read(), it actually calls the compatibility wrapper for the read system call which is implemented as a function in the C POSIX Library (libc), which ships with every Unix system. It is this library which does the hard work, like figuring out which system calls are supported on the currently running kernel, and how kernel mode must be entered on this CPU type. Like the POSIX commands, the system functions described in POSIX never change in incompatible ways, so programs which exclusively use POSIX system functions are portable between different Unix flavors and stay operational after an upgrade.

Multi-Layer Configuration Through Text Files

On a multi-user system it becomes necessary to configure programs according to each user's personal preferences. The Unix way to achieve this is to provide four levels of configuration options for each program. First, there are the built-in defaults which are provided by the author of the program. Next, there is the system-wide configuration that is controlled by the administrator. Third, there is the user-defined configuration, and finally there are the command line options. Each time the program is executed, the four sets of configuration options are applied one after another so that the later sets of options override the earlier settings. The system-wide configuration is stored in /etc while the user-defined configuration is stored in that user's home directory. Both are simple text files that can be examined and modified with any text editor. This makes it easy to compare two configurations and to transfer the configuration across different machines or user accounts.

Everything is a File

Another mantra which is often heard in connection with Unix is everything is a file. This phrase, while certainly catchy, is slightly incorrect. A more precise version would be everything is controlled by a file descriptor, or, as Ritchie and Thompson stated it, Unix has compatible file, device, and inter-process I/O. Modern Unix systems have pushed this idea further and employ file descriptors also for networking, process management, system configuration, and for certain types of events. The file descriptor concept is thus an abstraction which hides the differences between the objects the file descriptors refer to. It provides a uniform interface for the application programmer who does not need to care about whether a file descriptor refers to a file, a network connection, a peripheral device or something else because the basic I/O operations like open, read, write are the same.

File descriptors are ubiquitous since every Unix program uses them, albeit perhaps implicitly via higher-level interfaces provided by a scripting language. We shall return to this topic when we discuss processes.

Manual Pages

All POSIX commands and most other programs are installed along with one or more man pages (short for manual pages), which are plain text files that can be formatted and displayed in various ways. This concept was introduced in 1971 as part of the Unix Programmer's Manual. The characteristic page layout and the typical sections (NAME, SYNOPSIS, DESCRIPTION, EXAMPLES, SEE ALSO) of a man page have not changed since then. The POSIX man command is used to view man pages in a terminal. For example, the command man ls opens the man page of the ls command, and man man shows the man page of the man command itself. Most implementations also maintain a database of the existing man pages and provide additional commands to query this database. For example, the whatis command prints the one-line description of all man pages which match a pattern while the apropos command searches the manual page names and descriptions.

In addition to the man pages for commands, there are man pages for system calls, library functions, configuration files and more. Each man page belongs to one of several man sections. For example, the aforementioned man pages for ls and man are part of section 1 (user commands) while section 2 is reserved for system calls and section 8 for administration commands that can only be executed by privileged users. By convention, to indicate which section a command or a function belongs to, the man section is appended in parentheses as in mount(8). Most Unix systems also offer translated man pages for many languages as an optional package. Note that the same name may refer to more than one man page. For example, there is kill(1) for the user command that kills processes and also kill(2) which describes the corresponding system call. To open the man page of a specific section, one may use a command like man 2 kill. The MANSECT environment variable can be set to a colon-delimited list of man sections to change the order in which the man sections are searched.

Consulting the local man pages rather than searching the web has some advantages. Most importantly, the local pages will always give correct answers since they always match the installed software while there is no such relationship between a particular web documentation page and the version of the software package that is installed on the local computer. Working with man pages is also faster, works offline and helps the user to stay focused on the topic at hand.

Exercises

Homework

Think about printers, sound cards, or displays as a file. Specifically, describe what open, read, and write should mean for these devices.

Solution

Opening would establish a (probably exclusive) connection to the device. Reading from the file descriptor returned by open(2) could return all kinds of status information, like the type, model and capabilities of the device. For example, printers could return the number of paper trays, the amount of toner left etc. Writing to the file descriptor would cause output on the device: printing the text that is written, playing the audio samples, or showing the given text on the display. The point to take away is that the open, read, write interface is a generic concept that works for different kinds of devices, not only for storing data in a file on a hard disk.

Paths, Files and Directories

In this section we look in some detail at paths, at a matching language for paths, and at the connection between paths and files. We then describe the seven Unix file types and how file metadata are stored. We conclude with the characteristics of soft and hard links.

Paths

The path concept was introduced in the 1960s with the Multics operating system. Paths will be familiar to the reader because they are often specified as arguments to commands. Also many system calls receive a path argument. A path is a non-empty string of path components which are separated by slash characters. An absolute path is a path that starts with a slash, all other paths are called relative. A relative path has to be interpreted within a context that implies the leading part of the path. For example, if the implied leading part is /foo/bar, the relative path baz/qux is equivalent to the absolute path /foo/bar/baz/qux.

Given a path, there may or may not exist a file or a directory that corresponds to the path. Path lookup is the operation which determines the answer to this question, taking the implied leading part into account in case of relative paths. This operation is always performed within the kernel and turns out to be surprisingly complex due to concurrency and performance issues. Consult path_resolution(7) on a Linux system to learn more about how pathnames are resolved to files.

If a path was successfully looked up, each path component up to the second-last refers to an existing directory while the last component refers to either a file or a directory. In both cases the directory identified by the second-last component contains an entry named by the last component. We call those paths valid. The valid paths give rise to a rooted tree whose interior nodes are directories and whose leaf nodes are files or directories. Note that the validity of a path depends on the set of existing files, not just on the path itself, and that a valid path may become invalid at any time, for example if a file is deleted or renamed. Many system calls which receive a path argument perform path lookup and fail with the No such file or directory error if the lookup operation fails.

It depends on the underlying filesystem whether the path components are case-sensitive or case-insensitive, that is, whether paths which differ only in capitalization (for example foo and Foo) refer to the same file. Since the hierarchy of files may be comprised of several filesystems, some components of the path may be case-sensitive while others are case-insensitive. As a rule of thumb, Unix filesystems are case-sensitive while Microsoft filesystems are case-insensitive even when mounted on a Unix system.

Path components may contain every character except the Null character and the slash. In particular, space and newline characters are allowed. However, while dots are allowed in path components if they are used together with other characters, the path components . and .. have a special meaning: every directory contains two subdirectories named . and .. which refer to the directory itself and its parent directory, respectively.

Globbing

Globbing, also known as pathname expansion, is a pattern matching language for paths which was already present in the earliest Unix versions. The glob operation generates a set of valid paths from a glob pattern by replacing the pattern by all matching paths.

Glob patterns may contain special characters called wildcards. The wildcard characters are:

*       Matches any string, including the empty string.
?       Matches any single character.
[...]   Matches any single one of the characters enclosed in the brackets.

The complete syntax rules for glob patterns and the exact semantics for pattern matching are described in POSIX and in glob(7). Any POSIX-compliant shell performs globbing to construct the command to be executed from the line entered at the prompt. However, POSIX also demands system functions which make globbing available to other applications. These are implemented as part of libc.
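To illustrate this library interface, here is a minimal C sketch which expands the (arbitrarily chosen) pattern *.txt with glob(3) and prints the resulting paths:

#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t gl;
    int ret = glob("*.txt", 0, NULL, &gl); /* expand the pattern */

    if (ret == GLOB_NOMATCH) {
        puts("no match");
        return 0;
    }
    if (ret != 0) {
        fprintf(stderr, "glob() failed\n");
        return 1;
    }
    for (size_t i = 0; i < gl.gl_pathc; i++)
        puts(gl.gl_pathv[i]);          /* one matching path per line */
    globfree(&gl);                     /* release the memory allocated by glob() */
    return 0;
}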

There are a few quirks related to globbing which are worth pointing out. First, if no valid path matches the given pattern, the expansion of the pattern is, by definition according to POSIX, the pattern itself. This can lead to unexpected results. Second, files which start with a dot (so-called hidden files) must be matched explicitly. For example, rm * does not remove these files. Third, the tilde character is not a wildcard, although it is also expanded by the shell. See the exercises for more examples.

POSIX globbing has some limitations. For example, there is no glob pattern which matches exactly those files whose names consist of one or more a characters. To overcome these limitations, some shells extend the matching language by implementing extended glob patterns which are not covered by POSIX. For example, if the extended globbing feature of bash(1) is activated via the extglob option, the extended glob pattern +(a) matches the above set of files.

File Types

We have seen that all but the last component of a valid path refer to directories while the last component may refer to either a file or a directory. The first character in the output of ls -l indicates the type of the last path component: for directories a d character is shown while files (also called regular files in this context) get a hyphen character (-). Besides directories and regular files, the following special file types exist:
Soft link (l)
A file which acts as a pointer to another file. We shall cover links in a dedicated subsection below.
Device node (c and b)
Also called device special. These files refer to devices on the local system. Device nodes come in two flavors: character devices (c) and block devices (b). Regardless of the flavor, each device node has a major and a minor number associated with it. The major number indicates the type of the device (e.g. a hard drive, a serial connector, etc.) while the minor number enumerates devices of the same type. On most systems the device nodes are created and deleted on the fly as the set of connected devices changes, for example due to a USB device being added or removed. However, device nodes can also be created manually with the mknod(1) command or the mknod(2) system call. Device nodes do not necessarily correspond to physical devices. In fact, POSIX demands the existence of a couple of virtual devices with certain properties. We look at some of these in the exercises. The access to device nodes which do correspond to physical devices is usually restricted to privileged users.
Socket (s)
Sockets provide an interface between a running program and the network stack of the kernel. They are subdivided into address families which correspond to the various network protocols. For example, the AF_INET and AF_INET6 address families are for internet protocols (IP) while AF_LOCAL (also known as AF_UNIX) is used for communication between processes on the same machine. These local sockets are also called Unix domain sockets. They can be bound to a path which refers to a file of type socket. Regardless of the address family, processes can exchange data via sockets in both directions, but the local sockets support additional features, like passing process credentials to other processes.
Fifo (p)
Files of type fifo are also known as named pipes. They associate a path with a kernel object that provides a First In, First Out data channel for user space programs. Data written to the fifo by one program can be read back by another program in the same order. Fifos are created with the mkfifo(1) command or the mkfifo(3) library function.

Note that the type of a file is never inferred from the path. In particular the suffix of the path (everything after the last dot) is just a convention and has no strict connection to the file type. Also there is no difference between text and binary files.
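The file type is encoded in the st_mode field of the metadata record described in the next subsection. As a minimal sketch, the following C program prints the type of the path given on the command line; it uses lstat(2), a variant of stat(2) which does not follow soft links, so that links are reported as such:

#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct stat st;
    const char *type;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }
    if (lstat(argv[1], &st) < 0) {       /* do not follow soft links */
        perror("lstat");
        return 1;
    }
    if (S_ISREG(st.st_mode)) type = "regular file (-)";
    else if (S_ISDIR(st.st_mode)) type = "directory (d)";
    else if (S_ISLNK(st.st_mode)) type = "soft link (l)";
    else if (S_ISCHR(st.st_mode)) type = "character device (c)";
    else if (S_ISBLK(st.st_mode)) type = "block device (b)";
    else if (S_ISSOCK(st.st_mode)) type = "socket (s)";
    else if (S_ISFIFO(st.st_mode)) type = "fifo (p)";
    else type = "unknown";
    printf("%s: %s\n", argv[1], type);
    return 0;
}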

Metadata and Inodes

The stat(2) system call returns the metadata of the file or directory that corresponds to the given path. The stat(1) command is a simple program which executes this system call and prints the metadata thus obtained in human-readable form, including the file type. This is done without looking at the contents of the file because metadata are stored in a special area called the inode. All types of files (including directories) have an associated inode. Besides the file type, the inode stores several other properties prescribed by POSIX. For example the file size, the owner and group IDs, and the access permissions are all stored in the inode. Moreover, POSIX requires three timestamps to be maintained in each inode:

atime   The time of last access, updated when the file contents are read.
mtime   The time of last modification, updated when the file contents change.
ctime   The time of last status change, updated when the inode itself is modified.

To illustrate the difference between the mtime and the ctime, consider the chgrp(1) command which changes the group ID of the file or directory identified by its path argument. This command sets the ctime to the current time while the mtime is left unmodified. On the other hand, commands which modify the contents of a file, such as echo foo >> bar, change both the mtime and the ctime.
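These metadata can also be obtained from a program by calling stat(2) directly. The following sketch prints the size, the owner and group IDs and the three timestamps of the path given as its argument; running it before and after a chgrp(1) shows that only the ctime changes.

#include <sys/stat.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) < 0) {
        perror("stat");
        return 1;
    }
    printf("size: %lld bytes\n", (long long)st.st_size);
    printf("owner/group: %u/%u\n", (unsigned)st.st_uid, (unsigned)st.st_gid);
    printf("atime: %s", ctime(&st.st_atime));  /* last access */
    printf("mtime: %s", ctime(&st.st_mtime));  /* last content modification */
    printf("ctime: %s", ctime(&st.st_ctime));  /* last inode (status) change */
    return 0;
}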

The inode of each file or directory contains twelve mode bits, nine of which are the permission bits which control who is allowed to access the file or directory, and how. The permission bits are broken up into three classes called user (u), group (g) and others (o). Some texts refer to the first and last class as "owner" and "world" instead, but we won't use this naming to avoid confusion. Each class contains three bits. The bits of the "user" class apply to the file owner, that is, the user whose ID is stored in the inode. The "group" category applies to all non-owners who belong to the group whose ID is stored in the inode. The third category applies to all remaining users. The three bits of each class refer to read/write/execute permission. They are therefore named r, w and x, respectively. The permission bits mean different things for directories and non-directories, as described below.

r
Directories: The permission to list the directory contents. More precisely, this bit grants the permission to call opendir(3) to obtain a handle to the directory which can then be passed to readdir(3) to obtain the directory contents.
Non-directories: If read permission is granted, the open(2) system call does not fail with the permission denied error, provided the file is opened in read-only mode. The system call may fail for other reasons, though.

w
Directories: The permission to add or remove directory entries. That is, to create new files or to remove existing files. Note that write permission is not required for the file that is being removed.
Non-directories: Permission to open the file in write-only mode in order to perform subsequent operations like write(2) and truncate(2) which change the contents of the file. Non-directories are often opened with the intention to both read and write. Naturally, such opens require both read and write permissions.

x
Directories: The permission to search the directory. Searching a directory means to access its entries, either by retrieving inode information with stat(2) or by calling open(2) on a directory entry.
Non-directories: Run the file. This applies to binary executables as well as to text files which start with a shebang, #!, followed by the path to an interpreter. We shall cover file execution in more detail below.

To run the regular file /foo/bar/baz, search permission is needed for both foo and bar, and execute permission is needed for baz. Similarly, to open the regular file foo/bar for reading, we need execute permissions on the current working directory and on foo, and read permissions on bar.

A numeric permission mode is a three-digit octal number (each digit ranging from 0 to 7), where the digits correspond to the user, group and other classes described above, in that order. The value of each digit is derived by adding up the bits with values 4 (read), 2 (write), and 1 (execute). The following table lists all eight possibilities for each of the three digits.

Octal Value Symbolic Representation Meaning
0 --- no permissions at all
1 --x only execute permission
2 -w- only write permission
3 -wx write and execute permission
4 r-- only read permission
5 r-x read and execute permission
6 rw- read and write permission
7 rwx read, write and execute permission

The chmod(1) command changes the permission bits of the file identified by the path argument. For example, chmod 600 foo sets the permissions of foo to rw-------. Besides the octal values, chmod(1) supports symbolic notation to address the three classes described above: u selects the user class, g the group class, o the class of other users. The symbolic value a selects all three classes. Moreover, the letters r, w and x are used to set or unset the read, write and execute permission, respectively. The above command is equivalent to chmod u=rw,g=,o= foo. The + and - characters can be specified instead of = to set or unset specific permission bits while leaving the remaining bits unchanged. For example chmod go-rw foo turns off read and write permissions for non-owners.

Unprivileged users can only change the mode bits of their own files or directories while there is no such restriction for the superuser.
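The permission bits are stored in the same st_mode field which also encodes the file type. The following sketch prints the nine bits of a given path both as an octal number and in the symbolic notation of the table above:

#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct stat st;
    char sym[10] = "---------";
    /* bit masks for the user, group and other classes, in that order */
    const mode_t bits[9] = {
        S_IRUSR, S_IWUSR, S_IXUSR,
        S_IRGRP, S_IWGRP, S_IXGRP,
        S_IROTH, S_IWOTH, S_IXOTH
    };

    if (argc != 2 || stat(argv[1], &st) < 0) {
        perror("stat");
        return 1;
    }
    for (int i = 0; i < 9; i++)
        if (st.st_mode & bits[i])
            sym[i] = "rwx"[i % 3];     /* r, w or x depending on the position */
    printf("%03o %s\n", (unsigned)(st.st_mode & 0777), sym);
    return 0;
}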

Hard and Soft Links

Links make it possible to refer to identical files through different paths. They come in two flavors: hard and soft. Both types of links have advantages and disadvantages, and different limitations. We start with hard links because they already existed in the earliest Unix versions.

A file can have more than one directory entry that points to its inode. If two directory entries point to the same inode, they are said to be hard links of each other. The two entries are equivalent in that they refer to the same file. It is impossible to tell which of the two is the "origin" from which the "link" was created. Hard links are created with the link(2) system call or the ln(1) command. Both take two path arguments, one for the existing file and one for the directory entry to be created. The filesystem maintains in each inode a link counter which keeps track of the number of directory entries which point to the inode. The link(2) system call increases the link count while unlink(2) decrements the link count and removes the directory entry. If the decremented counter remains positive, there is still at least one other directory entry which points to the inode. Hence the file is still accessible through this other directory entry and the file contents must not be released. Otherwise, when the link counter reaches zero, the inode and the file contents may be deleted (assuming the file is not in use).
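The effect on the link counter can be observed with a short C program. The sketch below (the names foo and bar are arbitrary, and foo must not exist yet) creates a file, adds a hard link and prints the link count reported by stat(2) at each step:

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

static void show_nlink(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        printf("%s: %ld link(s)\n", path, (long)st.st_nlink);
}

int main(void)
{
    int fd = open("foo", O_CREAT | O_EXCL | O_WRONLY, 0644); /* create a new file */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);
    show_nlink("foo");              /* link count is 1 */
    if (link("foo", "bar") < 0)     /* create a second directory entry */
        perror("link");
    show_nlink("foo");              /* now 2: foo and bar point to the same inode */
    unlink("bar");                  /* decrement the link count again */
    show_nlink("foo");              /* back to 1 */
    unlink("foo");
    return 0;
}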

There are several issues with hard links. For one, hard links can not span filesystems. That is, the two path arguments for link(2) have to refer to files which reside on the same filesystem. Second, it is problematic to create hard links to directories. Early Unix systems allowed this for the superuser, but on Linux the attempt to hard-link a directory always fails. To address the limitations of hard links, soft links, also called symbolic links (or symlinks for short), were introduced. A soft link can be imagined as a special text file containing a single absolute or relative path, the target of the link. For relative paths the implied leading part is the directory that contains the link. A soft link is thus a named reference in the global hierarchy of files. Unlike hard links, the soft link and its target do not need to reside on the same filesystem, and there is a clear distinction between the link and its target. Soft links are created with symlink(2) or by specifying the -s option to the ln(1) command.

A soft link and its target usually point to different inodes. This raises the following question: Should system calls which receive a path argument that happens to be a soft link operate on the link itself, or should they follow (or dereference) the link and perform the operation on the target? Most system calls follow soft links, but some don't. For example, if the path argument to chdir(2) happens to be a soft link, the link is dereferenced and the working directory is changed to the target of the link instead. The rename(2) system call, however, does not follow soft links and renames the link rather than its target. Other system calls, including open(2), allow the caller to specify the desired behaviour by passing a flag to the system call. For yet others there is a second version of the system call to control the behaviour. For example, lstat(2) is identical to stat(2), but does not follow soft links.
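The difference is easy to demonstrate. Assuming a regular file named foo exists in the current directory, the following sketch creates a soft link to it and queries both stat(2) and lstat(2):

#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    struct stat st;

    if (symlink("foo", "foo-link") < 0)     /* foo-link -> foo */
        perror("symlink");
    if (stat("foo-link", &st) == 0)         /* follows the link */
        printf("stat:  size of target: %lld\n", (long long)st.st_size);
    if (lstat("foo-link", &st) == 0)        /* operates on the link itself */
        printf("lstat: %s, size %lld\n",
               S_ISLNK(st.st_mode) ? "soft link" : "not a soft link",
               (long long)st.st_size);      /* size = length of the target path */
    unlink("foo-link");
    return 0;
}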

It is possible for a soft link to refer to an invalid path. In fact, ln(1) and symlink(2) do not consider it an error if the target does not exist, and happily create a soft link which points to an invalid path. Such soft links are called dangling or broken. Dangling soft links also occur when the target file is removed or renamed, or after a mount point change.

Soft links may refer to other soft links. System calls which follow soft links must therefore be prepared to resolve chains of soft links to determine the file to operate on. However, this is not always possible because soft links can easily introduce loops into the hierarchy of files. For example, the commands ln -s foo bar; ln -s bar foo create such a loop. System calls detect this and fail with the Too many levels of symbolic links error when they encounter a loop.

Another issue with both soft and hard links is that there is no simple way to find all directory entries which point to the same path (soft links) or inode (hard links). The only way to achieve this is to traverse the whole hierarchy of files. This may be prohibitive for large filesystems, and the result is unreliable anyway unless the filesystems are mounted read-only.

Exercises

Homework

How many paths are there that refer to the same file?

Solution

Given the path /foo/bar, one may construct different paths which refer to the same file by inserting any number of /. or ../foo after the first component. For example, /foo/./bar and /foo/../foo/bar both refer to the same file. If relative paths have to be taken into account as well, even more paths can be constructed easily. Hence the answer is: arbitrarily many. This illustrates the fundamental difference between a path and a file. Paths can be mapped to files, but not the other way around. In particular, there is no such thing as "the list of paths which have changed since yesterday". The concept of hard and soft links complicates the situation further.

Homework

Given two paths, how can one tell if they refer to the same file?

Solution

Among other information, the metadata record of each file contains the so-called inode number, which uniquely identifies the file within the file system that contains the file. Therefore, if both paths are known to refer to files stored on the same file system, a comparison of the two inode numbers is sufficient to tell whether the two paths refer to the same file. The inode number can be obtained with the command ls -i. In the general case one additionally has to check that the two device IDs which identify the underlying file systems are also identical. Like the inode number, the device ID is part of the metadata of the file. It can be obtained by running stat(1).
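In C, this check boils down to comparing the st_dev and st_ino fields which stat(2) reports for the two paths. A minimal sketch:

#include <sys/stat.h>
#include <stdio.h>

/* Return 1 if the two paths refer to the same file, 0 if not, -1 on error. */
static int same_file(const char *p1, const char *p2)
{
    struct stat st1, st2;

    if (stat(p1, &st1) < 0 || stat(p2, &st2) < 0)
        return -1;
    return st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
}

int main(int argc, char **argv)
{
    if (argc != 3)
        return 1;
    switch (same_file(argv[1], argv[2])) {
    case 1: puts("same file"); break;
    case 0: puts("different files"); break;
    default: perror("stat");
    }
    return 0;
}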

Homework

Device nodes come in two flavors: Character and block devices. Explain the difference between the two device flavors.

Homework

How many possible permission modes exist for a file or directory on a Unix System?

Solution

There are nine permission bits that can be turned on and off independently. Hence we have 2^9=512 possibilities. When taking into account the three special bits (sticky, setuid, setgid), the number increases to 2^12=4096.

Homework

Explain each command of the script below. Show the arrangement of all files and links in a figure, drawing a directory as a circle and a file as a square. How many different paths exist for the file a? Discuss whether the question "What's the path of a given file?" makes sense.

Solution

(Figure: the directory foo contains the regular file a and the soft link testdir, which points back to foo.)
Since foo/a, foo/testdir/a, foo/testdir/testdir/a etc. all refer to the same file, there are infinitely many paths for the file a. Hence the question makes no sense: There is no such thing as the path to a file.

Homework

Recall that the path component .. refers to the parent directory. Give an example of a path to a directory where "parent directory" and "directory identified by the path obtained by removing the last path component" are different. Which of the two interpretations of .. does bash apply when you type cd .. at the bash prompt?

Homework

Is it possible to choose among all possible paths which refer to the same file a canonical path? That is, a shortest (counting characters) absolute path which does not contain any soft links?

Solution

The POSIX standard requires each Unix system library to provide the realpath() function which performs the following substitutions on the given path: First, the path to the current working directory is prepended if the given path is relative (does not begin with a slash). Second, symbolic links are replaced by their targets. Third, any occurrences of /. and foo/.. are removed. The path transformed in this way is returned by the function as the canonical path.
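The realpath() function is declared in stdlib.h. The following sketch canonicalizes the path given on the command line; passing NULL as the second argument is a POSIX.1-2008 feature which lets the function allocate the result buffer:

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char *canonical;

    if (argc != 2)
        return 1;
    canonical = realpath(argv[1], NULL);  /* NULL: let realpath() allocate the buffer */
    if (!canonical) {
        perror("realpath");
        return 1;
    }
    printf("%s\n", canonical);
    free(canonical);
    return 0;
}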

Although each path can be canonicalized in this way, not all paths which refer to the same file give rise to the same canonical path. For example, /tmp/foo and /tmp/bar could refer to regular files which are hard links of each other. In this case the paths refer to the same file, yet the paths are different and already canonicalized. The same can happen when a file system (or a subtree of it) is bind mounted. That is, the file system tree is visible at two or more locations in the global directory tree.

The message of this exercise is to convince the reader that it is incorrect to assume that two files are different because their paths are different.

Processes

A program consists of instructions and data stored in a regular file. A user process is an instance of a running program. This is in contrast to kernel processes which are created directly by the kernel and have no relationship to executable files. Since we shall only be concerned with user processes, we will refer to these as "processes" from now on. In this section we'll see how processes are created and removed. We will then take a closer look at the environment of a process and discuss how processes communicate with each other.

Process Tree, Zombies and Orphans

When the system boots, there is only one process, the init process, which is created by the kernel at the end of the boot sequence by executing /sbin/init. All other processes are created from existing processes by means of the fork(2) system call. The process which called fork(2) is said to be the parent of the newly created child. After fork(2) returns, both parent and child are executing independently of each other. Both processes may call fork(2) again to create further processes. This gives rise to a tree structure where the processes are the nodes of the tree with init being the root node. The edges of the tree describe the parent-child relationships.

If there are more processes than CPUs, not all processes can run simultaneously. It is the mission of the kernel's task scheduler to assign processes to CPUs and to perform context switches. That is, to take away the CPU from a running process in order to give another process the chance to run. The scheduler has to choose the duration of each process' time slice and it must pick the process to switch to when the time slice of the current process has expired or the process gives up the CPU voluntarily, for example because it is waiting for an I/O operation to complete. This is a non-trivial task at least for modern multi-CPU systems with non-uniform memory access (NUMA) where the memory access times depend on the memory location and the processor core. Things don't get easier if the CPU clock speed can vary and/or scheduling must be power-aware to optimize battery time. To make good decisions, some information has to be provided by the processes or by a system-wide policy. One elementary way to prioritize certain processes over others is via nice levels which we shall discuss below.

The normal way for a process to terminate is to call exit(3) after it has done its work. This function transfers an integer value, the exit status, to the kernel. The exit status can only be retrieved by the parent of the terminating process. To illustrate this concept, imagine an interactive shell which creates one child each time the user enters a command. The children are short-lived while the parent, the shell, stays around for much longer. During command execution the parent needs to wait for its child to terminate. It may then want to tell whether the child has terminated successfully. To achieve this, the parent calls one of the wait system calls (wait(2), waitpid(2), waitid(2)) which block until the child has terminated, then return the child's exit status. After the child has called exit(3) but before the parent has called one of the wait functions, the kernel needs to keep at least the exit status (and possibly further information) about the terminated child. During this time window the child has already terminated but a part of it still exists in kernel memory. Processes in this state are aptly called zombies.
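The interplay of fork(2), exit(3) and the wait functions can be demonstrated in a few lines of C. In the sketch below the child exits with the (arbitrary) status 42; until the parent has called waitpid(2), the terminated child remains a zombie:

#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {                       /* child */
        printf("child %d doing some work\n", (int)getpid());
        exit(42);                         /* transfer exit status 42 to the kernel */
    }
    /* parent */
    int status;
    if (waitpid(pid, &status, 0) < 0) {   /* blocks until the child terminates */
        perror("waitpid");
        return 1;
    }
    if (WIFEXITED(status))
        printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}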

Unlike in the shell scenario outlined above, a process might well have any number of children at the time it terminates. Its children then become orphans as they lose their parent. The kernel cannot simply remove the terminated process from the process tree because this would disconnect its orphaned children from the other processes in the tree, destroying the tree structure. To avoid this, orphans are reparented to init, that is, made children of the init process. This works because the init process never terminates.

There are several programs which show information about processes. The POSIX ps(1) command prints a list of processes. It has many options that control which processes are shown and how the output is formatted. Similar programs which are not covered by POSIX are pstree(1), top(1) and htop(1). The first of these shows the tree structure while the latter two provide a dynamic real-time view of the process tree. The exercises of this section invite the reader to become familiar with these essential programs.

File Execution

When a process calls fork(2), the newly created child process starts out as a copy of the calling process. However, the reason to create a new process is usually to let the child do something different than the parent. Therefore, fork(2) is often followed by a second system call which replaces the child process with a different program. There are several similar system calls which do this, with slight semantic differences. We refer to this family of system calls as the exec system calls.

All exec system calls receive a path argument from which they determine an executable file that contains the program to run. Linux and BSD store executables in Executable and Linkable Format (ELF). Executables are typically linked dynamically. That is, the dependent libraries (at least libc, but possibly many more) are loaded at runtime from separate files. This is in contrast to static linking which includes all dependencies in the executable, making the executable self-contained but also larger and harder to maintain. Regardless of the type of linking, when the program is loaded, it completely replaces the previously running program. Note that there can be more than one process at the same time which executes the same program.

Files in ELF format are called native executables because they contain machine instructions which can be executed directly by the CPU. Another type of executables are scripts, also known as interpreter files. Scripts are text files which start with the shebang (#!). They can not run directly but have to be interpreted at runtime by another program, the interpreter. Nevertheless, it is possible to execute a script as if it were a native executable: by passing the path to one of the exec system calls or by typing the path at the shell prompt. The exec system call recognizes that the given file is a script by investigating the first line, which contains the path to the interpreter after the shebang, optionally followed by options to the interpreter. The kernel then executes the interpreter rather than the script, passing the path to the script as an argument. For example, if /foo/bar is being executed, and the first line of this file is #!/bin/sh, the kernel executes /bin/sh /foo/bar instead. Popular interpreters besides /bin/sh include /bin/bash, /usr/bin/perl, /usr/bin/python and /usr/bin/awk.

File Descriptions and File Descriptors

The kernel must always be aware of the set of all objects which are currently in use. This set is often called the system-wide table of open files although not all entries refer to files. In fact, an entry may refer to any object that supports I/O operations, for example a network socket. Each entry is called a file description, which is a somewhat unfortunate term that was coined by POSIX. A file description records information about the object itself as well as the current state of the reference, including the file offset, if applicable, and the status flags which affect how future I/O operations are going to be performed through this reference.

The kernel maintains for each process an array of pointers to file descriptions. Each such pointer is a file descriptor. Unlike files and file descriptions, a file descriptor always corresponds to a process and is identified by a non-negative number, the index into the pointer array of that process. This index is returned by system calls like open(2) or socket(2). As far as user space programs are concerned, a file descriptor is synonymous with this integer. It can be regarded as an abstract handle that must be supplied to subsequent I/O operations like read(2) or write(2) to tell the system call the target object of the operation.

The shell automatically creates three file descriptors for each process which are identified by the integers 0, 1 and 2. They are called stdin, stdout, and stderr, which is short for standard input/output/error. It is possible, and in fact common, that all three file descriptors point to the same file description: the terminal device. Many command line tools read their input from stdin, write normal output to stdout, and error messages to stderr. For example, when the POSIX command cat(1) is invoked with no arguments, it reads data from stdin and writes the same data to stdout.
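The following sketch is a stripped-down version of cat(1): it copies its standard input to its standard output using nothing but read(2) and write(2) on the file descriptors 0 and 1. Because these descriptors refer to the terminal by default, running the program without redirections simply echoes back each line the user types.

#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    /* Copy stdin (fd 0) to stdout (fd 1) until end of file. */
    while ((n = read(0, buf, sizeof(buf))) > 0) {
        ssize_t done = 0;
        while (done < n) {                /* write(2) may write less than requested */
            ssize_t w = write(1, buf + done, n - done);
            if (w < 0)
                return 1;
            done += w;
        }
    }
    return n < 0 ? 1 : 0;
}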

Signals

Signals are another ancient Unix concept that dates back to the early 1970s and was standardized in POSIX long ago. This concept facilitates a rudimentary form of inter-process communication (IPC) between unrelated processes. Signals can be regarded as software interrupts that transmit a notification event from the sending process to the target process. The event is sent asynchronously, meaning that the interruption can happen at any point in the code flow.

It is fair to say that most non-trivial programs, including scripts, have to deal with signals. All major scripting languages (bash, python, perl, ruby, etc.) provide an API for signal handling. The interpreter of the scripting language ends up calling the POSIX system functions, so we will only look at these.

Signals are identified by name or by a numerical ID. For example, SIGINT (interrupt from keyboard) is the name for signal number 2. POSIX defines 31 standard signals plus at least eight real-time signals. The standard signals can be subdivided according to the origin of the signal as follows.

hardware related signals
These signals originate from hardware traps that force the CPU back into kernel mode. The kernel responds to the trap by generating a signal for the process that caused the trap. For example, a division by zero in a user space program triggers a hardware trap in the floating point unit (FPU) of the CPU. The kernel then generates the SIGFPE (floating-point exception) signal for the process. Another example for a signal that originates from a hardware trap is SIGSEGV (segmentation fault) which occurs when a process attempts to reference a memory address that has not been mapped (i.e., marked as valid) by the memory management unit (MMU) of the CPU.
kernel generated signals
Signals which originate from the kernel rather than from hardware. One example is SIGCHLD (child terminated), which is sent to the parent process when one of its child processes terminates. Another example is SIGWINCH (window resize), which is generated when the geometry of the controlling terminal of a process changes.
user-space generated signals
These signals originate from user space: a process, for example kill(1), calls raise(3) or kill(2) to instruct the kernel to generate a signal. Examples are SIGTERM, which issues a termination request, and SIGUSR1 and SIGUSR2, which are reserved for use by application programs.
The following signals are used frequently and deserve to be described explicitly. We refer to signal(7) for the full list of signals and their semantics.
SIGINT, SIGTERM and SIGKILL
All three signals terminate the process by default. SIGINT is generated for the foreground processes when the interrupt character (CTRL+C) is pressed in a terminal. For example, if CTRL+C is pressed while the shell pipeline find | wc is executing, SIGINT is sent to both processes of the pipeline. SIGTERM is the default signal for the kill(1) command. It requests the target process to run its shutdown procedure, if any, then terminate. SIGKILL instructs the kernel to terminate the target process immediately, without giving the process the chance to clean up. This signal can originate from a process or from the kernel in response to running out of memory. To keep the system working, the kernel invokes the infamous out of memory killer (OOM killer) which terminates one memory-hungry process to free some memory.
SIGSTOP, SIGTSTP and SIGCONT
SIGSTOP instructs the task scheduler of the kernel to no longer assign any CPU time to the target process until the process is woken up by a subsequent SIGCONT. SIGTSTP (stop typed at terminal) stops all foreground processes of a terminal session. It is generated when the stop character (CTRL+Z) is pressed in a terminal.

Processes may set the signal disposition of most signals to control what happens when the signal arrives. When no disposition has been set, the signal is left at its default disposition so that the default action is performed to deliver the signal. For most signals the default action is to terminate the process, but for others the default action is to ignore the signal. If the signal is neither ignored nor left at its default disposition, it is said to be caught by the process. To catch a signal the process must tell the kernel the address of a function, the signal handler, to call in order to deliver the signal. The set of signal dispositions of a process can thus be imagined as an array of function pointers with one pointer per possible signal. If the process catches the signal, the pointer points to the corresponding signal handler. A NULL pointer represents a signal that was left at its default disposition while the special value SIG_IGN indicates that the process explicitly asked to ignore this signal.
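
As an illustration, the following sketch installs a signal handler for SIGINT with sigaction(2) and explicitly ignores SIGQUIT. It is a minimal example rather than production code; in particular, error checking is omitted.

	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static volatile sig_atomic_t got_sigint;

	static void handle_sigint(int signum)
	{
		(void)signum;
		got_sigint = 1;	/* only async-signal-safe operations belong here */
	}

	int main(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = handle_sigint;	/* catch SIGINT */
		sigemptyset(&sa.sa_mask);
		sigaction(SIGINT, &sa, NULL);

		signal(SIGQUIT, SIG_IGN);	/* explicitly ignore SIGQUIT */

		while (!got_sigint)
			pause();		/* sleep until a signal arrives */
		printf("caught SIGINT, exiting\n");
		return 0;
	}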

Signals can also be blocked and unblocked. When a signal is generated for a process that has it blocked, it remains pending. Pending signals cause no action as long as the signal remains blocked but the action will be performed once the signal gets unblocked. SIGKILL and SIGSTOP can not be caught, blocked, or ignored.

Real-time signals are similar to SIGUSR1 and SIGUSR2 in that they have no predefined meaning but can be used for any purpose. However, they have different semantics than the standard signals and support additional features. Most importantly, real-time signals are queued, meaning that in contrast to standard signals the same real-time signal can be pending multiple times. Also, the sending process may pass an accompanying value along with the signal. The target process can obtain this value from its signal handler along with additional information like the PID and the UID of the process that sent the signal.
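
The sketch below, which queues a real-time signal to itself, shows how a value can be passed along with the signal and retrieved from the siginfo_t structure in an SA_SIGINFO style handler. The value 42 is arbitrary.

	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static volatile sig_atomic_t received;

	static void rt_handler(int sig, siginfo_t *si, void *ucontext)
	{
		(void)sig;
		(void)ucontext;
		received = si->si_value.sival_int;	/* the accompanying value */
	}

	int main(void)
	{
		struct sigaction sa;
		union sigval val = { .sival_int = 42 };

		memset(&sa, 0, sizeof(sa));
		sa.sa_flags = SA_SIGINFO;		/* request the extended handler */
		sa.sa_sigaction = rt_handler;
		sigemptyset(&sa.sa_mask);
		sigaction(SIGRTMIN, &sa, NULL);

		sigqueue(getpid(), SIGRTMIN, val);	/* queue the signal plus value */
		printf("received value: %d\n", (int)received);
		return 0;
	}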

Some system calls, including read(2) and write(2), may block for an indefinite time. For example, reading from a network socket blocks until data is available. What should happen when a signal arrives while the process is blocked in a system call? There are two reasonable answers: either the system call is restarted, or the call fails with the EINTR (interrupted system call) error. Unfortunately, different flavors of Unix handle this case differently by default. However, applications may request either behavior by setting or clearing the SA_RESTART flag on a per-signal basis.
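
If SA_RESTART is not set, applications typically restart interrupted system calls themselves, as in the following sketch of a read(2) wrapper. The function name read_retry is made up.

	#include <errno.h>
	#include <unistd.h>

	/* Restart read(2) manually when it is interrupted by a signal.
	 * This mimics what the SA_RESTART flag would do automatically. */
	ssize_t read_retry(int fd, void *buf, size_t count)
	{
		ssize_t ret;

		do
			ret = read(fd, buf, count);
		while (ret < 0 && errno == EINTR);
		return ret;
	}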

Environment of a Process

Now that we have a rough understanding of processes we look closer at the information the kernel maintains for each process. We already discussed the array of file descriptors and the array of signal dispositions. Clearly both are process-specific properties. As we shall see, there is much more that constitutes the environment of a process.

Each process is identified by a unique process ID (PID), which is a positive integer. The init process is identified by PID 1. PIDs are assigned in ascending order, but are usually restricted to the range 1..32767. After this many processes have been created, PIDs wrap and unused PIDs are recycled for new processes. Thus, on a busy system on which processes are created and terminated frequently, the same PID is assigned to multiple processes over time.

Not all processes call fork(2) to create a child process, but each process except the init process has a unique parent. As described before, this is either the "real" parent (the process which created the process earlier) or the init process that "adopted" the orphaned process in case the real parent terminated before the child. The process ID of the parent (PPID) is thus well-defined. A process can obtain its PID and PPID with the getpid(2) and getppid(2) system calls.
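
For example, the following trivial program, when compiled and run from an interactive shell, prints its own PID and the PID of its parent, which in this case is the shell.

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* both calls always succeed and simply report process metadata */
		printf("PID: %ld, PPID: %ld\n", (long)getpid(), (long)getppid());
		return 0;
	}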

Each process runs on behalf of a user (possibly the superuser) which is identified by its user ID (UID) and belongs to one or more groups, identified by one or more group IDs (GIDs). The superuser is identified by UID zero. When we talked about the permission bits of files and directories, we said that suitable permissions are needed for system calls which operate on files (open(2), stat(2), etc.). A more precise statement is that the process which calls, say, open(2) needs to have these permissions. To decide this, the kernel needs to take into account the UID and GIDs of the process that called open(2), the UID and the GID stored in the inode of the file that is being opened, and the permission bits of this file. The UID is also taken into account for kill(2) because unprivileged processes (non-zero UID) can only send signals to processes which run on behalf of the same user while the superuser may target any process.

Each process has a current working directory (CWD) associated with it. When the user logs in, the CWD of the login shell process is set to the user's home directory, which should always exist and have the read, write and execute permission bits set for the user. The CWD can later be changed with chdir(2) and retrieved with getcwd(3). The CWD is used as the starting point for path searches for relative paths. It affects most system calls which receive a path argument. For example, if the CWD is /foo/bar and the relative path baz/qux is passed to open(2), the kernel will attempt to open the file which is identified by /foo/bar/baz/qux.
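
The sketch below changes the CWD of the calling process to /tmp and prints the new value. After the chdir(2) call, a relative path like foo passed to open(2) would refer to /tmp/foo.

	#include <limits.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[PATH_MAX];

		if (chdir("/tmp") < 0) {	/* change the CWD of this process */
			perror("chdir");
			return 1;
		}
		if (getcwd(buf, sizeof(buf)))	/* retrieve the new CWD */
			printf("CWD is now %s\n", buf);
		return 0;
	}

Note that the change only affects the calling process; the CWD of the shell from which the program was started remains unchanged.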

Many programs accept arguments to control their behavior. In addition to the path to the program that is to be executed, all variants of the exec system calls receive an array of arguments called the argument vector. For example, when the command ls -l foo is executed, the argument vector contains the two strings "-l" and "foo". Note that the argument vector is not part of the program but is tied to the process. It is passed to the main function at startup so that the program may evaluate it and act accordingly.
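
A minimal program that simply echoes its argument vector looks as follows; invoked with the arguments -l and foo it prints three lines, one per entry of the argument vector.

	#include <stdio.h>

	int main(int argc, char **argv)
	{
		int i;

		/* argv[0] is the program name, the remaining entries are the arguments */
		for (i = 0; i < argc; i++)
			printf("argv[%d] = %s\n", i, argv[i]);
		return 0;
	}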

Another way to pass information to a program is via environment variables. Environment variables are strings of the form name=value. POSIX describes the API to maintain the environment variables of a process. Environment variables are set with setenv(3) or putenv(3), the value of a variable can be retrieved with getenv(3), and a variable and its value can be deleted with unsetenv(3). The set of environment variables is sometimes called the environment of the process, although we use this term in a broader sense to describe the entirety of all metadata maintained by the kernel about the process, including but not limited to environment variables.
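
The following sketch exercises getenv(3), setenv(3) and unsetenv(3). The variable name MY_VAR is made up for illustration.

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		const char *home = getenv("HOME");	/* look up an existing variable */

		printf("HOME = %s\n", home ? home : "(unset)");

		setenv("MY_VAR", "hello", 1);		/* create or overwrite MY_VAR */
		printf("MY_VAR = %s\n", getenv("MY_VAR"));
		unsetenv("MY_VAR");			/* and delete it again */
		return 0;
	}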

Each process also has about a dozen resource limits that can be set and queried with the POSIX setrlimit(2) and getrlimit(2) functions. Each limit controls a different aspect. For example, RLIMIT_CPU limits the CPU time the process is allowed to use and RLIMIT_NOFILE controls how many open files it may have at a time. For each resource there is a soft and a hard limit. The kernel enforces the value stored as the soft limit. This value may be set by an unprivileged process to any value between zero and the hard limit. Unprivileged processes may also reduce (but not increase) their hard limits. Once a hard limit is reduced, it can not be increased any more. For RLIMIT_CPU a special convention applies: If the soft limit is reached, the kernel sends SIGXCPU (CPU time limit exceeded) to notify the process about this fact so that it can terminate orderly (e.g., remove temporary files). When the hard limit is reached, the kernel terminates the process as if it received SIGKILL.
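
The sketch below prints the soft and hard limit for RLIMIT_NOFILE and then raises the soft limit to the hard limit, which is always permitted for unprivileged processes.

	#include <stdio.h>
	#include <sys/resource.h>

	int main(void)
	{
		struct rlimit rl;

		getrlimit(RLIMIT_NOFILE, &rl);
		printf("open files: soft = %llu, hard = %llu\n",
			(unsigned long long)rl.rlim_cur,
			(unsigned long long)rl.rlim_max);

		rl.rlim_cur = rl.rlim_max;	/* raise the soft limit to the hard limit */
		if (setrlimit(RLIMIT_NOFILE, &rl) < 0)
			perror("setrlimit");
		return 0;
	}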

The nice level of a process provides a hint for the task scheduler to give the process a lower or higher priority relative to other processes. Nice levels range between -20 and 19. A high nice level means that the process wants to be nice to other processes, that is, should run with reduced priority. Low nice levels indicate important processes that should be prioritized over other processes. The default nice level is zero. Unprivileged users may set the nice level of new processes with the nice(1) command to any non-negative value. They can also increase the nice level of their existing processes with renice(1), but never decrease it. The superuser, however, may set the nice level of any process to an arbitrary value in the valid range.

The bulk of the properties discussed so far are inherited by the child after a fork(2). Specifically, the child gets the same array of file descriptors and signal dispositions as its parent, runs on behalf of the same user, has the same working directory, the same resource limits and nice level, and also the same set of environment variables with identical values. The PID and PPID, however, are different of course.

After a process has called an exec function to replace itself with a new program, its signal handlers no longer exist because they were part of the program code which has been replaced. Therefore the exec system calls reset the disposition of all signals that were caught to the default disposition. Signals that were being ignored keep being ignored, however.

The Process Filesystem

Although not covered by POSIX, at least Linux, NetBSD and FreeBSD provide information about processes via the process filesystem (procfs), which is usually mounted on /proc. The process filesystem is a pseudo-filesystem, i.e., it has no underlying storage device. Files and directories are faked by the kernel as they are accessed. Each process is represented by a numerical subdirectory of /proc which is named by the PID. For example, /proc/1 represents the init process. The aforementioned process utilities (ps(1), top(1), etc.) read the contents of the process filesystem in order to do their job.

Each /proc/[pid] directory contains the same set of files although this set is different between Unixes. These files expose much of the environment of the process to user space. The Linux procfs implementation provides text files named environ and limits which contain the current environment and the resource limits of the process, respectively. Moreover, the file descriptor array of each process is exposed in the files of the /proc/[pid]/fd directory. Linux and NetBSD (but not FreeBSD) also provide a cwd soft link which points to the current working directory of the process.

Pipes and Redirections

The pipe(2) system call creates two file descriptors for the calling process, returned through its array argument, which are tied together as a unidirectional first in, first out data channel. One file descriptor is the read end of the pipe, the other is the write end. Data written to the write end is buffered by the kernel and can be obtained by reading from the read end.

One application of pipes is communication between related processes. A process first creates a pipe, then calls fork(2) to create a child process. The child inherits a copy of both pipe file descriptors. Hence the parent process can communicate with the child by writing a message to the write end of the pipe for the child to read.
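
A sketch of this pipe(2)/fork(2) approach is shown below: the parent writes a short message into the pipe and the child reads it. Error checking is omitted for brevity.

	#include <stdio.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int pfd[2];
		char buf[64];

		pipe(pfd);		/* pfd[0]: read end, pfd[1]: write end */
		if (fork() == 0) {	/* child: read the message from the pipe */
			ssize_t n = read(pfd[0], buf, sizeof(buf) - 1);

			if (n > 0) {
				buf[n] = '\0';
				printf("child received: %s\n", buf);
			}
			_exit(0);
		}
		write(pfd[1], "hello, child", 12);	/* parent: send the message */
		wait(NULL);				/* wait for the child to finish */
		return 0;
	}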

This approach depends on file descriptor inheritance across fork(2), so it does not work in the situation where neither process is an ancestor of the other. Files of type fifo (named pipes) overcome this restriction. To establish a connection between two unrelated processes, both processes call open(2) to obtain a file descriptor which is associated with the fifo. One process passes the O_WRONLY flag to open the file for writing while the other passes O_RDONLY to open it for reading. The two processes may then communicate in the same way as with the pipe(2)/fork(2) approach.

The POSIX dup(2) and dup2(2) system calls allow a process to manipulate the entries of its file descriptor array. In particular the standard file descriptors 0, 1, and 2 can be replaced. By doing so before performing an exec system call, it can be arranged that the replacement program starts with, say, its stdout file descriptor redirected to the write end of a pipe. Note that the replacement program does not need any modifications for this to work, and might not even be aware of the fact that it is not writing its output to the terminal as usual.
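
For instance, output redirection to a file, as performed by the shell for a command like ls -l > listing.txt, can be sketched as follows. The file name listing.txt is arbitrary.

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("listing.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

		dup2(fd, 1);	/* replace stdout (fd 1) by the file */
		close(fd);	/* fd 1 still refers to the open file */
		execlp("ls", "ls", "-l", (char *)NULL);
		return 1;	/* only reached if the exec failed */
	}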

Shells employ this technique to implement the | operator which "pipes" the output of one command "into" another command. For example, the pipeline ls | wc works as follows: First the shell creates a pipe, then it calls fork(2) twice to create two processes which both get a copy of the two pipe file descriptors. The first process replaces its stdout file descriptor with the write end of the pipe and performs an exec system call to replace itself with the ls(1) program. The second process replaces its stdin file descriptor with the read end of the pipe and replaces itself with wc(1). Since ls(1) writes to stdout and wc(1) reads from stdin, wc(1) processes the output of ls(1).
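
The following sketch spells out these steps for the ls | wc example. As above, error checking is omitted.

	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int pfd[2];

		pipe(pfd);
		if (fork() == 0) {		/* first child: ls */
			dup2(pfd[1], 1);	/* stdout -> write end of the pipe */
			close(pfd[0]);
			close(pfd[1]);
			execlp("ls", "ls", (char *)NULL);
			_exit(1);
		}
		if (fork() == 0) {		/* second child: wc */
			dup2(pfd[0], 0);	/* stdin <- read end of the pipe */
			close(pfd[0]);
			close(pfd[1]);
			execlp("wc", "wc", (char *)NULL);
			_exit(1);
		}
		close(pfd[0]);			/* the parent needs neither end */
		close(pfd[1]);
		wait(NULL);			/* wait for both children */
		wait(NULL);
		return 0;
	}

Closing the unused pipe ends is essential: wc(1) only sees end-of-file on its stdin once the last write end of the pipe has been closed.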

Stdio

The POSIX standard requires a compliant Unix system to provide a collection of functions that let applications perform input and output by means of operations on streams. This programming interface, known as stdio for standard input/output, has been part of every Unix system since 1979. Every program which contains a printf(3) statement relies on stdio.

The stdio functions are implemented as part of libc on top of the open(2), read(2) and write(2) system calls which are implemented in the kernel. Roughly speaking, stdio replaces the file descriptor API by a more abstract API which centers around streams. A stream is an opaque data structure which comprises a file descriptor and an associated data buffer for I/O. Each program has three predefined streams which correspond to the three standard file descriptors (stdin, stdout and stderr). The stdio API contains well over 50 functions to create and maintain streams and to perform I/O on streams. These functions take care of the characteristics of the underlying file description. For example, they automatically try to select the optimal I/O buffer size.

Many applications rely on stdio because of convenience. For one, the buffers for read(2) and write(2) must be allocated and freed explicitly by the application, and care must be taken to not overflow these buffers. With stdio, this task is done by the stdio library. Second, formatted I/O is much easier to do with the stdio functions because the programmer only has to provide a suitable format string to convert between the machine representation and the textual representation of numbers. For example, by passing the format string "%f" to scanf(3), the programmer tells the stdio library to read a floating point number stored in textual representation from the specified stream, convert it to machine representation and store it in the given variable. The fprintf(3) function works the other way round: the value is converted from machine representation to text, and this text is written to the stream. Both functions can deal with various formats, like scientific notation for floating point numbers (e.g., 0.42E-23). With stdio it is easy to customize the formatted output, for example add leading zeros or select the number of decimal digits in the textual representation of a floating point number.
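
The conversion in both directions is illustrated by the sketch below, which reads one floating point number from stdin and writes it back in two different textual representations.

	#include <stdio.h>

	int main(void)
	{
		float x;

		if (scanf("%f", &x) != 1) {	/* text -> machine representation */
			fprintf(stderr, "no valid number\n");
			return 1;
		}
		/* machine representation -> text: leading zeros, then scientific notation */
		printf("fixed: %08.3f, scientific: %E\n", x, x);
		return 0;
	}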

Another reason why many programs rely on stdio is that it performs buffered I/O. With buffered I/O, not every read or write operation results in a system call. Instead, data read from or written to the stream is first stored in the user space buffer that is associated with the stream. This is a performance improvement for applications which perform many small I/O operations because every system call incurs some overhead. Buffers may be flushed explicitly by calling fflush(3), or implicitly by the stdio library. The criteria which cause an implicit flush depend on the buffering type of the stream. Three buffering types exist.

unbuffered
The stdio library does not buffer any reads or writes. Instead, each I/O operation results in a read(2) or write(2) system call. By default the stderr stream is unbuffered to display error messages as quickly as possible.
line buffered
System calls are performed when a newline character is encountered. This buffering type applies by default to interactive sessions where the file descriptor of the stream refers to a terminal device (as determined by isatty(3)).
fully buffered
System calls are performed only when the buffer of the stream runs empty (for reads) or full (for writes). By default, if the file descriptor refers to a regular file, the stream is fully buffered. POSIX requires that stderr is never fully buffered.
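
The effect of the buffering type can be observed with a sketch like the one below, which forces stdout into fully buffered mode with setvbuf(3). Nothing shows up on the terminal during the sleep; the text appears only when the buffer is flushed explicitly.

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* must be called before any other operation on the stream */
		setvbuf(stdout, NULL, _IOFBF, BUFSIZ);

		printf("this text sits in the stdio buffer");
		sleep(5);		/* nothing is written to fd 1 during the sleep */
		fflush(stdout);		/* explicit flush: now the write(2) takes place */
		return 0;
	}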

The exercises on stdio focus on the three different buffering types because this is a common source of confusion.

The Virtual Address Space of a Process

Isolation refers to the concept that each process gets its own virtual address space. A rough understanding of the memory management system and the layout of the virtual address space of a process helps to locate the source of common problems like the infamous segmentation fault error, and to realize that putatively simple questions such as "how much memory is my process currently using?" are in fact not simple at all, and need to be made more precise before they can be answered.

Virtual memory is an abstraction of the available memory resources. When a process reads from or writes to a memory location, it refers to virtual addresses (illustrated as the left box of the diagram). Virtual addresses are mapped by the MMU to physical addresses which refer to physical memory locations (right box). The mapped virtual address space of a process is a collection of ranges of virtual addresses which correspond to physical memory addresses (blue areas). By storing less frequently-accessed chunks of virtual memory (yellow) on the swap area (grey), applications can use more memory than is physically available. In this case the size of the valid virtual addresses (blue and yellow areas together) exceeds the amount of physical memory (orange and green areas). Any attempt to access an unmapped memory location (red and yellow areas) results in a page fault, a hardware trap which forces the CPU back into kernel mode. The kernel then checks whether the address is valid (yellow) or invalid (red). If it is invalid, the kernel sends SIGSEGV, which usually terminates the process with the segmentation fault error. Otherwise it allocates a chunk of unused physical memory, copies the chunk from the swap area to the newly allocated memory and adjusts the mapping (i.e., a yellow part becomes blue). The virtual memory concept increases stability and security because no process can access physical memory which belongs to the kernel or to other processes (orange areas).

We've already seen that the fork(2) system call creates a new process as a duplicate of the calling process. Since the virtual address space of the calling process (a) might be large and (b) is likely to be replaced in the child by a subsequent call to an exec function, it would be both wasteful and pointless to copy the full address space of the parent process to the child. To implement fork(2) efficiently, operating systems employ an optimization strategy known as Copy on Write (CoW). The idea of CoW is that if multiple callers ask for resources which are initially indistinguishable, you can give them pointers to the same resource. This fiction can be maintained until a caller tries to modify its copy of the resource, at which point a true private copy is created to prevent the changes from becoming visible to everyone else. The primary advantage is that if a caller never makes any modifications, no private copy ever needs to be created. The fork(2) system call marks the pages of the virtual address space of both the parent and the child process as CoW by setting a special bit in the page table entry which describes the mapping between virtual and physical addresses of the MMU. As for invalid memory accesses, the attempt to write to a CoW page results in a page fault that puts the CPU back into kernel mode. The kernel then allocates a new memory page on behalf of the process, copies the contents of the page which caused the fault, changes the page table mappings for the process accordingly and returns to user space. This all happens transparently to the process.

[Diagram: the virtual address space of a process, from address 0 at the bottom to 2^64 - 1 at the top. Bottom up: Text, Data, BSS, Heap (its upper end is the break point), empty space, Stack (its lower end is the base pointer), and the environment at the top.]

The diagram above illustrates the layout of the virtual address space of a process. At the top of the address space are the argument vector and the environment variables. The stack stores the local variables of the functions which are currently being called, plus house-keeping data like the return addresses of these functions. As more functions are called, the stack grows downwards towards the lower addresses. Its current lower end is called the base pointer. The other variable area of the address space is the heap, which contains the memory that has been allocated on behalf of the process, for example with malloc(3). As the process allocates more memory, the heap grows upwards, towards the stack. The current end of the heap is called the break point. The lower part of the address space contains three segments of fixed size. The text segment contains the compiled machine instructions of the executable, and the data segment contains the initialized variables whose values are already known at compile time. Finally, the BSS segment is allocated and zeroed at execution time. This segment contains variables which should be initialized to zero at startup. Unlike the data segment it is not stored in the executable. BSS stands for "Block Started by Symbol", which is a historic name coined in the 1950s. It has no relation to the real meaning of the segment.
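
The segments can be made visible with a small program like the sketch below, which prints one address per segment. The exact values differ from run to run because of address space layout randomization, but the relative order matches the diagram.

	#include <stdio.h>
	#include <stdlib.h>

	int initialized = 42;		/* data segment */
	int uninitialized;		/* BSS segment */

	int main(int argc, char **argv)
	{
		int local;			/* stack */
		void *dynamic = malloc(16);	/* heap */

		/* casting a function pointer to void * is not strictly portable,
		 * but works on the usual platforms */
		printf("text:  %p\n", (void *)main);
		printf("data:  %p\n", (void *)&initialized);
		printf("bss:   %p\n", (void *)&uninitialized);
		printf("heap:  %p\n", dynamic);
		printf("stack: %p\n", (void *)&local);
		printf("argv:  %p\n", (void *)argv);
		free(dynamic);
		return 0;
	}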

The exercises of this section invite the reader to look at the virtual address space of running processes to learn what happens when a dynamically-linked executable is being executed and how the resulting memory maps affect the virtual address space of the newly created process.

Exercises

Homework

Explain how PR_SET_CHILD_SUBREAPER works and possible use-cases for this (Linux) feature.

Homework

Explain in one paragraph of text the purpose of the file creation mask (also known as umask) of a process.

Homework

When we said that each process runs on behalf of a user and that the ID of this user is part of the process metadata, we were simplifying matters. There are actually three different UIDs and three different GIDs: the real UID, the effective UID, and the saved set-user-ID, and analogously for the group IDs. Explain the purpose of the three UIDs.

Homework

On a multi-CPU system the performance of a program can be enhanced by allowing for multiple flows of control. This is the idea behind threads, which are also called lightweight processes. Give an overview of threads, summarize the POSIX thread API (see pthreads(7)) and explain how the Linux-specific clone(2) system call can be used to implement threads.

Homework

Explain what the command find /etc > /dev/null does, and why you get some error messages. Assume you'd like to extract only those error messages which contain the string "lvm". Explain why find /etc > /dev/null | grep lvm does not work as expected. Come up with a similar command that works.

Solution

The command traverses the /etc directory recursively and prints all files and directories it encounters during the traversal to stdout. Since stdout is redirected to the null device by the > /dev/null construct, only the stderr stream containing the error messages makes it to the terminal. The error messages refer to subdirectories of /etc which cannot be traversed due to insufficient permissions (no "r" bit set). The proposed find | grep command does not work since the | operator is evaluated before any redirections specified by the find command take place. More precisely, stdout of the find process is redirected twice: first to one end of the pipe due to the |, then to the null device due to the > /dev/null. The last redirection "wins", so the grep process does not see any input. The command find /etc 2>&1 > /dev/null | grep lvm works. The following four redirections take place: First stdout of the find process and stdin of the grep process are redirected to the two ends of the pipe. Next, due to the 2>&1, the stderr stream of the find process is redirected to the current destination of stdout, i.e., to the pipe. Finally the > /dev/null redirects stdout of the find process to the null device. Hence error messages go to the pipe and are processed by grep as desired.

Homework

Run ulimit -n to see the maximal number of file descriptors you are allowed to create. Explain what this limit means with respect to multiple processes, multiple logins, and the fork(2) system call. Write a program in your language of choice which creates file descriptors in a loop until it fails due to the file descriptor limit. Then print the number of file descriptors the program was able to create.

Solution

On our systems the limit is set to 1024. This means a single process can only have this many files open at any given time. Independent processes (like those coming from different login sessions) have no common file descriptors, even though they may open the same files. In this sense the file descriptor limit is a per-process limit. However, when a process calls fork() to create a new process, the new process inherits all open file descriptors from the parent. This can lead to the situation where a newly created process is unable to open any files. This property was actually used to break computer security. The O_CLOEXEC flag was introduced not too long ago to deal with this problem. See open(2) for details. C program that opens the maximal possible number of file descriptors:
	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>

	int main(void)
	{
		int i;

		/* open /dev/null repeatedly until the file descriptor limit is hit */
		for (i = 0; open("/dev/null", O_RDONLY) >= 0; i++)
			;
		printf("opened %d file descriptors\n", i);
		exit(0);
	}

Homework

Search the web for the document called vm/overcommit-accounting. Discuss the pros and cons of the three possible overcommit handling modes.

Homework

Read this blog posting on the virtual memory overcommit issue. Explain the catch-22 situation described there in no more than two sentences.

Homework

Describe, in a single paragraph of text, what a virtual dynamic shared object (VDSO) is and which type of applications benefit most from it.

Homework

Describe the concept of huge pages and the Linux-specific implementation of transparent huge pages. Discuss the pros and cons of huge pages and explain the workloads which would benefit from a setup with huge pages enabled.

Supplements

cd_script

	#!/bin/sh
	echo "changing CWD to $1"
	cd "$1"

hello_world

	#!/bin/sh
	echo "hello world"
	#!/bin/sh
	mkdir foo
	touch foo/a
	ln -s ../foo foo/testdir
	ls -l foo/a foo/testdir/a foo/testdir/testdir/a

Further Reading