Unix Concepts

Unix is user-friendly. It's just very selective about who its friends are. -- Unknown
Unix was created in 1969 as a successor of Multics, the MULTiplexed Information and Computing Service, which had been in use since the mid 1960s as the successor of CTSS, the Compatible Time-Sharing System of the early 1960s. Multics aimed to get CTSS right, but failed in this regard and was eventually discontinued because of its complexity. The Unix approach was very different as it was brainstormed by only three people and then implemented by Ken Thompson at Bell Laboratories in two days. Unlike its predecessors it focused on elegance and simplicity. The name was originally spelt UNICS (UNiplexed Information and Computing Service) to emphasize the contrast to Multics.
The original Unix implementation was written in assembly language for the
18-bit processor of the PDP-7 "minicomputer", a device the size of a
wardrobe which was considered small by the standards of the day. Like all
computers of this era, the PDP-7 was not connected to a video screen.
Instead, input had to be typed in on the console, a device which looked
much like an electric typewriter. Output from the computer was printed on
rolls of paper. Since the assembly instructions could not easily be ported
to different hardware, Dennis Ritchie invented the C programming language
in 1971. By 1973 the complete Unix implementation had been rewritten in C.
The C language was another cornerstone in the history of Unix and turned
out to be very successful. While other programming languages of that time
have long been abandoned or play merely a niche role, C is still one of
the most widely used programming languages today. The first Unix
application was roff, a typesetting program which is still ubiquitous: the
manual pages which ship with every Unix system are formatted with roff.
From the beginning, Thompson and his early collaborators encouraged close communication between programmers, creating an early form of community. Up to the present day, this "hacker culture" has been a stimulus for countless improvements. Copies of Unix were distributed on tapes, hand-signed with "Love, Ken". Over time many universities contributed to Unix. By the end of the 1970s, Unix had accumulated a whole collection of utilities that made it a full-fledged operating system which was also free of any copyright claims.
Despite the primitive hardware of the time, the early Unix was remarkably similar to modern Linux systems. For example, the task scheduler, the hierarchical filesystem tree and the shell already existed back then.
The Advanced Research Projects Agency (ARPA) was a military research unit that was part of the US Department of Defense. It was established in the late 1950s with the mandate to create systems that could survive a nuclear war. The agency created the arpanet, the predecessor of today's internet, which was designed to stay operational even if parts of the network were lost. By the end of the 1960s and the early 1970s, the fundamental networking protocols were established: telnet for remote login was standardized in 1969, email followed in 1971, and the file transfer protocol (FTP) in 1973.
By the end of the 1970s many Unix installations existed in all parts of the world. However, the arpanet was mostly powered by commercial Multics systems because Unix only had rudimentary network support (UUCP, the Unix to Unix copy), which could copy files over telephone lines via modems but not much more. This changed in 1983 when ARPA's newly developed TCP/IP protocol suite replaced the original arpanet protocols. Unix support for TCP/IP was developed at the University of California at Berkeley, which had become the "Mecca" of Unix development and had already started in 1977 to release its own Unix system named BSD, the Berkeley Software Distribution.
With excellent networking support and no licensing issues, it was only a matter of time until companies became interested in Unix in order to make money. Several companies started to commercialize Unix by adding features to the common code base but keeping their improvements closed, effectively stopping the source code from being freely distributable. At the same time Microsoft began to sell their DOS operating system, targeting small businesses and the home market. DOS lacked many features that Unix already had for a decade, like multi-tasking and multi-user support, but it did run on the cheap Intel 286 processors that were too weak for Unix.
By 1985 the commercialization of Unix and the success of Microsoft
had damaged the Unix community badly. However, the various companies
that sold their own proprietary Unix brands also realized that
too many incompatible Unix implementations would only hurt their
business. That's where the Computer Society of the Institute of
Electrical and Electronics Engineers (IEEE) became involved in
Unix. The Computer Society, founded in 1946, aims to advance the
theory, practice, and application of computer technology. This
organization created POSIX, the Portable Operating
System Interface for Unix, which is a family of specifications
for maintaining compatibility between operating systems. The first
version of POSIX was published in 1988. It covered several command
line utilities including vi(1)
and awk(1)
,
the shell scripting language, application programmer interfaces (APIs)
for I/O (input/output) and networking, and more. Up to the present
day POSIX is maintained by the IEEE and new revisions of the POSIX
standard are published regularly.
In 1983 Richard Stallman launched the GNU project and the Free Software Foundation as a reaction to the ongoing commercialization of Unix. GNU, which is a recursive acronym for "GNU's not Unix", aimed to keep the Unix source code free, or to replace non-free parts by open source equivalents. To this end the GNU project created the GNU General Public License (GPL), which requires not only the source code to stay free, but also that all subsequent modifications to the code base remain free. By the end of the 80s, the GNU toolset had become a full developer software stack licensed under the GPL. This set of software packages was complemented by the X window system, which was also released under a free license and enabled programmers to build graphical applications for desktop systems. Moreover, the first open source scripting language, perl, was released in 1987.
In 1985 Intel announced the 386 processor which, unlike its 286 predecessor, was powerful enough to run Unix. There were efforts to port the Unix operating system kernel to this hardware, but these efforts were impaired by pending lawsuits about who owned the copyright on the BSD source code. Due to the unclear legal situation of the BSD code, the major missing piece in the GNU toolset was a free operating system kernel. This hole was filled in 1991 when Linus Torvalds, a student from Helsinki in Finland, announced the first version of his Linux kernel.
Linux did not repeat the licensing problems of the original Unix because the Linux source code was written from scratch and licensed under the GPL. Due to this difference many developers moved from Unix to Linux, so Linux grew quickly and soon started to outperform the commercial Unix kernels in almost every benchmark. The cheap 386 hardware, the Linux kernel, the GNU toolset and the graphical user interface based on the X window system facilitated cheap workstations which ran a complete open source software stack.
The success of Linux, or GNU/Linux as some prefer to call it for reasons that should now be clear, steadily increased over time. In 2003 the SCO group, a company which sold a proprietary Unix system, was unhappy about this progress and sued IBM, which offered various Linux products. SCO claimed to be the owner of Unix, and that Linux contained "millions of lines" of code copied from Unix. SCO's lawyers argued that the success of Linux originated from this theft of intellectual property and asked for $5 billion as compensation for the resulting losses. The company also tried to collect license fees from other Linux users. Microsoft funded SCO in these efforts.
In the end SCO lost the lawsuit since it was evident that the allegedly copied code never existed. In fact, the court ruled that SCO did not even own the Unix copyrights to begin with. Another fun fact is that the large number of bugs in the early Linux code actually helped to prove that Linux was original work. The long-term effect of this lawsuit, an improved position of Linux and its ecosystem, lasts to the present day. Commercial Unix systems have become irrelevant as Linux runs on a wide variety of machines ranging from supercomputers to workstations, smartphones and IoT (internet of things) devices with very limited resources.
While SCO went bankrupt eventually, some of the companies which almost killed Unix by maximizing their own profit still exist, and make money with Linux today. However, they had to adjust their business model in order to comply with the GPL. Rather than selling proprietary software, they bundle open source software and sell support to paying customers. Some companies also sell hardware with Linux pre-installed.
A Linux Distribution is a conglomeration of free software, including the Linux kernel, the GNU toolset and the X window system, plus possibly other, proprietary software on top of that. Usually a distribution also includes an installer and a package manager to make it easy to install and update packages according to the users' needs.
There are hundreds of Linux distributions, and new distributions are created all the time while others are discontinued. Many distributions are backed by companies which target specific classes of users or hardware, but there are also non-commercial Linux distributions which are solely driven by a community of volunteers.
One of the most popular company-backed Linux distributions is Ubuntu, which has been led since 2004 by the UK-based Canonical Ltd. It targets inexperienced desktop users who would like to switch away from Microsoft Windows. One reason for the popularity of Ubuntu is that it is very easy to install on standard desktop and laptop hardware. A distinguishing feature of Ubuntu is its strict release cycle: new versions are released in April and October of each year, and every fourth release is a long-term support (LTS) release which will be supported for at least five years. Ubuntu also features a variant for server hardware which contains a different Linux kernel and ships with most desktop packages excluded.
The main community-driven Linux distribution is Debian. The Debian project was founded in 1993 and the first stable version was released in 1996. Debian is used as the basis for many other distributions. In fact, Ubuntu is based on Debian. The development of Debian closely follows the Unix culture in that it is developed openly and distributed freely. A team of about 1000 core developers work together with countless package maintainers according to the Debian Social Contract, the Debian Constitution, and the Debian Free Software Guidelines.
Exercises:

- Run uname -a on various Unix machines to see the OS type and the kernel version.
- Run aptitude search python to list all python-related Ubuntu packages. Run aptitude show python3-biopython to see the description of the biopython package. Repeat with different search patterns and packages.

After having briefly reviewed the history of Unix, we now take a closer look at the various components which comprise a Unix system and which distinguish Unix from other operating systems. We focus on general design patterns that have existed since the early Unix days and are still present on recent Linux systems.
The most striking difference between Unix and Windows is perhaps that on Unix the files of all devices are combined to form a single hierarchy with no concept of drive letters. When the system boots, there is only one device, the root device, which contains the root directory. To make the files of other devices visible, the mount operation must be employed. This operation attaches the file hierarchy of the given device to the existing hierarchy at a given location which is then called the mountpoint of the device. Mountpoints are thus the locations in the hierarchy where the underlying storage device changes.
The root directory contains a couple of well-known subdirectories, each of which is supposed to contain files of a certain type or for a certain purpose. The following table lists a subset:
- /bin: Essential commands for all users
- /sbin: Essential system binaries
- /lib: Essential libraries
- /usr: Non-essential read-only user data
- /etc: Static configuration files
- /home: Home directories
- /tmp: Temporary files
- /run: Files which describe the state of running programs
- /var: Log and spool files

The Filesystem Hierarchy Standard describes the various subdirectories in more detail. The exercises ask the reader to become acquainted with this directory structure.
The Filesystem Hierarchy Standard lists /bin
and /sbin
and several other directories for executable
files. The POSIX standard defines which executables must exist in one
of these directories for the system to be POSIX-compliant. Well over
100 POSIX commands are listed in the XCU volume of this
standard. Besides the names of the commands, the general behaviour
of each and the set of command line options and their semantics are
described. POSIX versions are designed with backwards compatibility
in mind. For example, a new POSIX version might require a command
to support additional command line options, but existing options are
never dropped and never change semantics in incompatible ways. The
target audience of the POSIX document are programmers who implement
and maintain the POSIX commands, and users who want to keep their
software portable across different Unix flavors.
One of the POSIX commands is the shell,
/bin/sh
, an interpreter that reads input expressed in
the shell command language, which is also part of POSIX.
The shell transforms the input in various ways to produce commands
and then executes these commands. The user may enter shell code
(i.e., code written in the shell command language) interactively at
the command prompt, or supply the input for the shell as a
shell script, a text file which contains shell code. Shell
scripts which only contain POSIX commands and use only POSIX options
are portable between different shell implementations and between
different Unix flavors. They should therefore never cease to work after
an upgrade. Among the many available POSIX shell implementations,
GNU bash is one of the more popular choices. Bash is fully
POSIX compatible and offers many more features on top of what is
required by POSIX.
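As a small illustration, here is a sketch of a portable script which restricts itself to POSIX utilities and options; the directory argument and the wording of the output are of course arbitrary.

```sh
#!/bin/sh
# Count the regular files below the directory given as the first
# argument (default: the current directory), using POSIX utilities only.
dir=${1:-.}
count=$(find "$dir" -type f | wc -l)
printf '%s contains %s regular files\n' "$dir" "$count"
```

Because neither find -type f, wc -l nor printf relies on an extension, this script should behave the same under bash, dash or any other POSIX shell.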
Several implementations of the POSIX commands exist. On Linux the GNU implementation is typically installed while FreeBSD, NetBSD and MacOS contain the BSD versions. Although all implementations are POSIX-compliant, they differ considerably because different implementations support different sets of additional features and options which are not required by POSIX. These extensions are not portable, and should thus be avoided in shell scripts that must work on different Unix flavors.
In addition to the POSIX commands, a typical Unix system might well contain thousands of other commands. This illustrates another aspect that is characteristic for Unix: Tools should do only one specific task, and do it well. The operating system provides mechanisms to combine the simple commands in order to form more powerful programs. For example, commands can be chained together so that the output of one command becomes the input for the next command in the chain. This is the idea behind pipes, a Unix concept which dates back to 1973 and which is also covered by POSIX. We shall come back to pipes and related concepts in a later section.
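For instance, the following pipeline combines three small POSIX tools to count how often each login shell occurs in /etc/passwd; none of the tools knows anything about the others.

```sh
# Extract the seventh colon-separated field (the login shell), sort
# the values, then count how many times each distinct value occurs.
cut -d: -f7 /etc/passwd | sort | uniq -c
```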
From the very beginning Unix was designed to be a multi-user and a multi-tasking operating system. That is, it could run multiple programs on behalf of different users independently of each other and isolated from each other. This design was chosen to improve hardware utilization and robustness. In contrast, DOS and early versions of Windows were designed for personal computing (PC) and had no notion of user accounts, access permissions or isolation. This resulted in an unstable system because a single misbehaving program was enough to take down the whole system. Therefore these features had to be retrofitted later.
While multi-tasking makes all tasks appear to run simultaneously even if there are more tasks than CPUs, isolation refers to memory protection, a mechanism which prevents applications from interfering with each other and with the internals of the operating system. A running Unix system maintains two sets of running tasks: besides the application tasks there is also a set of kernel tasks. Unlike the application tasks, the kernel tasks are privileged in that they can access the memory of the application tasks while application tasks can only access their own memory. Isolation is achieved by a hardware concept called protection domains, which already existed in Multics and thus predates Unix. In the simplest case, there are only two protection domains: a privileged domain called ring 0 for the kernel tasks, and an unprivileged domain for application tasks, also called user processes in this context. The CPU is always aware of the current protection domain as this information is stored in a special CPU register.
The CPU can interact directly with hardware and memory only when it is running in the privileged ring 0 domain (also known as kernel mode, as opposed to user mode for application tasks). If an application wants to access hardware, for example read a data block from a storage device, it cannot do so by itself. Instead, it has to ask the operating system to perform the read operation on behalf of the application. This is done by issuing a system call. Like function calls, system calls interrupt the current program, continue execution at a different address and eventually return to the instruction right after the call. However, in addition to this, they also cause the CPU to enter kernel mode so that it can perform the privileged operation. When the system call has done its work and is about to return to the application, the protection domain is changed again to let the CPU re-enter user mode.
The system calls thus define the interface between applications
and the operating system. For backwards compatibility it is of utmost
importance that system calls never change semantics in an incompatible
way. Moreover, system calls must never be removed because this would
again break existing applications. The syntax and the semantics
of many system calls are specified in POSIX, although POSIX does
not distinguish between functions and system calls and refers to
both as system functions. This is because system calls are
typically not performed by the application directly. Instead, if an
application calls, for example, read()
, it actually calls
the compatibility wrapper for the read
system call which
is implemented as a function in the C POSIX Library (libc),
which ships with every Unix system. It is this library which does the
hard work, like figuring out which system calls are supported on the
currently running kernel, and how kernel mode must be entered on this
CPU type. Like the POSIX commands, the system functions described in
POSIX never change in incompatible ways, so programs which exclusively
use POSIX system functions are portable between different Unix flavors
and stay operational after an upgrade.
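On a Linux system, the non-POSIX strace(1) utility makes this layering visible: it reports the system calls which the libc wrappers eventually perform on behalf of a program. A possible invocation (the exact set of system calls shown will vary between systems) is:

```sh
# Trace only the open/read/write family of system calls made by cat(1).
# On current Linux kernels the open() wrapper typically results in the
# openat system call.
strace -e trace=openat,read,write cat /etc/hostname
```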
On a multi-user system it becomes necessary to configure programs
according to each user's personal preferences. The Unix way to
achieve this is to provide four levels of configuration options
for each program. First, there are the built-in defaults which are
provided by the author of the program. Next, there is the system-wide
configuration that is controlled by the administrator. Third,
there is the user-defined configuration, and finally there are the
command line options. Each time the program is executed, the four
sets of configuration options are applied one after another so
that the later sets of options override the earlier settings. The
system-wide configuration is stored in /etc
while the
user-defined configuration is stored in that user's home directory.
Both are simple text files that can be examined and modified with
any text editor. This makes it easy to compare two configurations
and to transfer the configuration across different machines or user
accounts.
Another mantra which is often heard in connection with Unix is everything is a file. This phrase, while certainly catchy, is slightly incorrect. A more precise version would be everything is controlled by a file descriptor, or, as Ritchie and Thompson stated it, Unix has compatible file, device, and inter-process I/O. Modern Unix systems have pushed this idea further and employ file descriptors also for networking, process management, system configuration, and for certain types of events. The file descriptor concept is thus an abstraction which hides the differences between the objects the file descriptors refer to. It provides a uniform interface for the application programmer who does not need to care about whether a file descriptor refers to a file, a network connection, a peripheral device or something else because the basic I/O operations like open, read, write are the same.
File descriptors are ubiquitous since every Unix program uses them, albeit perhaps implicitly via higher-level interfaces provided by a scripting language. We shall return to this topic when we discuss processes.
All POSIX commands and most other programs are installed along
with one or more man pages (short for manual pages),
which are plain text files that can be formatted and displayed in
various ways. This concept was introduced in 1971 as part of the
Unix Programmer's Manual. The characteristic page layout
and the typical sections (NAME, SYNOPSIS, DESCRIPTION, EXAMPLES,
SEE ALSO) of a man page have not changed since then. The POSIX
man
command is used to view man pages in a terminal. For
example, the command man ls
opens the man page of the
ls
command, and man man
shows the man page
of the man
command itself. Most implementations also
maintain a database of the existing man pages and provide additional
commands to query this database. For example, the whatis
command prints the one-line description of all man pages which match
a pattern while the apropos
command searches the manual
page names and descriptions.
In addition to the man pages for commands, there are man pages for
system calls, library functions, configuration files and more. Each
man page belongs to one of several man sections. For example,
the aforementioned man pages for ls
and man
are part of section 1 (user commands) while section 2 is reserved for
system calls and section 8 for administration commands that can only be
executed by privileged users. By convention, to indicate which section
a command or a function belongs to, the man section is appended in
parentheses, as in mount(8)
. Most Unix systems also offer
translated man pages for many languages as an optional package. Note
that the same name may refer to more than one man page. For example
there is kill(1)
for the user command that kills processes
and also kill(2)
which describes the corresponding system
call. To open the man page of a specific section, one may use a command
like man 2 kill
. The MANSECT
environment
variable can be set to a colon-delimited list of man sections to
change the order in which the man sections are searched.
Consulting the local man pages rather than searching the web has some advantages. Most importantly, the local pages will always give correct answers since they always match the installed software while there is no such relationship between a particular web documentation page and the version of the software package that is installed on the local computer. Working with man pages is also faster, works offline and helps the user to stay focused on the topic at hand.
Exercises:

- Run df on as many systems as possible to see the mount points of each filesystem. Then discuss the pros and cons of a single file hierarchy as opposed to one hierarchy per device.
- Run ls / to list all top-level subdirectories of the root file system and discuss the purpose of each. Consult the Filesystem Hierarchy Standard if in doubt.
- Run cd / && mc and start surfing at the root directory.
- Examine the directory /usr/local/bin. Run ls /usr/local/bin to list this directory.
- Look up the POSIX documentation of the cp(1) command online and compare the set of options with the options supported by the GNU version of that command, as obtained with man cp on a Linux system.
- Run time ls / and discuss the meaning of the three time values shown at the end of the output (see bash(1)).
- Run time head -c 100000000 /dev/urandom > /dev/null and i=0; time while ((i++ < 1000000)); do :; done. Run htop(1) in parallel on another terminal and observe the difference.
- Read syscalls(8).
- The strace(1) command prints the system calls that the given command performs. Guess how many system calls the command ls -l will make. Run strace -c ls -l for the answer. Read the strace(1) man page to find suitable command line options to only see the system calls which try to open a file.
- Guess how many man pages are installed on your system, then run whatis -w '*' | wc -l to see how close your guess was.
- Think about devices other than storage devices, for example printers, sound cards or displays, and discuss what open, read, and write should mean for these devices. One possible answer:
open(2)
could return all kinds of status information,
like the type, model and capabilities of the device. For example,
printers could return the number of paper trays, the amount of toner
left, etc. Writing to the file descriptor would cause output on the
device: printing the text that is written, playing the audio samples,
or showing the given text on the display. The point to
take away is that the open
, read
,
write
interface is a generic concept that works for
different kinds of devices, not only for storing data in a file on a
hard disk.
The path concept was introduced in the 1960s with the Multics
operating system. Paths will be familiar to the reader because
they are often specified as arguments to commands. Also many
system calls receive a path argument. A path is a non-empty
string of path components which are separated by slash
characters. An absolute path is a path that starts with a
slash, all other paths are called relative. A relative path
has to be interpreted within a context that implies the leading
part of the path. For example, if the implied leading part is
/foo/bar
, the relative path baz/qux
is
equivalent to the absolute path /foo/bar/baz/qux
.
Given a path, there may or may not exist a file or a
directory that corresponds to the path. Path lookup is the
operation which determines the answer to this question, taking the
implied leading part into account in case of relative paths. This
operation is always performed within the kernel and turns out to
be surprisingly complex due to concurrency and performance issues.
Consult path_resolution(7)
on a Linux system to learn
more about how pathnames are resolved to files.
If a path was successfully looked up, each path component up to the
second-last refers to an existing directory while the last component
refers to either a file or a directory. In both cases the directory
identified by the second-last component contains an entry named by the
last component. We call those paths valid. The valid paths
give rise to a rooted tree whose interior nodes are directories and
whose leaf nodes are files or directories. Note that the validity of a
path depends on the set of existing files, not just on the path itself,
and that a valid path may become invalid at any time, for example if
a file is deleted or renamed. Many system calls which receive a path
argument perform path lookup and fail with the No such file or
directory
error if the lookup operation fails.
It depends on the underlying filesystem whether the path components
are case-sensitive or case-insensitive. That is,
whether paths which differ only in capitalization (for example
foo
and Foo
) refer to the same file.
Since the hierarchy of files may be comprised of several filesystems,
some components of the path may be case-sensitive while others are
case-insensitive. As a rule of thumb, Unix filesystems are
case-sensitive while Microsoft filesystems are case-insensitive even
when mounted on a Unix system.
Path components may contain every character except the Null
character and the slash. In particular, space and newline characters
are allowed. However, while dots are allowed in path components if
they are used together with other characters, the path components
.
and ..
have a special meaning: every
directory contains two subdirectories named .
and
..
which refer to the directory itself and its parent
directory, respectively.
One consequence of the special meaning of the path components
.
and ..
is that there are infinitely
many paths which refer to the same file. Indeed, given the path
/foo/bar
, one may construct different paths which
refer to the same file by inserting any number of /.
or ../foo
after the first component. For example,
/foo/./bar
and /foo/../foo/bar
both refer
to the same file. If relative paths have to be taken into account as
well, even more paths can be constructed.
This illustrates the fundamental difference between a path and a file. Paths can be mapped to files, but not the other way around. In particular, there is no such thing as "the list of paths which have changed since yesterday". Also, it is incorrect to assume that two files are different just because their paths are different. The concept of hard and soft links discussed below further complicates the relationship between paths and files.
Globbing, also known as pathname expansion, is a pattern matching language for paths which was already present in the earliest Unix versions. The glob operation generates a set of valid paths from a glob pattern by replacing the pattern by all matching paths.
Glob patterns may contain special characters called wildcards. The wildcard characters are:

- *: match any string
- ?: match any single character
- [...]: match any of the enclosed characters

The complete syntax rules for glob patterns and the exact
semantics for pattern matching are described in POSIX and in
glob(7)
. Any POSIX-compliant shell performs globbing
to construct the command to be executed from the line entered at
the prompt. However, POSIX also demands system functions which make
globbing available to other applications. These are implemented as
part of libc.
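A quick way to get a feeling for globbing is to let echo print the expansion of a few patterns in a scratch directory (the file names below are arbitrary):

```sh
# Create some files, then let the shell expand the patterns before
# echo even runs -- echo only sees the resulting file names.
touch a.txt b.txt c.dat
echo *.txt        # a.txt b.txt
echo ?.dat        # c.dat
echo [ab].*       # a.txt b.txt
```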
There are a few quirks related to globbing which are worth
pointing out. First, if no valid path matches the given pattern, the
expansion of the pattern is, by definition according to POSIX, the
pattern itself. This can lead to unexpected results. Second, files
which start with a dot (so-called hidden files) must be
matched explicitly. For example, rm *
does not
remove these files. Third, the tilde character is not a wildcard,
although it is also expanded by the shell. See the exercises for more
examples.
POSIX globbing has some limitations. For example, there is no
glob pattern which matches exactly those files that start with an
arbitrary number of a
characters. To overcome these
limitations, some shells extend the matching language by implementing
extended glob patterns which are not covered by POSIX. For
example, if the extended globbing feature of
is
activated via the extglob
option, the extended glob
pattern +(a)
matches the above set of files.
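A minimal sketch of this bash-specific feature, entered at an interactive bash prompt:

```sh
# Enable extended globbing, then match files consisting of one or
# more 'a' characters, optionally followed by arbitrary text.
shopt -s extglob
touch a aa aab ba
echo +(a)         # a aa
echo +(a)*        # a aa aab
```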
The output of ls -l indicates the type of the last path component: for directories a d character is shown while files (also called regular files in this context) get a hyphen character (-). Besides directories and regular files, the following special file types exist:

- Soft links (l): named references to other files, discussed in more detail below.
- Device nodes (c and b): device nodes come in two flavors, character devices (c) and block devices (b). Regardless of the flavor, each device node has a major and a minor number associated with it. The major number indicates the type of the device (e.g. a hard drive, a serial connector, etc.) while the minor number enumerates devices of the same type. On most systems the device nodes are created and deleted on the fly as the set of connected devices changes, for example due to a USB device being added or removed. However, device nodes can also be created manually with the mknod(1) command or the mknod(2) system call. Device nodes do not necessarily correspond to physical devices. In fact, POSIX demands the existence of a couple of virtual devices with certain properties. We look at some of these in the exercises. The access to device nodes which do correspond to physical devices is usually restricted to privileged users.
- Sockets (s): endpoints for inter-process communication. The AF_INET and AF_INET6 address families are for internet protocols (IP) while AF_LOCAL (also known as AF_UNIX) is used for communication between processes on the same machine. These local sockets are also called Unix domain sockets. They can be bound to a path which refers to a file of type socket. Regardless of the address family, processes can exchange data via sockets in both directions, but the local sockets support additional features, like passing process credentials to other processes.
- Fifos (p): named pipes, which are created with the mkfifo(1) command or the mkfifo(3) library function.

Note that the type of a file is never inferred from the path. In particular the suffix of the path (everything after the last dot) is just a convention and has no strict connection to the file type. Also there is no difference between text and binary files.
The stat(2)
system call returns the metadata
of the file or directory that corresponds to the given path. The
stat(1)
command is a simple program which executes this
system call and prints the thusly obtained metadata in human-readable
form, including the file type. This is done without looking at the
contents of the file because metadata are stored in a special area
called the inode. All types of files (including directories)
have an associated inode. Besides the file type, the inode stores
several other properties prescribed by POSIX. For example the file
size, the owner and group IDs, and the access permissions are all
stored in the inode. Moreover, POSIX requires three timestamps to be maintained in each inode:

- atime: the time of last access, updated when the file contents are read.
- mtime: the time of last modification, updated when the file contents change.
- ctime: the time of last status change, updated when the inode (i.e., the metadata) changes.
To illustrate the difference between the mtime and the ctime,
consider the chgrp(1)
command which changes the group
ID of the file or directory identified by its path argument. This
command sets the ctime to the current time while the mtime is left
unmodified. On the other hand, commands which modify the contents of
a file, such as echo foo >> bar
, change both the mtime
and the ctime.
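The effect can be observed with the -c option of GNU stat(1), which is not part of POSIX; %y prints the mtime and %z the ctime. The group name used below is just an example and must exist on the system:

```sh
echo hello > foo
stat -c 'mtime: %y   ctime: %z' foo
chgrp users foo                        # assumes a group named "users"
stat -c 'mtime: %y   ctime: %z' foo    # ctime changed, mtime did not
echo world >> foo
stat -c 'mtime: %y   ctime: %z' foo    # now both timestamps changed
```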
The inode of each file or directory contains twelve mode
bits, nine of which are the permission bits which
control who is allowed to access the file or directory, and how. The
permission bits are broken up into three classes called user
(u
), group (g
) and others
(o
). Some texts refer to the first and last class as
"owner" and "world" instead, but we won't use this naming to avoid
confusion. Each class contains three bits. The bits of the "user"
class apply to the file owner, that is, the user whose ID is stored in
the inode. The "group" category applies to all non-owners who belong
to the group whose ID is stored in the inode. The third category
applies to all remaining users. The three bits of each class refer to
read/write/execute permission. They are therefore named r
,
w
and x
, respectively. The permission
bits mean different things for directories and non-directories,
as described below.
| | Directories | Non-directories |
|---|---|---|
| r | The permission to list the directory contents. More precisely, this bit grants the permission to call opendir(3) to obtain a handle to the directory which can then be passed to readdir(3) to obtain the directory contents. | If read permission is granted, the open(2) system call does not fail with the permission denied error, provided the file is opened in read-only mode. The system call may fail for other reasons, though. |
| w | The permission to add or remove directory entries. That is, to create new files or to remove existing files. Note that write permission is not required for the file that is being removed. | The permission to open the file in write-only mode in order to perform subsequent operations like write(2) and truncate(2) which change the contents of the file. Non-directories are often opened with the intention to both read and write. Naturally, such opens require both read and write permissions. |
| x | The permission to search the directory. Searching a directory means to access its entries, either by retrieving inode information with stat(2) or by calling open(2) on a directory entry. | The permission to run the file. This applies to binary executables as well as to text files which start with a shebang, #!, followed by the path to an interpreter. We shall cover file execution in more detail below. |
To run the regular file /foo/bar/baz
, search
permission is needed for both foo
and bar
,
and execute permission is needed for baz
. Similarly, to
open the regular file foo/bar
for reading, we need execute
permissions on the current working directory and on foo
,
and read permissions on bar
.
A numeric permission mode is a three-digit octal number, where the digits correspond to the user, group and other classes described above, in that order. The value of each digit is derived by adding up the bits with values 4 (read), 2 (write), and 1 (execute). The following table lists all eight possibilities for each of the three digits.
| Octal Value | Symbolic Representation | Meaning |
|---|---|---|
| 0 | --- | no permissions at all |
| 1 | --x | only execute permission |
| 2 | -w- | only write permission |
| 3 | -wx | write and execute permission |
| 4 | r-- | only read permission |
| 5 | r-x | read and execute permission |
| 6 | rw- | read and write permission |
| 7 | rwx | read, write and execute permission |
The chmod(1)
command changes the permission
bits of the file identified by the path argument. For example,
chmod 600 foo
sets the permissions of foo
to
rw-------
. Besides the octal values, chmod(1)
supports symbolic notation to address the three classes described
above: u
selects the user class, g
the
group class, o
the class of other users. The symbolic
value a
selects all three classes. Moreover, the letters
r
, w
and x
are used to set or
unset the read, write and execute permission, respectively. The above
command is equivalent to chmod u=rw,g=,o= foo (an empty permission
list after = clears all permissions for that class). The
+
and -
characters can be specified instead
of =
to set or unset specific permission bits while
leaving the remaining bits unchanged. For example chmod go-rw
foo
turns off read and write permissions for non-owners.
Unprivileged users can only change the mode bits of their own files or directories while there is no such restriction for the superuser.
Links make it possible to refer to identical files through different paths. They come in two flavors: hard and soft. Both types of links have advantages and disadvantages, and different limitations. We start with hard links because these existed already in the earliest Unix versions.
A file can have more than one directory entry that points to its
inode. If two directory entries point to the same inode, they are
said to be hard links of each other. The two entries are
equivalent in that they refer to the same file. It is impossible
to tell which of the two is the "origin" from which the "link"
was created. Hard links are created with the link(2)
system call or the ln(1)
command. Both take two path
arguments, one for the existing file and one for the directory entry
to be created. The filesystem maintains in each inode a link counter
which keeps track of the number of directory entries which point to the
inode. The link(2)
system call increases the link count
while unlink(2)
decrements the link count and removes
the directory entry. If the decremented counter remains positive,
there is still at least one other directory entry which points to
the inode. Hence the file is still accessible through this other
directory entry and the file contents must not be released. Otherwise,
when the link counter reaches zero, the inode and the file contents
may be deleted (assuming the file is not in use).
There are several issues with hard links. For one, hard links
can not span filesystems. That is, the two path arguments for
link(2)
have to refer to files which reside on the
same filesystem. Second, it is problematic to create hard links to
directories. Early Unix systems allowed this for the superuser,
but on Linux the attempt to hard-link a directory always fails.
To address the limitations of hard links, soft links, also
called symbolic links (or symlinks for short),
were introduced. A soft link can be imagined as a special text file
containing a single absolute or relative path, the target of
the link. For relative paths the implied leading part is the directory
that contains the link. A soft link is thus a named reference in
the global hierarchy of files. Unlike hard links, the soft link
and its target do not need to reside on the same filesystem, and
there is a clear distinction between the link and its target. Soft
links are created with symlink(2)
or by specifying the
-s
option to the ln(1)
command.
A soft link and its target usually point to different inodes. This
raises the following question: Should system calls which receive a
path argument that happens to be a soft link operate on the link
itself, or should they follow (or dereference)
the link and perform the operation on the target? Most system calls
follow soft links, but some don't. For example, if the path argument
to chdir(2)
happens to be a soft link, the link is
dereferenced and the working directory is changed to the target of
the link instead. The rename(2)
system call, however,
does not follow soft links and renames the link rather than its
target. Other system calls, including open(2)
, allow
the caller to specify the desired behaviour by passing a flag to the
system call. For yet others there is a second version of the system
call to control the behaviour. For example, lstat(2)
is
identical to stat(2)
, but does not follow soft links.
It is possible for a soft link to refer to an invalid path. In
fact, ln(1)
and symlink(2)
do not consider
it an error if the target does not exist, and happily create a soft
link which points to an invalid path. Such soft links are called
dangling or broken. Dangling soft links also occur
when the target file is removed or renamed, or after a mount point
change.
Soft links may refer to other soft links. System calls which
follow soft links must therefore be prepared to resolve chains of
soft links to determine the file to operate on. However, this is not
always possible because soft links can easily introduce loops into the
hierarchy of files. For example, the commands ln -s foo bar;
ln -s bar foo
create such a loop. System calls detect this
and fail with the Too many levels of symbolic links
error when they encounter a loop.
Another issue with both soft and hard links is that there is no simple way to find all directory entries which point to the same path (soft links) or inode (hard links). The only way to achieve this is to traverse the whole hierarchy of files. This may be prohibitive for large filesystems, and the result is unreliable anyway unless the filesystems are mounted read-only.
It is possible to choose among all possible paths which refer
to the same file a canonical path, that is, a shortest
(counting characters) absolute path which does not contain any soft
links. In fact, the POSIX standard requires each Unix system library
to provide the realpath()
library function which performs
the following substitutions on the given path: First, the path to
the current working directory is prepended if the given path is
relative (does not begin with a slash). Second, symbolic links are
replaced by their targets. Third, any occurrences of /.
and foo/..
are removed. The thusly transformed path is
returned by the function as the canonical path.
Each path can be canonicalized in this way, but not all paths
which refer to the same file give rise to the same canonical path. For
example, /tmp/foo
and /tmp/bar
could refer
to regular files which are hard links of each other. In this case the
paths refer to the same file, yet the paths are different and already
canonicalized. The same can happen when a file system (or a subtree
of it) is bind mounted. That is, the file system tree is
visible at two or more locations in the global directory tree.
Exercises:

- Suppose foo is an existing directory. Guess what the command mv foo bar will do in each of the following cases: (a) bar does not exist, (b) bar exists and is a regular file, (c) bar exists and is a directory. Verify your guess by running the command.
- Shell scripts frequently check for the existence of a file with code like if test -e "$file"; then something_with "$file"; fi. Explain why this approach is not bullet-proof. How could this be fixed?
- Run touch file-{1..100} to create 100 files. Guess what the following commands will print. Run each command to confirm.
  - ls file-
  - ls file-*
  - ls file-?
  - ls file-[1-4]
  - ls file-[1,3,5,7,9]*
- Find an extended glob pattern for bash(1) that matches all valid paths whose last component starts with file-, followed by any number of odd digits (1, 3, 5, 7, or 9).
- Discuss what happens if no file matches in a loop like for f in file-*; do something_with "$f"; done. Hint: Search bash(1) for "nullglob".
- Create a file named -r with echo > -r. Try to remove the file with rm -r and discuss why this doesn't work as expected. Find a way to get rid of the file. Discuss what happens if you run rm * in a directory which contains a file named -r.
- The PATH variable is a colon-separated list of directories in which the shell looks for commands to execute. Discuss the dangers of including the current working directory in this list.
- Run id to determine a group G you belong to which is not your primary group. Consider the following commands: mkdir foo; chgrp $G foo; touch foo/bar. What is the group ID of foo/bar? Run the same commands, but insert chmod g+s foo as the second-to-last command.
- Run man null and man zero to learn about the properties of these two character devices.
- Look up the -d option of touch(1).
- Run echo hello > foo, cat foo, chmod 600 foo, echo world >> foo. Check the three timestamps with stat foo after each command.
- Check the permission bits of your home directory with ls -ld ~. Who can access its contents? Also look at the permission bits of other people's home directory.
- Discuss the consequences of the w bit being set in the others class of the permission bits of a directory. Create a world-writable directory with mkdir foo; chmod 777 foo and create a file in the new directory: echo hello > foo/bar. Is a different user allowed to create another file there (echo world > foo/baz)? Can he remove it again (rm foo/baz)? Will he succeed in removing foo/bar although it is owned by you and not writable to him? Try the same with the sticky bit turned on (chmod 1777 foo).
- Translate rw-r--r-- into octal, and 755 into rwx-notation.
- Set the permission bits of a directory to r--------, then check whether you still can list/execute it. Do the same with --x------.
- Create a file with echo hello > foo, then create soft and hard links with ln -s foo soft and ln foo hard. Examine the inode numbers and link counts using the command stat foo soft hard. Remove foo and repeat. Try to access the file content through both links. Guess what realpath foo soft hard will print. Run the command to confirm.
- Create a dangling soft link with ln -s /nope foo. Would you expect the commands ls -l foo and cat foo to succeed? Run these commands to verify your guess.
- Discuss how to check whether two paths refer to the same file. The inode numbers can be compared with ls -i. In the general case one additionally has to check that the two device IDs which identify the underlying file systems are also identical. Like the inode number, the device ID is part of the metadata of the file. It can be obtained by running stat(1).
- Run find /usr/bin/ -perm -4000 -ls to see all SUID executables in /usr/bin. Discuss why those programs have the SUID bit set.
- Suppose the directory foo contains a file named a and a soft link named testdir which points back to foo itself. How many different paths refer to the file a? Discuss whether the question "What's the path of a given file?" makes sense. One possible answer: since foo/a, foo/testdir/a, foo/testdir/testdir/a etc. all refer to the same file, there are infinitely many paths for the file a. Hence the question makes no sense: there is no such thing as the path to a file.
- Recall that the path component .. refers to the parent directory. Give an example of a path to a directory where "parent directory" and "directory identified by the path obtained by removing the last path component" are different. Which of the two interpretations of .. does bash apply when you type cd .. at the bash prompt?
A program consists of instructions and data stored in a regular file. A user process is an instance of a running program. This is in contrast to kernel processes which are created directly by the kernel and have no relationship to executable files. Since we shall only be concerned with user processes, we will refer to these as "processes" from now on. In this section we'll see how processes are created and removed. We will then take a closer look at the environment of a process and discuss how processes communicate with each other.
When the system boots, there is only one process, the
init process, which is created by the kernel at the end
of the boot sequence by executing /sbin/init
. All
other processes are created from existing processes by means of
the fork(2)
system call. The process which called
fork(2)
is said to be the parent of the newly
created child. After fork(2)
returns, both
parent and child are executing independently of each other. Both
processes may call fork(2)
again to create further
processes. This gives rise to a tree structure where the processes
are the nodes of the tree with init being the root node. The edges
of the tree describe the parent-child relationships.
If there are more processes than CPUs, not all processes can run simultaneously. It is the mission of the kernel's task scheduler to assign processes to CPUs and to perform context switches. That is, to take away the CPU from a running process in order to give another process the chance to run. The scheduler has to choose the duration of each process' time slice and it must pick the process to switch to when the time slice of the current process has expired or the process gives up the CPU voluntarily, for example because it is waiting for an I/O operation to complete. This is a non-trivial task at least for modern multi-CPU systems with non-uniform memory access (NUMA) where the memory access times depend on the memory location and the processor core. Things don't get easier if the CPU clock speed can vary and/or scheduling must be power-aware to optimize battery time. To make good decisions, some information has to be provided by the processes or by a system-wide policy. One elementary way to prioritize certain processes over others is via nice levels which we shall discuss below.
The normal way for a process to terminate is to call
exit(3)
after it has done its work. This function
transfers an integer value, the exit status, to the
kernel. The exit status can only be retrieved by the parent of
the terminating process. To illustrate this concept, imagine an
interactive shell which creates one child each time the user enters
a command. The children are short-lived while the parent, the shell,
stays around for much longer. During command execution the parent needs
to wait for its child to terminate. It may then want to tell whether
the child has terminated successfully. To achieve this, the parent
calls one of the wait system calls (wait(2)
,
waitpid(2)
, waitid(2)
) which block until the
child has terminated, then return the child's exit status. After the
child called exit(3)
but before the parent has called
one of the wait functions, the kernel needs to keep at least the
exit status (and possibly further information) about the terminated
child. During this time window the child has already terminated but
a part of it still exists in kernel memory. Processes in this state
are aptly called zombies.
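The shell exposes the same machinery: $? holds the exit status of the last command, and the wait builtin collects the status of a background child much like a parent process calling wait(2).

```sh
false; echo $?     # 1: the command "failed"
true;  echo $?     # 0: success
sleep 1 &          # create a child process
wait $!            # block until the child has terminated
echo $?            # exit status of the child (0)
```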
Unlike in the shell scenario outlined above, a process might well have any number of children at the time it terminates. Its children then become orphans as they lose their parent. The kernel cannot simply remove the terminated process from the process tree because this would disconnect its orphaned children from the other processes in the tree, destroying the tree structure. To avoid this, orphans are reparented to init, that is, made children of the init process. This works because the init process never terminates.
There are several programs which show information about
processes. The POSIX ps(1)
command prints a list of
processes. It has many options that control which processes are
shown and how the output is formatted. Similar programs which are not
covered by POSIX are pstree(1)
, top(1)
and
htop(1)
. The former shows the tree structure while the
latter two provide a dynamic real-time view of the process tree. The
exercises of this section invite the reader to become familiar with
these essential programs.
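A brief session with these tools might look as follows; pstree(1) and its -p option are common but, unlike ps(1), not required by POSIX.

```sh
ps -ef | head -n 5    # a few processes, including PID 1 (init)
pstree -p $$          # the subtree rooted at the current shell
ps -o pid,ppid,comm   # PID, parent PID and command of the processes
                      # associated with the current terminal
```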
When a process calls fork(2)
, the newly created
child process starts out as a copy of the calling process. However,
the reason to create a new process is usually to let the child do
something different than the parent. Therefore, fork(2)
is often followed by a second system call which replaces the child
process with a different program. There are several similar system
calls which do this, with slight semantic differences. We refer to
this family of system calls as the exec system calls.
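The shell makes the same mechanism available through its exec builtin, which replaces the shell process by the given program instead of forking a child first. A tiny demonstration:

```sh
# The sh -c child first prints its PID, then replaces itself with
# ls(1); the final echo is never reached because the shell is gone.
sh -c 'echo "pid $$ turns into ls"; exec ls /; echo "never printed"'
```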
All exec system calls receive a path argument from which they determine an executable file that contains the program to run. Linux and BSD store executables in Executable and Linkable Format (ELF). Executables are typically linked dynamically. That is, the dependent libraries (at least libc, but possibly many more) are loaded at runtime from separate files. This is in contrast to static linking which includes all dependencies in the executable, making the executable self-contained but also larger and harder to maintain. Regardless of the type of linking, when the program is loaded, it completely replaces the previously running program. Note that there can be more than one process at the same time which executes the same program.
Files in ELF format are called native executables because
they contain machine instructions which can be executed directly
by the CPU. Another type of executables are scripts,
also known as interpreter files. Scripts are text files
which start with the shebang (#!
). They can
not run directly but have to be interpreted at runtime
by another program, the interpreter. Nevertheless, it is
possible to execute a script as if it were a native executable:
by passing the path to one of the exec system calls or by typing
the path at the shell prompt. The exec system call recognizes that
the given file is a script by investigating the first line, which
contains the path to the interpreter after the shebang, optionally
followed by options to the interpreter. The kernel then executes the
interpreter rather than the script, passing the path to the script as
an argument. For example, if /foo/bar
is being executed,
and the first line of this file is #!/bin/sh
, the kernel
executes /bin/sh /foo/bar
instead. Popular interpreters
besides /bin/sh
include /bin/bash
,
/usr/bin/perl
, /usr/bin/python
and
/usr/bin/awk
.
The kernel must always be aware of the set of all objects which are currently in use. This set is often called the system-wide table of open files although not all entries refer to files. In fact, an entry may refer to any object that supports I/O operations, for example a network socket. Each entry is called a file description, which is a somewhat unfortunate term that was coined by POSIX. A file description records information about the object itself as well as the current state of the reference, including the file offset, if applicable, and the status flags which affect how future I/O operations are going to be performed through this reference.
The kernel maintains for each process an array of pointers to file
descriptions. Each such pointer is a file descriptor. Unlike
files and file descriptions, a file descriptor always corresponds
to a process and is identified by a non-negative number, the index
into the pointer array of that process. This index is returned by
system calls like open(2)
or socket(2)
.
As far as user space programs are concerned, a file descriptor is
synonymous with this integer. It can be regarded as an abstract
handle that must be supplied to subsequent I/O operations
like read(2)
or write(2)
to tell the system
call the target object of the operation.
The shell automatically creates three file descriptors for each
process which are identified by the integers 0, 1 and 2. They are
called stdin, stdout, and stderr, which is
short for standard input/output/error. It is possible, and in
fact common, that all three file descriptors point to the same file
description: the terminal device. Many command line tools read their
input from stdin, write normal output to stdout, and error messages
to stderr. For example, when the POSIX command cat(1)
is invoked with no arguments, it reads data from stdin and writes
the same data to stdout.
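The sketch below shows how a program can use the three standard file descriptors directly: it copies stdin to stdout, much like cat(1) without arguments, and reports errors on stderr. It is a simplified example, not the actual implementation of cat(1).

#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    /* Copy stdin (fd 0) to stdout (fd 1). */
    while ((n = read(0, buf, sizeof(buf))) > 0)
        if (write(1, buf, n) != n)
            return 1;
    if (n < 0) {
        write(2, "read error\n", 11); /* report the problem on stderr (fd 2) */
        return 1;
    }
    return 0;
}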
Signals are another ancient Unix concept that dates back to the early 1970s and was standardized in POSIX long ago. This concept facilitates a rudimentary form of inter process communication (IPC) between unrelated processes. Signals can be regarded as software interrupts that transmit a notification event from the sending process to the target process. The event is sent asynchronously, meaning that the interruption can happen at any location of the code flow.
It is fair to say that most non-trivial programs, including scripts, have to deal with signals. All major scripting languages (bash, python, perl, ruby, etc.) provide an API for signal handling. The interpreter of the scripting language ends up calling the POSIX system functions, so we will only look at these.
Signals are identified by name or by a numerical ID. For example,
SIGINT
(interrupt from keyboard) is the name for signal
number 2. POSIX defines 31 standard signals plus at least
eight real-time signals. The standard signals can be
subdivided according to the origin of the signal as follows.
One group of signals originates from a hardware trap. For example, if a process executes an invalid arithmetic operation such as a division by zero, the CPU raises a trap and the kernel generates the SIGFPE (floating-point exception) signal for the process. Another example for a signal that originates from a hardware trap is SIGSEGV (segmentation fault), which occurs when a process attempts to reference a memory address that has not been mapped (i.e., marked as valid) by the memory management unit (MMU) of the CPU. A second group is generated by the kernel to notify the process of an event. One example is SIGCHLD (child terminated), which is sent to the parent process when one of its child processes terminates. Another example is SIGWINCH (window resize), which is generated when the geometry of the controlling terminal of a process changes. Finally, a signal can be generated on behalf of another process which, like the kill(1) command, calls raise(2) or kill(2) to instruct the kernel to generate a signal. Examples are SIGTERM, which issues a termination request, and SIGUSR1 and SIGUSR2, which are reserved for use by application programs. See signal(7) for the full list of signals and their semantics.
SIGINT, SIGTERM and SIGKILL. SIGINT is generated for the foreground processes when the interrupt character (CTRL+C) is pressed in a terminal. For example, if CTRL+C is pressed while the shell pipeline find | wc is executing, SIGINT is sent to both processes of the pipeline. SIGTERM is the default signal for the kill(1) command. It requests the target process to run its shutdown procedure, if any, then terminate. SIGKILL instructs the kernel to terminate the target process immediately, without giving the process the chance to clean up. This signal can originate from a process or from the kernel in response to running out of memory. To keep the system working, the kernel invokes the infamous out of memory killer (OOM killer) which terminates one memory-hungry process to free some memory.
SIGSTOP, SIGTSTP and SIGCONT. SIGSTOP instructs the task scheduler of the kernel to no longer assign any CPU time to the target process until the process is woken up by a subsequent SIGCONT. SIGTSTP (stop typed at terminal) stops all foreground processes of a terminal session. It is generated when the stop character (CTRL+Z) is pressed in a terminal.
Processes may set the signal disposition of most signals
to control what happens when the signal arrives. When no disposition
has been set, the signal is left at its default disposition so
that the default action is performed to deliver the signal.
For most signals the default action is to terminate the process,
but for others the default action is to ignore the signal.
If the signal is neither ignored nor left at its default disposition,
it is said to be caught by the process. To catch a signal the
process must tell the kernel the address of a function, the signal
handler, to call in order to deliver the signal. The set of
signal dispositions of a process can thus be imagined as an array
of function pointers with one pointer per possible signal. If the
process catches the signal, the pointer points to the corresponding
signal handler. A NULL pointer represents a signal that was left at
its default disposition while the special value SIG_IGN
indicates that the process explicitly asked to ignore this signal.
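The following sketch illustrates how a process may change signal dispositions with the POSIX sigaction(2) and signal(2) functions: it catches SIGINT with a handler and sets the disposition of SIGQUIT to SIG_IGN. The flag variable and the printed message are made up for this example.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint;

static void handler(int sig)
{
    (void)sig;
    got_sigint = 1; /* only set a flag: most functions are not async-signal-safe */
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;       /* catch SIGINT: disposition = handler address */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    signal(SIGQUIT, SIG_IGN);      /* ignore SIGQUIT: disposition = SIG_IGN */

    while (!got_sigint)
        pause();                   /* sleep until a signal is delivered */
    printf("caught SIGINT, exiting\n");
    return 0;
}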
Signals can also be blocked and unblocked. When
a signal is generated for a process that has it blocked, it remains
pending. Pending signals cause no action as long as the
signal remains blocked but the action will be performed once the
signal gets unblocked. SIGKILL
and SIGSTOP
can not be caught, blocked, or ignored.
Real-time signals are similar to SIGUSR1
and
SIGUSR2
in that they have no predefined meaning but
can be used for any purpose. However, they have different semantics
than the standard signals and support additional features. Most
importantly, real-time signals are queued, meaning that in
contrast to standard signals the same real-time signal can be pending
multiple times. Also, the sending process may pass an accompanying
value along with the signal. The target process can obtain this
value from its signal handler along with additional information like
the PID and the UID of the process that sent the signal.
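The sketch below illustrates these features under the assumption that the program is started once without arguments as the receiver and once with a PID and a value as the sender; this command line convention is invented for the example. The sender passes a value along with SIGRTMIN via sigqueue(3), and the receiver obtains it, together with the sender's PID, from the siginfo structure of its SA_SIGINFO handler.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Print the accompanying value and the sender's PID. Calling printf()
 * from a signal handler is not async-signal-safe; acceptable for a demo. */
static void handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    printf("got value %d from PID %ld\n",
        si->si_value.sival_int, (long)si->si_pid);
}

int main(int argc, char **argv)
{
    if (argc == 3) {                   /* sender mode: <pid> <value> */
        union sigval sv;

        sv.sival_int = atoi(argv[2]);
        sigqueue((pid_t)atoi(argv[1]), SIGRTMIN, sv);
        return 0;
    }
    struct sigaction sa;               /* receiver mode: wait for SIGRTMIN */

    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGRTMIN, &sa, NULL);
    printf("receiver PID: %ld\n", (long)getpid());
    for (;;)
        pause();
}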
Some system calls including read(2)
and
write(2)
may block for an indefinite time. For
example, reading from a network socket blocks until there is data
available. What should happen when a signal arrives while the process
is blocked in a system call? There are two reasonable answers: Either
the system call is restarted, or the call fails with the
Interrupted system call
error. Unfortunately, different
flavors of Unix handle this case differently by default. However,
applications may request either behaviour by setting or clearing the
SA_RESTART
flag on a per-signal basis.
Now that we have a rough understanding of processes, we take a closer look at the information the kernel maintains for each process. We already discussed the array of file descriptors and the array of signal dispositions. Clearly both are process-specific properties. As we shall see, there is much more that constitutes the environment of a process.
Each process is identified by a unique process ID
(PID), which is a positive integer. The init
process is
identified by PID 1. PIDs are assigned in ascending order, but are
usually restricted to the range 1..32767. After this many processes
have been created, PIDs wrap and unused PIDs are recycled for new
processes. Thus, on a busy system on which processes are created and
terminated frequently, the same PID is assigned to multiple processes
over time.
Not all processes call fork(2)
to create a child
process, but each process except the init process has a unique
parent. As described before, this is either the "real" parent (the
process which created the process earlier) or the init process that
"adopted" the orphaned process in case the real parent terminated
before the child. The process ID of the parent (PPID) is thus
well-defined. A process can obtain its PID and PPID with the
getpid(2)
and getppid(2)
system calls.
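For example, the following trivial program prints its own PID and the PID of its parent.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Print the process ID of this process and of its parent. */
    printf("PID: %ld, PPID: %ld\n", (long)getpid(), (long)getppid());
    return 0;
}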
Each process runs on behalf of a user (possibly the superuser)
which is identified by its user ID (UID) and belongs to
one or more groups, identified by one or more group IDs
(GIDs). The superuser is identified by UID zero. When we talked
about the permission bits of files and directories, we said that
suitable permissions are needed for system calls which operate on
files (open(2)
, stat(2)
, etc.). A more
precise statement is that the process which calls, say,
open(2)
needs to have these permissions. To decide this,
the kernel needs to take into account the UID and GIDs of the process
that called open(2)
, the UID and the GID stored in the
inode of the file that is being opened, and the permission bits of
this file. The UID is also taken into account for kill(2)
because unprivileged processes (non-zero UID) can only send signals
to processes which run on behalf of the same user while the superuser
may target any process.
Each process has a current working directory (CWD)
associated with it. When the user logs in, the CWD of the login shell
process is set to his home directory, which should always
exist and have the read, write and execute permission bits set for
the user. The CWD can later be changed with chdir(2)
and be retrieved with getcwd(3)
. The CWD is used as the
starting point for path searches for relative paths. It affects most
system calls which receive a path argument. For example, if the CWD
is /foo/bar
and the relative path baz/qux
is passed to open(2)
, the kernel will attempt to open
the file which is identified by /foo/bar/baz/qux
.
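The sketch below retrieves the current working directory, changes it to /tmp (an arbitrary example directory), and retrieves it again. Afterwards, relative paths passed to system calls by this process would be resolved under /tmp.

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[PATH_MAX];

    if (getcwd(buf, sizeof(buf)))       /* print the current working directory */
        printf("CWD: %s\n", buf);
    if (chdir("/tmp") == 0)             /* change the CWD of this process */
        if (getcwd(buf, sizeof(buf)))
            printf("CWD: %s\n", buf);
    return 0;
}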
Many programs accept arguments to control their behavior.
In addition to the path to the program that is to be executed,
all variants of the exec system calls receive an array of arguments
called the argument vector. For example, when the command
ls -l foo
is executed, the argument vector contains
the two strings "-l"
and "foo"
. Note that
the argument vector is not part of the program but is tied to the
process. It is passed to the main function at startup so that the
program may evaluate it and act accordingly.
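For example, the following program simply prints its argument vector. If it were compiled to a file named show (a made-up name) and run as ./show -l foo, it would print argv[0] = ./show, argv[1] = -l and argv[2] = foo.

#include <stdio.h>

int main(int argc, char **argv)
{
    int i;

    /* argv[0] holds the program name, argv[1] .. argv[argc - 1] the arguments. */
    for (i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}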
Another way to pass information to a program is via environment
variables. Environment variables are strings of the form
name=value
. POSIX describes the API to maintain the
environment variables of a process. Environment variables are set
with setenv(3)
or putenv(3)
, the value of a
variable can be retrieved with getenv(3)
, and a variable
and its value can be deleted with unsetenv(3)
. The set of
environment variables is sometimes called the environment
of the process, although we use this term in a broader sense to
describe the entirety of all metadata maintained by the kernel about
the process, including but not limited to environment variables.
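A small example of this API: the program below reads the HOME variable, then sets, reads and deletes a variable called MYVAR, a made-up name used only for illustration.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *home = getenv("HOME");  /* read a variable */

    printf("HOME = %s\n", home ? home : "(unset)");
    setenv("MYVAR", "42", 1);           /* set (or overwrite) a variable */
    printf("MYVAR = %s\n", getenv("MYVAR"));
    unsetenv("MYVAR");                  /* delete it again */
    return 0;
}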
Each process also has about a dozen resource limits
that can be set and queried with the POSIX setrlimit(2)
and getrlimit(2)
functions. Each limit controls a
different aspect. For example, RLIMIT_CPU
limits the
CPU time the process is allowed to use and RLIMIT_NOFILE
controls how many open files it may have at a time. For each resource
there is a soft and a hard limit. The kernel
enforces the value stored as the soft limit. This value may be set
by an unprivileged process to any value between zero and the hard
limit. Unprivileged processes may also reduce (but not increase) their
hard limits. Once a hard limit is reduced, it can not be increased
any more. For RLIMIT_CPU
a special convention applies:
If the soft limit is reached, the kernel sends SIGXCPU
(CPU time limit exceeded) to notify the process about this fact so
that it can terminate orderly (e.g., remove temporary files). When
the hard limit is reached, the kernel terminates the process as if
it received SIGKILL
.
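The following sketch queries the soft and hard limits for RLIMIT_NOFILE and then lowers the soft limit to 64, an arbitrary example value which must not exceed the hard limit.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Query the limit on the number of open files. */
    getrlimit(RLIMIT_NOFILE, &rl);
    printf("soft: %llu, hard: %llu\n",
        (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 64;                   /* lower the soft limit (example value) */
    setrlimit(RLIMIT_NOFILE, &rl);
    return 0;
}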
The nice level of a process provides a hint for
the task scheduler to give the process a lower or higher priority
relative to other processes. Nice levels range between -20 and 19. A
high nice level means that the process wants to be nice to other
processes, that is, should run with reduced priority. Low nice levels
indicate important processes that should be prioritized over other
processes. The default nice level is zero. Unprivileged users may
set the nice level of new processes with the nice(1)
command to any non-negative value. They can also increase the nice
level of their existing processes with renice(1)
, but
never decrease it. The superuser, however, may set the nice level
of any process to an arbitrary value in the valid range.
The bulk of the properties discussed so far are inherited by the
child after a fork(2)
. Specifically, the child gets the
same array of file descriptors and signal dispositions as its parent,
runs on behalf of the same user, has the same working directory,
the same resource limits and nice level, and also the same set
of environment variables with identical values. The PID and PPID,
however, are different of course.
After a process has called an exec function to replace itself with a new program, its signal handlers no longer exist because they were part of the program code which has been replaced. Therefore the exec system calls reset the disposition of all signals that were caught to the default disposition. Signals that were being ignored keep being ignored, however.
Although not covered by POSIX, at least Linux, NetBSD and FreeBSD
provide information about processes via the process filesystem
(procfs), which is usually mounted on /proc
. The process
filesystem is a pseudo-filesystem, i.e., it has no underlying
storage device. Files and directories are faked by the kernel as they
are accessed. Each process is represented by a numerical subdirectory
of /proc
which is named by the PID. For example,
/proc/1
represents the init process. The aforementioned
process utilities (ps(1)
, top(1)
, etc.) read
the contents of the process filesystem in order to do their job.
Each /proc/[pid]
directory contains the same set
of files although this set is different between Unixes. These files
expose much of the environment of the process to user space. The Linux
procfs implementation provides text files named environ
and limits
which contain the current environment and
the resource limits of the process, respectively. Moreover, the
file descriptor array of each process is exposed in the files of
the /proc/[pid]/fd
directory. Linux and NetBSD (but not
FreeBSD) also provide a cwd
soft link which points to
the current working directory of the process.
The pipe(2)
system call takes no arguments and
creates two file descriptors for the calling process which are tied
together as a unidirectional first in, first out data channel. One
file descriptor is the read end of the pipe, the other is
the write end. Data written to the write end is buffered by
the kernel and can be obtained by reading from the read end.
One application of pipes is communication between
related processes. A process first creates a pipe, then calls
fork(2)
to create a child process. The child inherits
a copy of both pipe file descriptors. Hence the parent process can
communicate with the child by writing a message to the write end of
the pipe for the child to read.
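A minimal sketch of this pattern: the parent creates a pipe, forks, and writes a short message (here just the string "hello") to the write end, which the child reads from the read end. Error handling is omitted for brevity.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];                 /* pfd[0]: read end, pfd[1]: write end */
    char buf[32];

    pipe(pfd);
    if (fork() == 0) {          /* child: read the message from the parent */
        ssize_t n;

        close(pfd[1]);          /* close the unused write end */
        n = read(pfd[0], buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("child received: %s\n", buf);
        }
        return 0;
    }
    close(pfd[0]);              /* parent: close the unused read end, */
    write(pfd[1], "hello", 5);  /* write the message, */
    close(pfd[1]);
    wait(NULL);                 /* and wait for the child to terminate */
    return 0;
}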
This approach depends on file descriptor inheritance across
fork(2)
, so it does not work in the situation
where neither process is an ancestor of the other. Files of
type fifo (named pipes) overcome this restriction. To
establish a connection between two unrelated processes,
both processes call open(2)
to obtain a file
descriptor which is associated with the fifo. One process passes
the O_WRONLY
flag to open the file for writing while
the other passes O_RDONLY
to open it for reading. The
two processes may then communicate in the same way as with the
pipe(2)/fork(2)
approach.
The POSIX dup(2)
and dup2(2)
system
calls allow a process to manipulate the entries of its file descriptor
array. In particular the standard file descriptors 0, 1, and 2 can be
replaced. By doing so before performing an exec system call, it can
be arranged that the replacement program starts with, say, its stdout
file descriptor redirected to the write end of a pipe. Note that
the replacement program does not need any modifications for this to
work, and might not even be aware of the fact that it is not writing
its output to the terminal as usual.
Shells employ this technique to implement the |
operator which "pipes" the output of one command "into" another
command. For example, the pipeline ls | wc
works
as follows: First the shell creates a pipe, then it calls
fork(2)
twice to create two processes which both
get a copy of the two pipe file descriptors. The first process
replaces its stdout file descriptor with the write end of the
pipe and performs an exec system call to replace itself with the
ls(1)
program. The second process replaces its stdin
file descriptor with the read end of the pipe and replaces itself
with wc(1)
. Since ls(1)
writes to stdout
and wc(1)
reads from stdin, wc(1)
processes
the output of ls(1)
.
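The following sketch imitates what the shell does for ls | wc: it creates a pipe, forks twice, redirects stdout of the first child and stdin of the second child with dup2(2), and replaces the children with ls(1) and wc(1). It is a simplified illustration, not the code of an actual shell.

#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];

    pipe(pfd);
    if (fork() == 0) {              /* first child: ls, stdout -> write end */
        dup2(pfd[1], 1);
        close(pfd[0]);
        close(pfd[1]);
        execlp("ls", "ls", (char *)NULL);
        _exit(1);                   /* only reached if exec failed */
    }
    if (fork() == 0) {              /* second child: wc, stdin <- read end */
        dup2(pfd[0], 0);
        close(pfd[0]);
        close(pfd[1]);
        execlp("wc", "wc", (char *)NULL);
        _exit(1);
    }
    close(pfd[0]);                  /* parent: close both ends and wait */
    close(pfd[1]);
    wait(NULL);
    wait(NULL);
    return 0;
}

A real shell additionally has to handle job control, signals and exit statuses, but the core of the | operator is exactly this combination of pipe(2), fork(2), dup2(2) and an exec system call.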
The POSIX standard requires a compliant Unix system to provide
a collection of functions that let applications perform input and
output by means of operations on streams. This programming
interface, known as stdio for standard input/output,
has been part of every Unix system since 1979. Every program which contains
a printf(3)
statement relies on stdio.
The stdio functions are implemented as part of libc on top of the
open(2)
, read(2)
and write(2)
system calls which are implemented in the kernel. Roughly speaking,
stdio replaces the file descriptor API by a more abstract API
which centers around streams. A stream is an opaque data structure
which comprises a file descriptor and an associated data buffer for
I/O. Each program has three predefined streams which correspond to
the three standard file descriptors (stdin, stdout and stderr). The
stdio API contains well over 50 functions to create and maintain
streams and to perform I/O on streams. These functions take care of
the characteristics of the underlying file description. For example,
they automatically try to select the optimal I/O buffer size.
Many applications rely on stdio because of convenience. For
one, the buffers for read(2)
and write(2)
must be allocated and freed explicitly by the application, and care
must be taken to not overflow these buffers. With stdio, this task
is done by the stdio library. Second, formatted I/O is
much easier to do with the stdio functions because the programmer
only has to provide a suitable format string to convert
between the machine representation and the textual representation of
numbers. For example, by passing the format string "%f"
to scanf(3)
, the programmer tells the stdio library to
read a floating point number stored in textual representation from the
specified stream, convert it to machine representation and store it
in the given variable. The fprintf(3)
function works the
other way round: the value is converted from machine representation
to text, and this text is written to the stream. Both functions can
deal with various formats, like scientific notation for floating
point numbers (e.g., 0.42E-23). With stdio it is easy to customize
the formatted output, for example add leading zeros or select the
number of decimal digits in the textual representation of a floating
point number.
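For example, the snippet below reads a floating point number from stdin with scanf(3) (for a double the conversion specification is "%lf") and writes it back with fprintf(3), once padded with leading zeros and two decimal digits and once in scientific notation.

#include <stdio.h>

int main(void)
{
    double x;

    if (scanf("%lf", &x) == 1)          /* text -> machine representation */
        /* machine representation -> text, in two different formats */
        fprintf(stdout, "%08.2f %E\n", x, x);
    return 0;
}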
Another reason why many programs rely on stdio is that it performs
buffered I/O. With buffered I/O not each read/write operation
results in a system call. Instead, data read from or written to the
stream is first stored in the user space buffer that is associated
with the stream. This is a performance improvement for applications
which perform many small I/O operations because every system call
incurs some overhead. Buffers may be flushed explicitly by
calling fflush(3)
, or implicitly by the stdio library. The
criteria which cause an implicit flush depend on the buffering
type of the stream. Three buffering types exist. If a stream is unbuffered, each I/O operation on the stream results in a read(2) or write(2) system call. By default the stderr stream is unbuffered to display error messages as quickly as possible. If a stream is line buffered, data is collected until a newline character is encountered; by default, streams which refer to a terminal (as determined by isatty(3)), such as stdout in an interactive session, are line buffered. Streams which do not refer to a terminal, for example streams which correspond to regular files or pipes, are fully buffered by default, meaning that data is collected until the buffer is full. The exercises on stdio focus on the three different buffering types because this is a common source of confusion.
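The buffering type of a stream can be changed with setvbuf(3), as in the following sketch, which makes stdout unbuffered and, purely for demonstration, switches stderr to full buffering (the opposite of its default).

#include <stdio.h>

int main(void)
{
    /* Make stdout unbuffered: every fputs() results in an immediate
     * write(2), regardless of whether stdout refers to a terminal. */
    setvbuf(stdout, NULL, _IONBF, 0);
    fputs("first\n", stdout);

    /* _IOLBF selects line buffering, _IOFBF full buffering. With full
     * buffering the text below only shows up at the fflush() (or at exit). */
    setvbuf(stderr, NULL, _IOFBF, BUFSIZ);
    fputs("second\n", stderr);
    fflush(stderr);
    return 0;
}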
Isolation refers to the concept that each process gets its own
virtual address space. A rough understanding of the memory
management system and the layout of the virtual address space of
a process helps to locate the source of common problems like the
infamous segmentation fault
error, and to realize that
putatively simple questions such as "how much memory is my process
currently using?" are in fact not simple at all, and need to be made
more precise before they can be answered.
Virtual memory is an abstraction of the available memory resources.
When a process reads from or writes to a memory location, it refers
to virtual addresses (illustrated as the left box of the
diagram). Virtual addresses are mapped by the MMU to physical
addresses which refer to physical memory locations (right
box). The mapped virtual address space of a process is a
collection of ranges of virtual addresses which correspond to physical
memory addresses (blue areas). By storing less frequently-accessed
chunks of virtual memory (yellow) on the swap area (grey), applications
can use more memory than is physically available. In this case the
size of the valid virtual addresses (blue and yellow areas together)
exceeds the amount of physical memory (orange and green areas). Any
attempt to access an unmapped memory location (red and yellow areas)
results in a page fault, a hardware trap which forces the CPU
back into kernel mode. The kernel then checks whether the address is
valid (yellow) or invalid (red). If it is invalid, the kernel sends
SIGSEGV
, which usually terminates the process with
the segmentation fault
error. Otherwise it allocates
a chunk of unused physical memory, copies the chunk from the swap
area to the newly allocated memory and adjusts the mapping (i.e.,
a yellow part becomes blue). The virtual memory concept increases
stability and security because no process can access physical memory
which belongs to the kernel or to other processes (orange areas).
We've already seen that the fork(2)
system call
creates a new process as a duplicate of the calling process. Since
the virtual address space of the calling process (a) might be large
and (b) is likely to be replaced in the child by a subsequent call
to an exec function, it would be both wasteful and pointless to
copy the full address space of the parent process to the child. To
implement fork(2)
efficiently, operating systems
employ an optimization strategy known as Copy on Write
(CoW). The idea of CoW is that if multiple callers ask for resources
which are initially indistinguishable, you can give them pointers to
the same resource. This illusion can be maintained until a caller
tries to modify its copy of the resource, at which point a true
private copy is created to prevent the changes from becoming visible to
everyone else. The primary advantage is that if a caller never makes
any modifications, no private copy ever needs to be created. The
fork(2)
system call marks the pages of the virtual address
space of both the parent and the child process as CoW by setting a
special bit in the page table entry which describes the
mapping between virtual and physical addresses of the MMU. As for
invalid memory accesses, the attempt to write to a CoW page results
in a page fault that puts the CPU back into kernel mode. The kernel
then allocates a new memory page on behalf of the process, copies
the contents of the page which caused the fault, changes the page
table mappings for the process accordingly and returns to user space.
This all happens transparently to the process.
The diagram on the left illustrates the layout of the virtual
address space of a process. At the top of the address space are the
argument vector and the environment variables. The stack
stores the local variables of the functions which are currently
being called, plus house-keeping data like the return addresses
of these functions. As more functions are called, the stack grows
downwards towards the lower addresses. Its current lower end is
called the stack pointer. The other variable area of the
address space is the heap, which contains the memory that
has been allocated on behalf of the process, for example with
malloc(3)
. As the process allocates more memory, the heap grows
upwards, towards the stack. The current end of the heap is called the
break (or program break). The lower part of the address space contains
three segments of fixed size. The text segment contains the
compiled machine instructions of the executable, the data
segment contains the initialized variables which are already known
at compile time. Finally, the BSS segment is allocated and
zeroed at execution time. This segment contains variables which should
be initialized to zero at startup. Unlike the data segment it is not
stored in the executable. BSS stands for "Block Started by Symbol",
which is a historic name coined in the 1950s. It has no relation to
the real meaning of the segment.
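To see where the various parts of the address space end up, one can print the addresses of a few representative objects, as in the sketch below. The exact numbers vary between runs, particularly with address space layout randomization enabled.

#include <stdio.h>
#include <stdlib.h>

int initialized = 42;   /* data segment: initialized at compile time */
int uninitialized;      /* BSS segment: zeroed at execution time */

int main(void)
{
    int local;                          /* stack */
    void *dynamic = malloc(16);         /* heap */

    printf("text:  %p\n", (void *)main);
    printf("data:  %p\n", (void *)&initialized);
    printf("bss:   %p\n", (void *)&uninitialized);
    printf("heap:  %p\n", dynamic);
    printf("stack: %p\n", (void *)&local);
    free(dynamic);
    return 0;
}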
Exercises:
Run htop, ps ux and pstree -panuch $LOGNAME.
Run ls -l /proc/$$ and examine the environment of your shell process.
Run kill -l and discuss the meaning of signals 1-15. Use signal(7) as a reference.
Run sleep 100&. From another terminal, send SIGSTOP to the parent process of the sleep process (the shell), then send SIGKILL to the sleep process. Run cat /proc/$PID/status where $PID is the process ID of the sleep process.
Run echo $$ to obtain the PID of an interactive shell that is running in a terminal. Send the SIGSTOP and SIGCONT signals to this PID from another terminal and see what happens when you type in the terminal that contains the stopped shell process.
The ping(8) utility catches SIGQUIT. In one terminal execute ping localhost. While this command runs in an endless loop, send SIGQUIT to the ping process from another terminal and see what happens.
Read kill(2) to learn what kill -9 -1 does. Run this command if you are brave.
Explain the difference between X=foo bar and X=foo; bar.
Run set and examine the environment variables of an interactive shell session.
In the command ls / /does-not-exist, redirect stdout and stderr to different files.
Run echo foo; echo bar 1>&2. Which circumstances guarantee that the "foo" line appears before the "bar" line in the output?
In the pipeline foo | bar, what is the buffering type of the file descriptor which corresponds to the stdin stream of the bar process?
Suppose foo is a log file which increases in size because some process keeps appending data to it. Explain why the command tail -f foo | while read; do echo new data; done does not work as expected. Fix the flaw by changing the buffering type with stdbuf(1).
After running sleep 100 > /dev/null &, examine the open files of the sleep process by looking at suitable files in /proc. Do the same with sleep 100 | head &.
Run ldd /bin/sh and explain what happens when a shell is executed.
Run cat /proc/$$/maps or pmap -x $$ to see the address space layout of your shell. Check Documentation/filesystems/proc.txt in the Linux kernel source tree for the format of /proc/$$/maps.
Run cat /proc/$$/smaps and examine the values of the heap section.
Find out how PR_SET_CHILD_SUBREAPER works and discuss possible use-cases for this (Linux) feature.
Read about POSIX threads (pthreads(7)) and explain how the Linux-specific clone(2) system call can be used to implement threads.
Guess what find /etc > /dev/null does, and why you get some error messages. Assume you'd like to extract only those error messages which contain the string "lvm". Explain why find /etc > /dev/null | grep lvm does not work as expected. Come up with a similar command that works.
The find command traverses the /etc directory recursively and
prints all files and directories it encounters during the traversal to
prints all files and directories it encounters during the traversal to
stdout. Since stdout is redirected to the NULL device by the >
/dev/null
construct, only the stderr stream containing the error
messages makes it to the terminal. This includes all subdirectories
of /etc
which cannot be traversed due to insufficient
permissions (no "r" bit set). The proposed find | grep
command does not work since the |
operator is evaluated
before any redirections specified by the find command
take place. More precisely, stdout of the find process is redirected
twice: First to one end of the pipe due to the |
,
then to the NULL device due to the > /dev/null
. The
last redirection "wins", so the grep
process does not
see any input. The command find /etc 2>&1 > /dev/null | grep
lvm
works. The following four redirections take place: First
stdout of the find
process and stdin of grep
process are redirected to the two ends of the pipe. Next, due to
the 2>&1
the stderr stream of the find
process is redirected to the current destination of stdout, i.e.,
to the pipe. Finally the > /dev/null
redirects stdout
of the find process to the NULL device. Hence error messages go to
the pipe and are processed by grep
as desired.
Run ulimit -n to see the maximal number of file descriptors you
are allowed to create. Explain what this limit means with respect
to multiple processes, multiple logins, and the fork(2) system
call. Write a program in your language of choice which creates file
descriptors in a loop until it fails due to the file descriptor
limit. Then print the number of file descriptors the program was able
to create.
When a process calls fork(2) to create a new process, the new
process inherits all open file descriptors from the parent. This can
lead to the situation where a newly created process is unable to open
any files. This property was actually used to break computer
security. The O_CLOEXEC flag was introduced not too long
ago to deal with this problem. See open(2) for details.
C program that opens the maximal possible number of file descriptors:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i;

    /* Open /dev/null repeatedly until open(2) fails, then report the count. */
    for (i = 0; open("/dev/null", O_RDONLY) >= 0; i++)
        ;
    printf("opened %d file descriptors\n", i);
    exit(0);
}
Read vm/overcommit-accounting in the kernel documentation. Discuss the pros and cons of the three possible overcommit handling modes.
Run bash -c 'cat /proc/$$/maps' repeatedly to see address space layout randomization in action. Discuss the pros and cons of ASLR.
#!/bin/sh
echo "changing CWD to $1"
cd "$1"
#!/bin/sh
echo "hello world"
#!/bin/sh
mkdir foo
touch foo/a
ln -s ../foo foo/testdir
ls -l foo/a foo/testdir/a foo/testdir/testdir/a