Notes from Linux Kernel Development
Reading Linux Kernel Development (Third Edition) by Robert Love. Some notes from this awesome book.
1 Introduction to the Linux Kernel
Activities of Linux Processes
These contexts represent the breadth of the kernel’s activities. In fact, in Linux, we can generalize that each processor is doing exactly one of three things at any given moment:
- In user-space, executing user code in a process
- In kernel-space, in process context, executing on behalf of a specific process
- In kernel-space, in interrupt context, not associated with a process, handling an interrupt
This list is inclusive. Even corner cases fit into one of these three activities: For example, when idle, it turns out that the kernel is executing an idle process in process context in the kernel.
Monolithic Kernel Versus Microkernel Designs
Linux is a monolithic kernel; that is, the Linux kernel executes in a single address space entirely in kernel mode. Linux, however, borrows much of the good from microkernels: Linux boasts a modular design, the capability to preempt itself (called kernel preemption), support for kernel threads, and the capability to dynamically load separate binaries (kernel modules) into the kernel image. Conversely, Linux has none of the performance-sapping features that curse microkernel design: Everything runs in kernel mode, with direct function invocation, rather than message passing, as the modus of communication. Nonetheless, Linux is modular, threaded, and the kernel itself is schedulable. Pragmatism wins again.
Linux Versus Classic Unix Kernels
As Linus and other kernel developers contribute to the Linux kernel, they decide how best to advance Linux without neglecting its Unix roots (and, more important, the Unix API). Consequently, because Linux is not based on any specific Unix variant, Linus and company can pick and choose the best solution to any given problem—or at times, invent new solutions! A handful of notable differences exist between the Linux kernel and classic Unix systems:
- Linux supports the dynamic loading of kernel modules. Although the Linux kernel is monolithic, it can dynamically load and unload kernel code on demand.
- Linux has symmetrical multiprocessor (SMP) support. Although most commercial variants of Unix now support SMP, most traditional Unix implementations did not.
- The Linux kernel is preemptive. Unlike traditional Unix variants, the Linux kernel can preempt a task even as it executes in the kernel. Of the other commercial Unix implementations, Solaris and IRIX have preemptive kernels, but most Unix kernels are not preemptive.
- Linux takes an interesting approach to thread support: It does not differentiate between threads and normal processes. To the kernel, all processes are the same; some just happen to share resources.
- Linux provides an object-oriented device model with device classes, hot-pluggable events, and a user-space device filesystem (sysfs).
- Linux ignores some common Unix features that the kernel developers consider poorly designed, such as STREAMS, or standards that are impossible to cleanly implement.
- Linux is free in every sense of the word. The feature set Linux implements is the result of the freedom of Linux’s open development model. If a feature is without merit or poorly thought out, Linux developers are under no obligation to implement it. To the contrary, Linux has adopted an elitist attitude toward changes: Modifications must solve a specific real-world problem, derive from a clean design, and have a solid implementation. Consequently, features of some other modern Unix variants that are more marketing bullet or one-off requests, such as pageable kernel memory, have received no consideration.
Despite these differences, however, Linux remains an operating system with a strong Unix heritage.
2 Getting Started with the Kernel
The Kernel Source Tree
Directories in the Root of the Kernel Source Tree
Directory | Description |
---|---|
arch | Architecture-specific source |
block | Block I/O layer |
crypto | Crypto API |
Documentation | Kernel source documentation |
drivers | Device drivers |
firmware | Device firmware needed to use certain drivers |
fs | The VFS and the individual filesystems |
include | Kernel headers |
init | Kernel boot and initialization |
ipc | Interprocess communication code |
kernel | Core subsystems, such as the scheduler |
lib | Helper routines |
mm | Memory management subsystem and the VM |
net | Networking subsystem |
samples | Sample, demonstrative code |
scripts | Scripts used to build the kernel |
security | Linux Security Module |
sound | Sound subsystem |
usr | Early user-space code (called initramfs) |
tools | Tools helpful for developing Linux |
virt | Virtualization infrastructure |
Configuring the Kernel
Text-based command-line utility:
$ make config
ncurses-based graphical utility:
$ make menuconfig
gtk+-based graphical utility:
$ make gconfig
A Beast of a Different Nature
These characteristics make the kernel a beast of a different nature. Some of the usual rules are bent; other rules are entirely new. Although some of the differences are obvious (we all know the kernel can do anything it wants), others are not so obvious. The most important of these differences are:
- The kernel has access to neither the C library nor the standard C headers.
- The kernel is coded in GNU C.
- The kernel lacks the memory protection afforded to user-space.
- The kernel cannot easily execute floating-point operations.
- The kernel has a small per-process fixed-size stack.
- Because the kernel has asynchronous interrupts, is preemptive, and supports SMP, synchronization and concurrency are major concerns within the kernel.
- Portability is important.
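To make the "coded in GNU C" point concrete, here is a minimal user-space sketch of one GNU C extension the kernel relies on: branch annotations built on `__builtin_expect`, in the style of the kernel's `likely()` and `unlikely()` macros. The `sum_bytes` helper is hypothetical and exists only for illustration; the sketch assumes a GCC-compatible compiler.

```c
#include <stddef.h>

/* Branch hints in the style of the kernel's likely()/unlikely() macros,
 * built on the GNU C extension __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sums a buffer, treating a NULL buffer as the
 * rare case so the compiler can lay out the common path first. */
static inline long sum_bytes(const unsigned char *buf, size_t len)
{
    long total = 0;

    if (unlikely(buf == NULL))
        return -1;              /* rare error path */

    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The hints change only how the compiler arranges the generated code, never the result, so they are safe to remove if a compiler does not support them.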
3 Process Management
The Process
A process is a program (object code stored on some media) in the midst of execution. Processes are, however, more than just the executing program code (often called the text section in Unix). They also include a set of resources such as open files and pending signals, internal kernel data, processor state, a memory address space with one or more memory mappings, one or more threads of execution, and a data section containing global variables. Processes, in effect, are the living result of running program code. The kernel needs to manage all these details efficiently and transparently.
Threads of execution, often shortened to threads, are the objects of activity within the process. Each thread includes a unique program counter, process stack, and set of processor registers. The kernel schedules individual threads, not processes. In traditional Unix systems, each process consists of one thread. In modern systems, however, multithreaded programs—those that consist of more than one thread—are common. As you will see later, Linux has a unique implementation of threads: It does not differentiate between threads and processes. To Linux, a thread is just a special kind of process.
Another name for a process is a task. The Linux kernel internally refers to processes as tasks.
Process Descriptor and the Task Structure
The kernel stores the list of processes in a circular doubly linked list called the task list. Each element in the task list is a process descriptor of the type struct task_struct, which is defined in <linux/sched.h>. The process descriptor contains all the information about a specific process.
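The shape of such a list can be sketched in user space. The following is a simplified illustration, not the kernel's code: `struct toy_task` is a hypothetical stand-in for struct task_struct, and the list helpers are modeled on the kernel's `struct list_head` idiom, in which the link node is embedded in the object and the enclosing object is recovered by pointer arithmetic.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, modeled on the kernel's
 * struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Toy stand-in for struct task_struct: only a PID and the list linkage. */
struct toy_task {
    int pid;
    struct list_head tasks;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A one-element ring points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert new_node just before head, i.e. at the tail of the ring. */
static void list_add_tail(struct list_head *new_node, struct list_head *head)
{
    new_node->prev = head->prev;
    new_node->next = head;
    head->prev->next = new_node;
    head->prev = new_node;
}

/* Walk the ring exactly once from 'start' and sum the PIDs seen. */
static int sum_pids(struct toy_task *start)
{
    int sum = start->pid;
    struct list_head *pos;

    for (pos = start->tasks.next; pos != &start->tasks; pos = pos->next)
        sum += container_of(pos, struct toy_task, tasks)->pid;
    return sum;
}
```

Because the list is circular, a walk can begin at any task and still visit every element once, which is why `sum_pids` gives the same answer regardless of the starting task.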
The task_struct is a relatively large data structure, at around 1.7 kilobytes on a 32-bit machine. This size, however, is quite small considering that the structure contains all the information that the kernel has and needs about a process. The process descriptor contains the data that describes the executing program: open files, the process’s address space, pending signals, the process’s state, and much more (see Figure 3.1).
Allocating the Process Descriptor
The task_struct structure is allocated via the slab allocator to provide object reuse and cache coloring (see Chapter 12). Prior to the 2.6 kernel series, struct task_struct was stored at the end of the kernel stack of each process. This allowed architectures with few registers, such as x86, to calculate the location of the process descriptor via the stack pointer without using an extra register to store the location. With the process descriptor now dynamically created via the slab allocator, a new structure, struct thread_info, was created that again lives at the bottom of the stack (for stacks that grow down) and at the top of the stack (for stacks that grow up). See Figure 3.2.
The thread_info structure is defined on x86 in <asm/thread_info.h> as
struct thread_info { ... };
Each task’s thread_info structure is allocated at the end of its stack. The task element of the structure is a pointer to the task’s actual task_struct.
Storing the Process Descriptor
The system identifies processes by a unique process identification value or PID. The PID is a numerical value represented by the opaque type pid_t, which is typically an int. Because of backward compatibility with earlier Unix and Linux versions, however, the default maximum value is only 32,768 (that of a short int), although the value optionally can be increased as high as four million (this is controlled in <linux/threads.h>). The kernel stores this value as pid inside each process descriptor. If the system is willing to break compatibility with old applications, the administrator may increase the maximum value via /proc/sys/kernel/pid_max.
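As a small illustration, the current limit can be read from user space. This sketch assumes a Linux system with procfs mounted at /proc; the function name `read_pid_max` is made up for this example, and it reports -1 rather than failing when the file is unavailable.

```c
#include <stdio.h>

/* Read the current PID limit from procfs.
 * Returns the value, or -1 if the file cannot be read or parsed
 * (for example on a non-Linux system). */
static long read_pid_max(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Raising the limit means writing a new value to the same file, which normally requires administrator privileges.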
The current macro dereferences the task member of thread_info to return the task_struct:
current_thread_info()->task;
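One way to see why placing thread_info at the base of the stack is convenient: if the stack is aligned to its own size, masking the low bits of any address inside it yields the base. The user-space sketch below simulates that idea. It assumes an 8 KB stack that grows down, and every `demo_` name is hypothetical; the union mirrors the general idea of overlaying the info structure on the stack, not the kernel's exact definitions.

```c
#include <stdint.h>

#define DEMO_THREAD_SIZE 8192   /* assumed stack size; must be a power of two */

/* Toy stand-ins for struct task_struct and struct thread_info. */
struct demo_task { int pid; };
struct demo_thread_info { struct demo_task *task; };

/* A fake kernel stack aligned to its own size (GNU C attribute), with the
 * thread_info overlaid on its lowest address. */
union demo_stack {
    struct demo_thread_info info;
    unsigned char bytes[DEMO_THREAD_SIZE];
} __attribute__((aligned(DEMO_THREAD_SIZE)));

static union demo_stack demo_kernel_stack;

/* Mask off the low bits of any address inside the stack to find its base,
 * which is where the thread_info lives. */
static struct demo_thread_info *demo_thread_info_of(const void *sp)
{
    uintptr_t base = (uintptr_t)sp & ~((uintptr_t)DEMO_THREAD_SIZE - 1);
    return (struct demo_thread_info *)base;
}

/* Counterpart of current_thread_info()->task for a simulated stack pointer. */
static struct demo_task *demo_current(const void *sp)
{
    return demo_thread_info_of(sp)->task;
}
```

No register is needed to remember where the descriptor is: the stack pointer alone is enough, at the cost of one mask and one pointer dereference.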
Process State
TASK_RUNNING
- The process is runnable; it is either currently running or on a runqueue waiting to run.
TASK_INTERRUPTIBLE
- The process is sleeping (that is, it is blocked), waiting for some condition to exist. When this condition exists, the kernel sets the process’s state to TASK_RUNNING.
TASK_UNINTERRUPTIBLE
- This state is identical to TASK_INTERRUPTIBLE except that it does not wake up and become runnable if it receives a signal.
__TASK_TRACED
- The process is being traced by another process, such as a debugger, via ptrace.
__TASK_STOPPED
- Process execution has stopped; the task is not running nor is it eligible to run. This occurs if the task receives the SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal or if it receives any signal while it is being debugged.
Manipulating the Current Process State
Kernel code often needs to change a process’s state. The preferred mechanism is using
set_task_state(task, state); /* set task 'task' to state 'state' */
The method set_current_state(state) is synonymous to set_task_state(current, state). See <linux/sched.h> for the implementation of these and related functions.
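The difference between the two sleeping states can be captured in a toy model. This is a deliberately simplified user-space sketch: all `demo_` and `DEMO_` names are hypothetical, the enum values are arbitrary rather than the kernel's constants, and the setter omits any memory ordering that a real implementation may need.

```c
#include <stdbool.h>

/* Toy process states; the numeric values are arbitrary. */
enum demo_state {
    DEMO_TASK_RUNNING,
    DEMO_TASK_INTERRUPTIBLE,
    DEMO_TASK_UNINTERRUPTIBLE,
    DEMO_TASK_TRACED,
    DEMO_TASK_STOPPED
};

struct demo_task { enum demo_state state; };

/* Toy counterpart of set_task_state(task, state). */
static void demo_set_task_state(struct demo_task *task, enum demo_state state)
{
    task->state = state;
}

/* The awaited condition occurred: either kind of sleeper becomes runnable. */
static void demo_wake_on_condition(struct demo_task *task)
{
    if (task->state == DEMO_TASK_INTERRUPTIBLE ||
        task->state == DEMO_TASK_UNINTERRUPTIBLE)
        demo_set_task_state(task, DEMO_TASK_RUNNING);
}

/* A signal arrived: only an interruptible sleeper is woken by it.
 * Returns true if the task was made runnable. */
static bool demo_wake_on_signal(struct demo_task *task)
{
    if (task->state != DEMO_TASK_INTERRUPTIBLE)
        return false;
    demo_set_task_state(task, DEMO_TASK_RUNNING);
    return true;
}
```

In this model a task in the uninterruptible state ignores `demo_wake_on_signal` entirely and leaves its sleep only through `demo_wake_on_condition`.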
The Process Family Tree
All processes are descendants of the init process, whose PID is one. The kernel starts init in the last step of the boot process. The init process, in turn, reads the system initscripts and executes more programs, eventually completing the boot process.
Process Creation
Most operating systems implement a spawn mechanism to create a new process in a new address space, read in an executable, and begin executing it. Unix takes the unusual approach of separating these steps into two distinct functions: fork() and exec().
The first, fork(), creates a child process that is a copy of the current task. It differs from the parent only in its PID (which is unique), its PPID (the parent’s PID), and certain resources and statistics, such as pending signals, which are not inherited.
The second function, exec(), loads a new executable into the address space and begins executing it.
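The two-step pattern looks like this from user space. This is a minimal sketch assuming a POSIX system with a shell at /bin/sh; the helper name `run_in_child` is made up for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, have it exec /bin/sh to run 'command', and return the
 * command's exit status (or -1 on failure). */
static int run_in_child(const char *command)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0) {
        /* Child: replace this copy of the parent with a new program. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);             /* reached only if exec failed */
    }

    /* Parent: wait for the child and report how it ended. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling `run_in_child("exit 7")` returns 7 in the parent: the child became the shell, and the shell's exit status came back through `waitpid()`.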
Copy-on-Write
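In Linux, fork() is implemented through the use of copy-on-write pages. Rather than duplicating the process address space, the parent and child share a single copy, and a page is duplicated only when one of them writes to it.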
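Copy-on-write itself is invisible to programs; what can be observed is only its guarantee, namely that a write made after fork() stays private to the process that made it. The sketch below checks that guarantee. It assumes a POSIX system, and the names `counter` and `parent_value_after_child_write` are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;

/* Returns the parent's view of 'counter' after the child has modified its
 * own copy, or -1 on failure. */
static int parent_value_after_child_write(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: under copy-on-write, this write is what triggers the
         * copy of the page holding 'counter'. */
        counter = 99;
        _exit(counter == 99 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;

    return counter;             /* the parent's copy was never modified */
}
```

The function returns 1 even though the child stored 99, and the child that calls exec() immediately after fork() never pays for copying pages it does not touch.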
Forking
Linux implements fork() via the clone() system call. This call takes a series of flags that specify which resources, if any, the parent and child process should share. (See “The Linux Implementation of Threads” section later in this chapter for more about the flags.) The fork(), vfork(), and __clone() library calls all invoke the clone() system call with the requisite flags. The clone() system call, in turn, calls do_fork().
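To illustrate that fork-like behavior is just one choice of flags, the sketch below invokes the raw clone() system call with only SIGCHLD in the flags argument and no new stack, so nothing is shared with the parent. It assumes Linux on an architecture such as x86-64 or arm64, where flags is the first argument of the raw system call; the helper name `fork_via_clone` is hypothetical. Because the C library's fork() bookkeeping is bypassed, the child does nothing but exit.

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a child with the raw clone() system call, passing SIGCHLD as the
 * only flag and a zero stack pointer. Returns the child's exit code, or -1
 * on failure. */
static int fork_via_clone(int child_exit_code)
{
    int status;
    long pid = syscall(SYS_clone, (long)SIGCHLD, 0L, 0L, 0L, 0L);

    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(child_exit_code); /* child: leave immediately */

    /* Parent: SIGCHLD in the flags makes the child waitable as usual. */
    if (waitpid((pid_t)pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Adding sharing flags to the same call is what turns the new task into something closer to a thread, which is the subject of the section referenced above.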
vfork()
The vfork() system call has the same effect as fork(), except that the page table entries of the parent process are not copied. Instead, the child executes as the sole thread in the parent’s address space, and the parent is blocked until the child either calls exec() or exits. The child is not allowed to write to the address space.
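In practice this means a vfork() child should do nothing except call exec() or _exit(). A minimal sketch of that discipline follows, assuming a POSIX system that still provides vfork(); the helper name `run_with_vfork` is made up for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program named by argv[0] with the given NULL-terminated argument
 * vector. Returns its exit status, 127 if exec failed, or -1 on error. */
static int run_with_vfork(char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: borrows the parent's address space, so it only execs
         * or exits and touches no variables. */
        execv(argv[0], argv);
        _exit(127);
    }

    /* Parent: resumes only after the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Note the use of `_exit()` rather than `exit()` in the child: running the parent's exit handlers or flushing its stdio buffers from the borrowed address space would disturb the parent.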