Notes from Linux Kernel Development

Reading Linux Kernel Development (Third Edition) by Robert Love. Some notes from this awesome book.

Notes from Linux Kernel Development

1 Introduction to the Linux Kernel

Activities of Linux Processes

These contexts represent the breadth of the kernel’s activities. In fact, in Linux, we can generalize that each processor is doing exactly one of three things at any given moment:

  • In user-space, executing user code in a process
  • In kernel-space, in process context, executing on behalf of a specific process
  • In kernel-space, in interrupt context, not associated with a process, handling an interrupt

This list is inclusive. Even corner cases fit into one of these three activities: For example, when idle, it turns out that the kernel is executing an idle process in process context in the kernel.

Monolithic Kernel Versus Microkernel Designs

Linux is a monolithic kernel; that is, the Linux kernel executes in a single address space entirely in kernel mode. Linux, however, borrows much of the good from microkernels: Linux boasts a modular design, the capability to preempt itself (called kernel preemption), support for kernel threads, and the capability to dynamically load separate binaries (kernel modules) into the kernel image. Conversely, Linux has none of the performance-sapping features that curse microkernel design: Everything runs in kernel mode, with direct function invocation not message passing—the modus of communication. Nonetheless, Linux is modular, threaded, and the kernel itself is schedulable. Pragmatism wins again.

Linux Versus Classic Unix Kernels

As Linus and other kernel developers contribute to the Linux kernel, they decide how best to advance Linux without neglecting its Unix roots (and, more important, the Unix API). Consequently, because Linux is not based on any specific Unix variant, Linus and company can pick and choose the best solution to any given problem—or at times, invent new solutions! A handful of notable differences exist between the Linux kernel and classic Unix systems:

  • Linux supports the dynamic loading of kernel modules. Although the Linux kernel is monolithic, it can dynamically load and unload kernel code on demand.

  • Linux has symmetrical multiprocessor (SMP) support. Although most commercial variants of Unix now support SMP, most traditional Unix implementations did not.

  • The Linux kernel is preemptive. Unlike traditional Unix variants, the Linux kernel can preempt a task even as it executes in the kernel. Of the other commercial Unix implementations, Solaris and IRIX have preemptive kernels, but most Unix kernels are not preemptive.

  • Linux takes an interesting approach to thread support: It does not differentiate between threads and normal processes. To the kernel, all processes are the same— some just happen to share resources.

  • Linux provides an object-oriented device model with device classes, hot-pluggable events, and a user-space device filesystem (sysfs).

  • Linux ignores some common Unix features that the kernel developers consider poorly designed, such as STREAMS, or standards that are impossible to cleanly implement.

  • Linux is free in every sense of the word. The feature set Linux implements is the result of the freedom of Linux’s open development model. If a feature is without merit or poorly thought out, Linux developers are under no obligation to implement it. To the contrary, Linux has adopted an elitist attitude toward changes: Modifications must solve a specific real-world problem, derive from a clean design, and have a solid implementation. Consequently, features of some other modern Unix variants that are more marketing bullet or one-off requests, such as pageable kernel memory, have received no consideration.

Despite these differences, however, Linux remains an operating system with a strong Unix heritage.

2 Getting Started with the Kernel

The Kernel Source Tree

Directories in the Root of the Kernel Source Tree

Directory Description
arch Architecture-specific source
block Block I/O layer
crypto Crypto API
Documentation Kernel source documentation
drivers Device drivers
firmware Device firmware needed to use certain drivers
fs The VFS and the individual filesystems
include Kernel headers
init Kernel boot and initialization
ipc Interprocess communication code
kernel Core subsystems, such as the scheduler
lib Helper routines
mm Memory management subsystem and the VM
net Networking subsystem
samples Sample, demonstrative code
scripts Scripts used to build the kernel
security Linux Security Module
sound Sound subsystem
usr Early user-space code (called initramfs)
tools Tools helpful for developing Linux
virt Virtualization infrastructure

Configuring the Kernel

text-based command-line utility:

1
$ make config

ncurses-based graphical utility:

1
$ make menuconfig

gtk+-based graphical utility:

1
$ make gconfig

A Beast of a Different Nature

These characteristics make the kernel a beast of a different nature. Some of the usual rules are bent; other rules are entirely new. Although some of the differences are obvious (we all know the kernel can do anything it wants), others are not so obvious. The most important of these differences are

  • The kernel has access to neither the C library nor the standard C headers.
  • The kernel is coded in GNU C.
  • The kernel lacks the memory protection afforded to user-space.
  • The kernel cannot easily execute floating-point operations.
  • The kernel has a small per-process fixed-size stack.
  • Because the kernel has asynchronous interrupts, is preemptive, and supports SMP, synchronization and concurrency are major concerns within the kernel.
  • Portability is important.

3 Process Management

The Process

A process is a program (object code stored on some media) in the midst of execution. Processes are, however, more than just the executing program code (often called the text section in Unix).They also include a set of resources such as open files and pending signals, internal kernel data, processor state, a memory address space with one or more memory mappings, one or more threads of execution, and a data section containing global variables. Processes, in effect, are the living result of running program code. The kernel needs to manage all these details efficiently and transparently.

Threads of execution, often shortened to threads, are the objects of activity within the process. Each thread includes a unique program counter, process stack, and set of processor registers. The kernel schedules individual threads, not processes. In traditional Unix systems, each process consists of one thread. In modern systems, however, multithreaded programs—those that consist of more than one thread—are common. As you will see later, Linux has a unique implementation of threads: It does not differentiate between threads and processes. To Linux, a thread is just a special kind of process.

Another name for a process is a task. The Linux kernel internally refers to processes as tasks.

Process Descriptor and the Task Structure

The kernel stores the list of processes in a circular doubly linked list called the task list. Each element in the task list is a process descriptor of the type struct task_struct, which is defined in<linux/sched.h>.The process descriptor contains all the information about a specific process.

The task_struct is a relatively large data structure, at around 1.7 kilobytes on a 32-bit machine. This size, however, is quite small considering that the structure contains all the information that the kernel has and needs about a process. The process descriptor contains the data that describes the executing program open files, the process’s address space, pending signals, the process’s state, and much more (see Figure 3.1).

Allocating the Process Descriptor

The task_struct structure is allocated via the slab allocator to provide object reuse and cache coloring (see Chapter 12). Prior to the 2.6 kernel series, struct task_struct was stored at the end of the kernel stack of each process. This allowed architectures with few registers, such as x86, to calculate the location of the process descriptor via the stack pointer without using an extra register to store the location. With the process descriptor now dynamically created via the slab allocator, a new structure, struct thread_info, was created that again lives at the bottom of the stack (for stacks that grow down) and at the top of the stack (for stacks that grow up). See Figure 3.2.

The thread_info structure is defined on x86 in <asm/thread_info.h> as

1
2
3
4
5
6
7
8
9
10
11
12
struct thread_info {
struct task_struct *task;
struct exec_domain *exec_domain;
__u32 flags;
__u32 status;
__u32 cpu;
int preempt_count;
mm_segment_t addr_limit;
struct restart_block restart_block;
void *sysenter_return;
int uaccess_err;
};

Each task’s thread_info structure is allocated at the end of its stack. The task element of the structure is a pointer to the task’s actual task_struct.

Storing the Process Descriptor

The system identifies processes by a unique process identification value or PID. The PID is a numerical value represented by the opaque type pid_t, which is typically an int. Because of backward compatibility with earlier Unix and Linux versions, however, the default maximum value is only 32,768 (that of a short int, although the value optionally can be increased as high as four million (this is controlled in <linux/threads.h>.The kernel stores this value as pid inside each process descriptor. If the system is willing to break compatibility with old applications, the administrator may increase the maximum value via /proc/sys/kernel/pid_max.

current dereferences the task member of thread_info to return the task_struct:

1
current_thread_info()->task;

Process State

  • TASK_RUNNING
    • The process is runnable; it is either currently running or on a runqueue waiting to run.
  • ASK_INTERRUPTIBLE
    • The process is sleeping (that is, it is blocked), waiting for some condition to exist. When this condition exists, the kernel sets the process’s state to TASK_RUNNING.
  • TASK_UNINTERRUPTIBLE
    • This state is identical to TASK_INTERRUPTIBLE except that it does not wake up and become runnable if it receives a signal.
  • __TASK_TRACED
    • The process is being traced by another process, such as a debugger, via ptrace.
  • __TASK_STOPPED
    • Process execution has stopped; the task is not running nor is it eligible to run. This occurs if the task receives the SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal or if it receives any signal while it is being debugged.

Manipulating the Current Process State

Kernel code often needs to change a process’s state. he preferred mechanism is using

1
set_task_state(task, state); /* set task ‘task’ to state ‘state’ */

The method set_current_state(state) is synonymous to set_task_state(current, state). See <linux/sched.h> for the implementation of these and related functions.

The Process Family Tree

All processes are descendants of the init process, whose PID is one. The kernel starts init in the last step of the boot process. The init process, in turn, reads the system initscripts and executes more programs, eventually completing the boot process.

Process Creation

Most operating systems implement a spawn mechanism to create a new process in a new address space, read in an executable, and begin executing it. Unix takes the unusual approach of separating these steps into two distinct functions: fork()and exec().

The first, fork(), creates a child process that is a copy of the current task. It differs from the parent only in its PID (which is unique)

The second function, exec(), loads a new executable into the address space and begins executing it.

Copy-on-Write

In Linux, fork() is implemented through the use of copy-on-write pages.

Forking

Linux implementsfork() via the clone() system call.

This call takes a series of flags that specify which resources, if any, the parent and child process should share. (See “The Linux Implementation of Threads” section later in this chapter for more about the flags.) The fork(), vfork(), and __clone() library calls all invoke the clone() system call with the requisite flags. The clone() system call, in turn, calls do_fork().

vfork()

The vfork() system call has the same effect as fork(), except that the page table entries of the parent process are not copied. Instead, the child executes as the sole thread in the parent’s address space, and the parent is blocked until the child either calls exec() or exits. The child is not allowed to write to the address space.