LKD3 Process Management

Chapter 3: Process Management

preview

What is a ‘process’; what processes do from their birth to death; and how to manage them.


The process

  • A process is an active program, also including:
    • open files
    • pending signals
    • internal kernel data
    • processor state
    • a memory address space
    • one or more threads of execution
    • a data section containing global variables
  • For Linux, a thread is a kind of process (not correct now?)
  • Processes provide a virtualized processor and virtualized memory
  • Threads share the virtual memory, but each receives its own virtualized processor
  • A process’s birth to death (sketched in the example below):
    • fork() (via the syscall clone())
    • exec() to execute a new, different program
    • exit() to terminate and free its resources -> becomes a zombie until the parent calls wait()
    • wait4() to wait for a child process to finish
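
A minimal user-space sketch of this lifecycle (error handling omitted; /bin/true is just a placeholder program):

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();          /* implemented via the clone() syscall */
    if (pid == 0) {
        /* child: load a new, different program */
        execl("/bin/true", "true", (char *)NULL);
        _exit(EXIT_FAILURE);     /* only reached if exec fails */
    }
    /* parent: reap the child; until then the child is a zombie */
    int status;
    waitpid(pid, &status, 0);    /* the wait() family maps onto wait4() */
    return 0;
}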

Process Descriptor and the Task Structure

Two key data structures:

//used to save the process descriptor
// <linux/sched.h> : 1170
struct task_struct {
    volatile long state;
    ...
};

//used to save thread info
//<asm/thread_info.h> : 26
struct thread_info {
    struct task_struct *task;          /* main task structure */
    struct exec_domain *exec_domain;   /* execution domain */
    __u32 flags;                       /* low level flags */
    __u32 status;                      /* thread synchronous flags */
    __u32 cpu;                         /* current CPU */
    int preempt_count;                 /* 0 => preemptable, <0 => BUG */
    mm_segment_t addr_limit;
    struct restart_block restart_block;
    void __user *sysenter_return;
#ifdef CONFIG_X86_32
    unsigned long previous_esp;        /* ESP of the previous stack in
                                          case of nested (IRQ) stacks */
    __u8 supervisor_stack[0];
#endif
    int uaccess_err;
};

thread_info is allocated at the end (bottom) of the process’s kernel stack, with a pointer back to the process descriptor in its task field.

PID

The PID is used to identify a process. Its maximum defaults to 32768 (the range of a short int) and can be raised to 4*1024*1024 via /proc/sys/kernel/pid_max. The function current_thread_info() locates the thread_info, through which the task_struct is reached.

#current_thread_info
movl $-8192, %eax #assuming the stack size is 8KB
andl %esp, %eax

current_thread_info()->task;
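
The same masking, rendered in C (a hypothetical illustration, not the kernel’s actual code; assumes an 8 KB kernel stack aligned to its size):

#define THREAD_SIZE 8192  /* assumed two-page kernel stack */

/* Hypothetical sketch: round the stack pointer down to the base of
 * the stack, which is where thread_info is allocated. $-8192 in the
 * assembly above is exactly ~(THREAD_SIZE - 1). */
static inline struct thread_info *current_thread_info_sketch(unsigned long sp)
{
    return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}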

Process State

Five different states (<linux/sched.h>:183):

  • TASK_RUNNING:
    • running or on a runqueue
    • the only state possible in user-space
    • can also apply in kernel-space
  • TASK_INTERRUPTIBLE:
    • sleeping (blocked)
    • waiting for a signal or condition to return to running
  • TASK_UNINTERRUPTIBLE:
    • same as TASK_INTERRUPTIBLE except that it does not wake up when receiving a signal
    • used when the task must run without interruption
    • less often used than TASK_INTERRUPTIBLE
  • __TASK_TRACED:
    • being traced by another process, such as a debugger
  • __TASK_STOPPED:
    • execution has stopped
    • entered when receiving SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU, or any signal while being debugged

Family Tree

All processes are descendants of the init process (PID 1). Starting init is the last step of the boot process.

Each process has exactly one parent, reachable via
struct task_struct *my_parent = current->parent;
We can also iterate over a process’s children via the list rooted at &current->children, as shown below.
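
A sketch of walking current’s children with the standard list helpers:

struct task_struct *task;
struct list_head *list;

list_for_each(list, &current->children) {
    task = list_entry(list, struct task_struct, sibling);
    /* task now points to one of current's children */
}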

The init process descriptor is &init_task

The task list is a circular, doubly linked list. We can find the next (or previous) task with
list_entry(task->tasks.next, struct task_struct, tasks)
or with the macro
next_task(task)

The macro for_each_process(task) is provided to iterate over the entire task list. Note: this is expensive.
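
For example, printing the name and PID of every task in the system:

struct task_struct *task;

for_each_process(task) {
    /* this pointlessly prints the name and PID of each task */
    printk("%s[%d]\n", task->comm, task->pid);
}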


Process Creation

Differences between the child and the parent after fork():

  • PID & PPID
  • certain resources and statistics, such as pending signals, which are not inherited

With copy-on-write pages, after fork() the child shares a single copy of the address space with the parent, avoiding wasteful duplication of resources.

So the only overhead of fork() is:

  1. the duplication of the parent’s page table
  2. the creation of a unique process descriptor

Forking

fork(), vfork(), __clone() --> clone() --> do_fork()

do_fork() calls copy_process() and then starts the process running.

  1. call dup_task_struct() to create a new kernel stack, thread_info, and task_struct
  2. check that the child does not exceed the resource limits for the current user
  3. initialize various members of the process descriptor
  4. set the state to TASK_UNINTERRUPTIBLE
  5. call copy_flags() to update the flags member
  6. call alloc_pid() to assign a new PID
  7. duplicate or share open files, filesystem information, signal handlers, the process address space, and the namespace
  8. after copy_process() returns successfully, the new child is woken up and run first. In the common case, the child calls exec() immediately, eliminating copy-on-write overhead.

vfork() does not copy the page table entries of the parent. The child executes in the parent’s address space, and the parent is blocked until the child calls exec() or exits. (Maybe rarely used now?)

task_struct->vfork_done points to a specific location; the parent waits for the child to signal it through this pointer. In mm_release(), called when a task exits a memory address space, vfork_done is checked and, if set, the parent is signaled.


The Linux Implementation of Threads

In Linux, each thread has a task_struct and appears as a normal process. Threads just happen to share some resources with other processes.

  • The creation of a thread: clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
  • The creation of a normal fork(): clone(SIGCHLD, 0);
  • The creation of a normal vfork(): clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);
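
A hedged user-space sketch of the first call above, using glibc’s clone() wrapper (the wrapper requires an explicit child stack; names here are illustrative, and a real thread library would additionally use CLONE_THREAD and proper synchronization):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

static int thread_fn(void *arg)
{
    /* runs sharing the parent's VM, fs info, files, and signal handlers */
    return 0;
}

int main(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);

    /* stacks grow down on x86, so pass the top of the allocation */
    clone(thread_fn, stack + stack_size,
          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, NULL);
    return 0;
}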

kernel thread

Kernel threads do not have an address space (their mm pointer is NULL). They operate only in kernel-space and never context switch into user-space. They are schedulable and preemptable like normal processes.

A new kernel thread can be created and immediately started via the macro kthread_run().

#define kthread_run(threadfn, data, namefmt, ...)                  \
({                                                                 \
    struct task_struct *k;                                         \
                                                                   \
    k = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__);   \
    if (!IS_ERR(k))                                                \
        wake_up_process(k);                                        \
    k;                                                             \
})

A kernel thread continues until it calls do_exit() or until another part of the kernel calls kthread_stop():

int kthread_stop(struct task_struct *k)
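
A minimal sketch of the usual pattern (worker_fn and the start/stop wrappers are hypothetical names; the kthread_* helpers are the real API from <linux/kthread.h>):

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    while (!kthread_should_stop()) {
        /* do one unit of work, then sleep for roughly a second */
        schedule_timeout_interruptible(HZ);
    }
    return 0;
}

static int start_worker(void)
{
    worker = kthread_run(worker_fn, NULL, "worker/%d", 0);
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void stop_worker(void)
{
    kthread_stop(worker);  /* blocks until worker_fn returns */
}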


Process Termination

Process termination is handled by do_exit(), defined in <kernel/exit.c>:

  1. set the PF_EXITING flag in the task_struct
  2. call acct_update_integrals() to write out accounting information
  3. call del_timer_sync() to remove any kernel timers
  4. call exit_mm() to release the mm_struct
  5. call exit_sem() to dequeue the task from any IPC semaphore queues
  6. call exit_files() & exit_fs()
  7. set the exit code in the task_struct; the parent process may need it
  8. call exit_notify() to send signals to the task’s parent, reparent its children to another thread in the thread group or to the init process, and set the task’s exit state to EXIT_ZOMBIE in exit_state
  9. call schedule(). The process will never execute again; do_exit() never returns.

    At this point the process still occupies its kernel stack, thread_info, and task_struct. After the parent retrieves this information, or notifies the kernel that it is uninterested, the memory is freed.

    The wait() family of functions is implemented via a single syscall, wait4(). It blocks the calling task until one of its children exits and returns the PID of the exited child; a pointer argument receives the child’s exit code.

    release_task() is invoked to deallocate the process descriptor:

    1. remove the process from the pidhash and from the task list:
      release_task() --> __exit_signal() --> __unhash_process() --> detach_pid()
    2. __exit_signal() also releases remaining resources and finalizes statistics and bookkeeping
    3. if the task was the last member of a thread group and the leader is a zombie, release_task() notifies the zombie leader’s parent
    4. call put_task_struct() to free the pages containing the process’s kernel stack and thread_info and deallocate the slab cache containing the task_struct

When a process with children exits, those children must be reparented to another process in the same thread group or to the init process:
do_exit() --> exit_notify() --> forget_original_parent() --> find_new_reaper()

  • find_new_reaper() first tries to find a reaper in the same thread group; failing that, it returns the init process.
  • forget_original_parent() then reparents all the child processes to the reaper:
    //<kernel/exit.c>:785 forget_original_parent()
    write_lock_irq(&tasklist_lock);
    reaper = find_new_reaper(father);

    list_for_each_entry_safe(p, n, &father->children, sibling) {
        struct task_struct *t = p;
        do {
            t->real_parent = reaper;
            if (t->parent == father) {
                BUG_ON(task_ptrace(t));
                t->parent = t->real_parent;
            }
            if (t->pdeath_signal)
                group_send_sig_info(t->pdeath_signal,
                                    SEND_SIG_NOINFO, t);
        } while_each_thread(p, t);
        reparent_leader(father, p, &dead_children);
    }
    write_unlock_irq(&tasklist_lock);

Conclusion

  • What is a process, the relationship between process and thread.
  • How Linux stores and represents processes: task_struct and thread_info
  • How processes are created: do_fork()
  • How new program images are loaded into address spaces: exec()
  • How processes die: do_exit()

---

Chapter 4: Process Scheduling

preview

  • cooperative multitasking
  • preemptive multitasking

In this chapter, we first discuss the fundamentals of scheduler design and how they apply to the Completely Fair Scheduler (CFS), then its goals, design, implementation, algorithms, and related system calls. The classic $O(1)$ scheduler model is also discussed.


Policy

Processes can be classified as:

  • I/O bound
  • processor bound

Two separate priority ranges:

  • nice value: from -20 to +19, default 0; lower values correspond to higher priority and thus a larger proportion of the system’s processor time.
    The command ps -el lists processes along with their nice values.
  • real-time priority: from 0 to 99, configurable by user or kernel; higher values correspond to greater priority.
    The command ps -eo state,uid,pid,ppid,rtprio,time,comm lists processes along with their real-time priority (the rtprio column).

On Linux, the amount of processor time a process receives is a function of the load of the system, and the proportion is further affected by each process’s nice value. Under the new CFS scheduler, the decision of which process to run next is a function of how much of its proportion of the processor the newly runnable process has already consumed.


The Linux Scheduling Algorithm

Some problems in Unix Process Scheduling:

  1. directly mapping nice values to timeslices leads to suboptimal switching behavior. E.g., with 2 processes at nice value 19, each gets a timeslice of 5 ms, forcing a context switch every 5 ms.
  2. “nicing down a process by one” has wildly different effects depending on the starting nice value. E.g., going from nice 0 (100 ms) to 1 (95 ms) barely changes anything, while going from 18 (10 ms) to 19 (5 ms) halves the timeslice.
  3. the timeslice must be some integer multiple of the timer tick, coupling scheduling behavior to the timer frequency.
  4. interactive tasks often need a priority boost on wakeup, which is hard to get right and open to abuse.

Fair Scheduling

Instead of using the nice value to calculate a timeslice, CFS uses the nice value to weight the proportion of processor a process is to receive.

The targeted latency is the period over which every runnable task is scheduled at least once; each task’s timeslice is its weighted share of that period.

CFS imposes a floor on each timeslice, called the minimum granularity, so that timeslices do not shrink indefinitely as the number of runnable tasks grows.

The nice values, instead of yielding additive increases to timeslices, yield geometric differences.
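
A worked example, assuming the kernel’s nice-to-weight table where nice 0 maps to a weight of 1024 and each nice step scales the weight by roughly 1.25:

$$\text{timeslice}_i = \frac{w_i}{\sum_j w_j} \times \text{targeted latency}$$

With a 20 ms targeted latency and two runnable tasks at nice 0 ($w \approx 1024$) and nice 5 ($w \approx 1024/1.25^5 \approx 335$), the nice-0 task receives about 15 ms and the nice-5 task about 5 ms. Shifting both tasks to nice 10 and 15 leaves the split unchanged, which is exactly what the geometric weighting buys.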


The Linux Scheduling Implementation

Time Accounting

CFS uses struct sched_entity, embedded in the process descriptor as the se member, to keep track of process accounting.

//<linux/sched.h>: 1090
struct sched_entity {
    struct load_weight load;  /* for load-balancing */
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;

    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;

    u64 last_wakeup;
    u64 avg_overlap;

    u64 nr_migrations;

    u64 start_runtime;
    u64 avg_wakeup;

    /* fields under CONFIG_SCHEDSTATS
     * ... */
};

vruntime stores the virtual runtime of a process: its actual runtime weighted by its load (nice value). The function update_curr() manages this accounting; it is invoked periodically by the system timer and also whenever a process becomes runnable or blocks.

//<kernel/sched_fair.c>: 502
//update_curr() --> __update_curr()
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
              unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);

    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}

Process Selection

The process with the smallest vruntime will run.

CFS uses a red-black tree (rbtree) to manage the set of runnable processes, keyed by vruntime.

The function pick_next_entity() finds the next process to run. Note that the leftmost node (the next entity to run) is cached at cfs_rq->rb_leftmost.

//  <kernel/sched_fair.c>
// pick_next_entity() --> __pick_next_entity()
static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL; /* in this case CFS schedules the idle task */

    return rb_entry(left, struct sched_entity, run_node);
}

enqueue_entity() and dequeue_entity() add processes to and remove them from the tree, but the actual rbtree manipulation is done by __enqueue_entity() and __dequeue_entity(). Both paths also update run statistics such as vruntime, and __enqueue_entity() maintains the rb_leftmost cache, as shown below.
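
For reference, the insertion path (lightly abridged from <kernel/sched_fair.c>; details vary across kernel versions):

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);
    int leftmost = 1;

    /* find the right place in the rbtree */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /* nodes with the same key stay together */
        if (key < entity_key(cfs_rq, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /* maintain the cache of the leftmost (next-to-run) entry */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}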

The Scheduler Entry Point