Chapter 3: Process Management
preview
What a process is; what processes do from their birth to death; and how to manage them.
The process
- A process is an active program, which also includes:
  - open files
  - pending signals
  - internal kernel data
  - processor state
  - a memory address space
  - one or more threads of execution (threads)
  - a data section containing global variables
- To Linux, a thread is a special kind of process (is this still accurate?)
- Processes provide a virtualized processor and virtualized memory
- Threads share the virtual memory, but each receives its own virtualized processor
- A process's birth to death:
  - fork() (via the clone() syscall)
  - exec() to execute a new, different program
  - exit() to terminate and free its resources; the process becomes a zombie until the parent calls wait()
  - wait4() to wait for a child process to finish
Process Descriptor and the Task Structure
Several data structures are involved:

```c
// used to save the process descriptor
// <linux/sched.h> : 1170
struct task_struct {
    volatile long state;
    ...
};

// used to save thread info
// <asm/thread_info.h> : 26
struct thread_info {
    struct task_struct   *task;           /* main task structure */
    struct exec_domain   *exec_domain;    /* execution domain */
    __u32                 flags;          /* low level flags */
    __u32                 status;         /* thread synchronous flags */
    __u32                 cpu;            /* current CPU */
    int                   preempt_count;  /* 0 => preemptable,
                                             <0 => BUG */
    mm_segment_t          addr_limit;
    struct restart_block  restart_block;
    void __user          *sysenter_return;
    unsigned long         previous_esp;   /* ESP of the previous stack in
                                             case of nested (IRQ) stacks */
    __u8                  supervisor_stack[0];
    int                   uaccess_err;
};
```
Each task's thread_info is allocated at the end of its kernel stack and contains a pointer to the process descriptor.
PID
The PID is used to identify a process. Its maximum value defaults to 32768 (the range of a short int) and can be raised as high as 4*1024*1024. The function current_thread_info() locates the thread_info, from which the task_struct can be found:

```asm
# current_thread_info (x86)
movl $-8192, %eax    # assuming the stack size is 8KB
andl %esp, %eax
```

```c
current_thread_info()->task;
```
Process State
Five different states:

- TASK_RUNNING:
  - running or on a runqueue
  - the only possible state in user-space
  - can also apply in kernel-space
- TASK_INTERRUPTIBLE:
  - sleeping (blocked)
  - waiting for a signal or condition to return to running
- TASK_UNINTERRUPTIBLE:
  - same as interruptible, except that it does not wake up when it receives a signal
  - used when the task must run without interruption
  - used less often than interruptible
- __TASK_TRACED:
  - being traced by another process, such as a debugger
- __TASK_STOPPED:
  - execution has stopped
  - entered on receiving SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU, or any signal while being debugged
Family Tree
All processes are descendants of the init process (PID 1). Starting init is the last step of the boot process.
Each process has exactly one parent. We can find it with `struct task_struct *my_parent = current->parent;`
We can also iterate over a process's children via `current->children`.
The init process descriptor is `&init_task`.
The task list is a circular, doubly linked list. We can find the next (or previous) task with `list_entry(task->tasks.next, struct task_struct, tasks)` or the macro `next_task(task)`.
The macro `for_each_process(task)` is provided to iterate over the entire task list. Note: this is expensive.
Process Creation
Differences between the child and parent process after fork():

- PID & PPID
- some resources, such as pending signals

With copy-on-write pages, after fork() the child shares the same pages with the parent, avoiding wasteful copying. So the only overhead in fork() is:

- the duplication of the parent's page tables
- the creation of a unique process descriptor
Forking
fork(), vfork(), __clone() --> clone() --> do_fork()

do_fork() calls copy_process() and then starts the process running. copy_process():

- calls dup_task_struct() to create a new kernel stack, thread_info, and task_struct
- checks that the child does not exceed the resource limits for the current user
- initializes some members of the process descriptor
- sets the state to TASK_UNINTERRUPTIBLE
- calls copy_flags() to update the flags member
- calls alloc_pid() to assign a new PID
- duplicates or shares open files, filesystem information, signal handlers, the process address space, and the namespace

After copy_process() returns successfully, the new child is woken up and runs first. In the common case, the child calls exec() immediately, eliminating copy-on-write overhead.
vfork() does not copy the page table entries of the parent process. The child executes in the parent's address space, and the parent is blocked until the child calls exec() or exits. (Maybe rarely useful now?)
task_struct->vfork_done points to a specific location, and the parent waits for the child to signal it through this pointer. In mm_release(), where a task exits a memory address space, vfork_done is checked in order to signal the parent.
The Linux Implementation of Threads
In Linux, each thread has a task_struct and appears as a normal process; threads just happen to share some resources with other processes.

- Creating a thread: `clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);`
- A normal fork(): `clone(SIGCHLD, 0);`
- A normal vfork(): `clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);`
Kernel Threads
Kernel threads do not have an address space. They operate only in kernel-space and never context switch into user-space. They are schedulable and preemptable.
A new task can be created through the macro kthread_run():

```c
// <linux/kthread.h>
#define kthread_run(threadfn, data, namefmt, ...)                  \
({                                                                 \
    struct task_struct *k;                                         \
                                                                   \
    k = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__);   \
    if (!IS_ERR(k))                                                \
        wake_up_process(k);                                        \
    k;                                                             \
})
```
A kernel thread continues until it calls do_exit() or until kthread_stop() is called:

```c
int kthread_stop(struct task_struct *k);
```
Process Termination
In do_exit(), the kernel:

- sets the PF_EXITING flag in the task_struct
- calls acct_update_integrals() to write out accounting information
- calls del_timer_sync() to remove any kernel timers
- calls exit_mm() to release the mm_struct
- calls exit_sem() to dequeue the IPC semaphore
- calls exit_files() & exit_fs()
- sets the exit code in the task_struct; the parent process may need it
- calls exit_notify() to send signals to the task's parent, reparent its children to another thread in their thread group or to the init process, and set the task's exit state to EXIT_ZOMBIE in exit_state in the task_struct
- calls schedule(). The process will never execute again; do_exit() never returns.

Now the process still occupies its kernel stack, thread_info, and task_struct. After the parent retrieves the info or notifies the kernel that it is uninterested, this memory is freed.

The wait() family of functions is implemented via a single syscall, wait4(). It blocks the calling task until one of its children exits and returns the PID of the exited child. A pointer passed in receives the exit code.

release_task() is invoked to deallocate the process descriptor. It:
is invoked to deallocate the process descriptor.- remove the process from the pidhash and remove the process from the task list:
release_task() --> __exit_signal() --> unhashed_process --> detach_pid()
__exit_signal()
also release remaining resources and finalizes statistics and bopkkeeping- if the task was the last member of a group, and a leader is a zombie, then
release_task()
notifies the leader’s parent. - call
put_task_struct()
to free the pages containing the process’s kernel stack andthread_info
and deallocate the slab cache containing thetask_struct
- remove the process from the pidhash and remove the process from the task list:
When a process with children exits, its children must be reparented to another process in the same thread group or to the init process: do_exit() --> exit_notify() --> forget_original_parent() --> find_new_reaper()

- find_new_reaper() first tries to find a reaper in the same thread group; if that fails, it returns the init process.
- forget_original_parent() then reparents all the child processes to the reaper.
```c
// <kernel/exit.c>:785 forget_original_parent()
write_lock_irq(&tasklist_lock);
reaper = find_new_reaper(father);

list_for_each_entry_safe(p, n, &father->children, sibling) {
    struct task_struct *t = p;
    do {
        t->real_parent = reaper;
        if (t->parent == father) {
            BUG_ON(task_ptrace(t));
            t->parent = t->real_parent;
        }
        if (t->pdeath_signal)
            group_send_sig_info(t->pdeath_signal,
                                SEND_SIG_NOINFO, t);
    } while_each_thread(p, t);
    reparent_leader(father, p, &dead_children);
}
write_unlock_irq(&tasklist_lock);
```
Conclusion
- What a process is, and the relationship between processes and threads.
- How Linux stores and represents processes: task_struct and thread_info
- How processes are created: do_fork()
- How images are loaded into address spaces: exec()
- How processes die: do_exit()
---
Chapter 4: Process Scheduling
preview
- cooperative multitasking
- preemptive multitasking
In this chapter, we first discuss the fundamentals of scheduler design and how they apply to the Completely Fair Scheduler (CFS), then its goals, design, implementation, algorithms, and related syscalls. The classic $O(1)$ scheduler model is also discussed.
Policy
Processes can be classified as:
- I/O bound
- processor bound
Two separate priority ranges:

- nice value: from -20 to 19, default 0; a lower value corresponds to a higher priority and thus a larger proportion of the system's processor. The command `ps -el` lists the processes and their nice values.
- real-time priority: from 0 to 99, configurable by user or kernel; higher values correspond to a greater proportion. The command `ps -eo state,uid,pid,ppid,rtprio,time,comm` lists the processes and their real-time priorities.
On Linux, the amount of processor time a process receives is a function of the load of the system, with the proportion further weighted by each process's nice value. Under the new CFS scheduler, the decision of which process to run next is a function of how much of its proportion of the processor the newly runnable process has already consumed.
The Linux Scheduling Algorithm
Some problems in Unix Process Scheduling:
- directly mapping nice values to timeslices leads to suboptimal switching behavior. E.g., two processes with nice value 19 each get a timeslice of 5 ms, forcing a context switch every 5 ms.
- "nicing down a process by one" has wildly different effects depending on the starting nice value. E.g.: 0 --> 100 ms vs. 1 --> 95 ms (almost the same), but 18 --> 10 ms vs. 19 --> 5 ms (a factor of two).
- the timeslice must be some integer multiple of the timer tick.
- interactive tasks sometimes need a priority boost when they wake up.
Fair Scheduling
Instead of using the nice value to calculate a timeslice, CFS uses the nice value to weight the proportion of processor a process is to receive.
The targeted latency is the period over which all the timeslices together run; each task gets its weighted fraction of it.
CFS keeps a floor (the minimum granularity) on the minimum timeslice.
The nice values, instead of yielding additive increases to timeslices, yield geometric differences.
The Linux Scheduling Implementation
Time Accounting
CFS uses struct sched_entity to keep track of process accounting:

```c
// <linux/sched.h>: 1090
struct sched_entity { ... };
```

vruntime stores the weighted runtime of a process. The function update_curr() manages this accounting. It is invoked periodically by the system timer and whenever a process becomes runnable or blocks:

```c
// <kernel/sched_fair.c>: 502
static void update_curr(struct cfs_rq *cfs_rq) { ... }
```
Process Selection
The process with the smallest vruntime runs next.

CFS uses an rbtree (red-black tree) to manage the set of runnable processes, keyed by vruntime.

The function pick_next_entity() finds the next process to run. Notice that the leftmost node, the next entity, is cached at cfs_rq->rb_leftmost:

```c
// <kernel/sched_fair.c>
// pick_next_entity() --> __pick_next_entity()
static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL; /* in this case CFS schedules the idle task */

    return rb_entry(left, struct sched_entity, run_node);
}
```
enqueue_entity() and dequeue_entity() add or remove processes from the tree, but the actual tree manipulation is done in __enqueue_entity() and __dequeue_entity(), which must also keep the cached rb_leftmost value up to date.
The Scheduler Entry Point