Analysis of Process Switching on MIPS

The code in this report is based on Linux kernel 2.6.34.

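Overview

The Linux kernel adopted the $O(1)$ scheduler in the 2.5 series. Because it performed poorly on systems with many interactive processes, it was replaced by the Completely Fair Scheduler (CFS) starting with 2.6.23.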

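In a preemptive kernel, a process can give up the CPU voluntarily by yielding, and the kernel can also preempt it. The thread_info structure of every process carries a need_resched flag (the TIF_NEED_RESCHED bit of its flags field); when this flag is set, the process should be switched out.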

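Only the kernel can perform preemption. For a user process, the kernel therefore checks need_resched just before returning to user mode from an interrupt or a system call, and switches processes if the flag is set. For code running in kernel mode, the kernel can likewise reschedule as long as doing so is safe, where "safe" means that no locks are held. thread_info contains a preempt_count field that is incremented each time a lock is taken and decremented when it is released; a value of 0 means the task can be preempted. So when an interrupt handler returns to kernel mode, the kernel invokes the scheduler if need_resched is set and preempt_count is 0. A task can of course also call schedule() explicitly, or end up in schedule() because it blocks.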
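The two checks described above can be sketched in C. This is a minimal model, not kernel code: struct toy_thread_info and all toy_* functions are hypothetical names, and the real need_resched is a flag bit rather than a separate boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's thread_info. */
struct toy_thread_info {
	bool need_resched;	/* models the need_resched flag */
	int preempt_count;	/* > 0 while locks are held */
};

/* Taking a lock makes the task non-preemptible. */
static void toy_lock(struct toy_thread_info *ti)
{
	ti->preempt_count++;
}

/* Releasing the last lock makes the task preemptible again. */
static void toy_unlock(struct toy_thread_info *ti)
{
	assert(ti->preempt_count > 0);
	ti->preempt_count--;
}

/* Return to user mode: only need_resched is consulted. */
static bool toy_resched_on_return_to_user(const struct toy_thread_info *ti)
{
	return ti->need_resched;
}

/* Return to kernel mode from an interrupt: rescheduling must also be safe. */
static bool toy_resched_on_return_to_kernel(const struct toy_thread_info *ti)
{
	return ti->need_resched && ti->preempt_count == 0;
}
```

With need_resched set, toy_resched_on_return_to_kernel() is false between toy_lock() and the matching toy_unlock(), and true again afterwards.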

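The entry point of process scheduling is kernel/sched.c:schedule(). Once the preparation is done and the next process to run has been chosen, most of the work is carried out by kernel/sched.c:context_switch(), which has two main jobs: switch_mm() switches the address space, and switch_to() switches the stack and the register state. By the time switch_to() returns, it is already the switched-in process that is running.

This report focuses on context_switch() and switch_to(). The former is generic code; the latter is architecture-specific.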

Context Switch

The code of context_switch() is shown below. The places that need explanation are marked with numbers and analyzed after the listing.

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next) // ----(1)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	trace_sched_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;    // ----(2)
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	if (likely(!mm)) {          // ----(3)
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);        // ----(4)
	} else
		switch_mm(oldmm, mm, next);         // ----(5)

	if (likely(!prev->mm)) {                // ----(6)
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
	/*
	 * Since the runqueue lock will be released by the next
	 * task (which is an invalid locking op but in the case
	 * of the scheduler it's an obvious special-case), so we
	 * do an early lockdep release here:
	 */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);            // ----(7)

	barrier();          // ----(8)
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);        // ----(9)
}

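(1) Once the scheduler has chosen the next process, the arguments of context_switch() are fixed:
- rq: the run queue of the CPU on which the switch takes place
- prev: the process being preempted and switched out (called A below)
- next: the process being switched in (called B below)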
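(2) The variable mm points to B's address-space descriptor; oldmm points to the address-space descriptor that A is currently using (active_mm). For a user process, the mm and active_mm members of its task descriptor (task_struct) are identical and both point to the process's own address space. For a kernel thread, mm is NULL because a kernel thread has no address space of its own, yet it still needs some address space whenever it runs, and active_mm points to the one it has borrowed.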
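(3) If mm is NULL, B is a kernel thread and can only borrow the address space that A is currently using (prev->active_mm). It cannot borrow A's own address space (prev->mm), because A may itself be a kernel thread without an address-space descriptor of its own.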
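(4) If the task being switched in is a kernel thread, there is no need to flush the TLB for now: kernel threads do not access user space, so stale TLB entries cannot affect their execution. In this case the architecture-specific enter_lazy_tlb() is called to mark the CPU as being in lazy TLB mode.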
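(5) switch_mm() is called only when B is an ordinary process with its own address space. Note that execution is in the kernel address space at this point, which is mapped the same way in every process, so switching the address space does not disturb the running code.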
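(6) If A is a kernel thread, the address space it borrowed (active_mm) is no longer needed by it. In addition, oldmm is recorded in rq->prev_mm, the mm_struct last used on this run queue. After the switch, finish_task_switch() runs in the context of the switched-in task and drops the reference to that mm with mmdrop().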
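(7) This is the core of the process switch: it saves A's stack and registers and restores B's. On x86-32 it consists of a piece of inline assembly and a C function (__switch_to), and switches eip by faking a call with push + jmp + ret. Note that switch_to takes three arguments: after it returns, prev holds the task that ran immediately before the current one, so that it can be used in the cleanup that follows.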
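(8) barrier() is a compiler barrier. It prevents the compiler from reordering memory accesses across this point and forces it to reload any values it was keeping in registers, which separates the code before the switch from the code after it. It does not flush the hardware cache.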

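(9) Some cleanup is performed. If the previous task was a kernel thread, mmdrop() releases the address space it had borrowed from another process. If the previous task has exited, it is disposed of accordingly. The relevant parts of finish_task_switch() are: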

static void finish_task_switch(struct rq *rq, struct task_struct *prev) __releases(rq->lock)
{
	struct mm_struct *mm = rq->prev_mm;
	long prev_state;

	rq->prev_mm = NULL;

	prev_state = prev->state;
	finish_arch_switch(prev);
        ...
	if (mm)
		mmdrop(mm);
	if (unlikely(prev_state == TASK_DEAD)) {
		kprobe_flush_task(prev);
		put_task_struct(prev);
	}
}
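The borrowing rules in notes (2), (3), (6) and (9) can be replayed on a toy model. This is a sketch under simplifying assumptions: the toy_* types and functions are hypothetical, a plain int stands in for the atomic mm_count, and switch_mm(), enter_lazy_tlb() and switch_to() are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_mm { int mm_count; };
struct toy_task { struct toy_mm *mm; struct toy_mm *active_mm; };
struct toy_rq { struct toy_mm *prev_mm; };

/* Mirrors only the mm bookkeeping of context_switch(). */
static void toy_context_switch(struct toy_rq *rq, struct toy_task *prev,
			       struct toy_task *next)
{
	struct toy_mm *mm = next->mm;
	struct toy_mm *oldmm = prev->active_mm;

	assert(oldmm != NULL);		/* a running task always has one */
	if (!mm) {			/* next is a kernel thread */
		next->active_mm = oldmm;	/* borrow what prev is using */
		oldmm->mm_count++;	/* models atomic_inc(&oldmm->mm_count) */
	}
	/* else: the kernel would call switch_mm(oldmm, mm, next) here */

	if (!prev->mm) {		/* prev was a kernel thread */
		prev->active_mm = NULL;	/* it stops borrowing */
		rq->prev_mm = oldmm;	/* reference dropped after the switch */
	}
}

/* Mirrors only the mm cleanup of finish_task_switch(). */
static void toy_finish_task_switch(struct toy_rq *rq)
{
	struct toy_mm *mm = rq->prev_mm;

	rq->prev_mm = NULL;
	if (mm)
		mm->mm_count--;		/* models mmdrop(mm) */
}
```

Switching from a user process to a kernel thread raises the count of the user's mm by one; switching from that kernel thread to another user process hands the mm to rq->prev_mm, and the following toy_finish_task_switch() lowers the count again.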

Switch To

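On MIPS, switch_to() does its work by calling the leaf function resume. Note that the argument last is modified. As described above, when switching from process A to process B, A's task pointer enters switch_to as prev, is carried across the switch in a register, and emerges in B's context as last. It is handed back to context_switch() as a pointer so that B can find out which process ran before it.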

#define switch_to(prev, next, last)					\
do {									\
	__mips_mt_fpaff_switch_to(prev);				\
	if (cpu_has_dsp)						\
		__save_dsp(prev);					\
	__clear_software_ll_bit();					\
	(last) = resume(prev, next, task_thread_info(next));		\
} while (0)
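Why last is needed can be seen with three tasks switching A to B, B to C, and C back to A. The sketch below is only a toy model: struct toy_task, toy_v0 and toy_switch_to() are hypothetical, and no real stack switch takes place. saved_prev stands for the local variable prev frozen on a task's own kernel stack, and toy_v0 stands for the register that carries the return value across the switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy task. saved_prev models the local 'prev' that stays
 * on this task's kernel stack while the task is switched out. */
struct toy_task {
	const char *name;
	struct toy_task *saved_prev;
};

/* Models register v0, which survives the switch. */
static struct toy_task *toy_v0;

/* Switch from prev to next; the return value is what next sees as last. */
static struct toy_task *toy_switch_to(struct toy_task *prev,
				      struct toy_task *next)
{
	assert(prev != NULL && next != NULL);
	prev->saved_prev = prev;	/* prev's frame records "prev == me" */
	toy_v0 = prev;			/* models: move v0, a0 */
	/* from here on the code would be running on next's stack */
	return toy_v0;			/* models: (last) = resume(...) */
}
```

After the sequence A to B, B to C, C to A, the value A finds on its own stack (saved_prev) is still A itself, while the value returned through toy_v0 is C, the task that actually ran last.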

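The assembly code of the leaf function resume is shown below, with the points of interest annotated directly in the code. When this function returns, execution continues in a different process. According to the ABI, the callee is responsible for preserving only s0-s7, plus the stack and frame registers sp and fp and the return address register ra. The other registers do not have to be saved by the callee.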

/*
 * task_struct *resume(task_struct *prev, task_struct *next,
 *                     struct thread_info *next_ti) )
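 */ # The three arguments are passed in registers a0, a1 and a2
LEAF(resume)
	mfc0	t1, CP0_STATUS          # CP0 register 12 (Status): operating mode, interrupt enable, diagnostic state, etc.
	sw	t1, THREAD_STATUS(a0)       # stored inside task_struct.thread
	cpu_save_nonscratch a0          # macro: saves registers s0-s7, sp, fp
	sw	ra, THREAD_REG31(a0)        # save the return address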

	/*
	 * check if we need to save FPU registers
	 */
	lw	t3, TASK_THREAD_INFO(a0)    
	lw	t0, TI_FLAGS(t3)
	li	t1, _TIF_USEDFPU
	and	t2, t0, t1
	beqz	t2, 1f
	nor	t1, zero, t1

	and	t0, t0, t1
	sw	t0, TI_FLAGS(t3)

	/*
	 * clear saved user stack CU1 bit
	 */
	lw	t0, ST_OFF(t3)
	li	t1, ~ST0_CU1
	and	t0, t0, t1
	sw	t0, ST_OFF(t3)

	fpu_save_single a0, t0			# clobbers t0

1:
	/*
	 * The order of restoring the registers takes care of the race
	 * updating $28, $29 and kernelsp without disabling ints.
	 */
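	move	$28, a2         # restoring the incoming task starts here
	cpu_restore_nonscratch a1           # restore s0-s7, sp, fp, ra; ra is the incoming task's ra, which makes the final jump land in that task

	addiu	t1, $28, _THREAD_SIZE - 32
	sw	t1, kernelsp            # kernelsp now points near the top of the incoming task's kernel stack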

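	# Build the new CP0 Status (register 12) value: keep the interrupt
	# mask IM7..IM0 (bits 15:8) and IE (bit 0) of the live register and
	# take all other bits from the Status value saved for the incoming
	# task. The Status fields are described in [4].
	mfc0	t1, CP0_STATUS		/* Do we really need this? */
	li	a3, 0xff01
	and	t1, a3
	lw	a2, THREAD_STATUS(a1)
	nor	a3, $0, a3
	and	a2, a3
	or	a2, t1
	mtc0	a2, CP0_STATUS
	move	v0, a0          # the return value is prev; it becomes last and tells the incoming task which task ran before it
	jr	ra          # return into the incoming task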
	END(resume)
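The Status merge at the end of resume is plain bit arithmetic and can be restated in C. This is a sketch, not kernel code: toy_merge_status() and TOY_STATUS_KEEP_MASK are hypothetical names; only the constant 0xff01 and the and/nor/or sequence come from the listing.

```c
#include <stdint.h>

/* Bits kept from the live Status register: IM7..IM0 (15:8) and IE (0). */
#define TOY_STATUS_KEEP_MASK 0xff01u

/* Combine the live Status value with the one saved for the incoming task. */
static uint32_t toy_merge_status(uint32_t live_status, uint32_t saved_status)
{
	uint32_t kept = live_status & TOY_STATUS_KEEP_MASK;	  /* and t1, a3 */
	uint32_t restored = saved_status & ~TOY_STATUS_KEEP_MASK; /* nor, and  */

	return restored | kept;					  /* or a2, t1 */
}
```

For a live value of 0x0000a401 and a saved value of 0x2000ff12, the result is 0x2000a413: the interrupt mask and IE come from the live register, everything else from the saved one.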

References:
[1] lkd3
[2] Linux Kernel 2.6.34 source code
[3] http://www.wowotech.net/process_management/context-switch-arch.html
[4] MIPS32™ Architecture For Programmers Volume III: The MIPS32™ Privileged Resource Architecture