Category: LINUX

2011-10-30 22:24:01

Linux Kernel Interrupts

@ For learning and exchange only; not for commercial use.

Linux Kernel Code: 2.6.35.7

Interrupt and exception privilege checks

Privilege check: user programs are allowed to call into kernel code, while the kernel is forbidden from calling user code, as a defense against malicious user programs. With this in place, an interrupt can occur in either user mode (ring 3) or kernel mode (ring 0), and the interrupt handler always executes in kernel mode (ring 0).

The biggest point of confusion about the privilege check is that interrupts can occur while the CPU is in user mode.

From ULK3:

Makes sure the interrupt was issued by an authorized source. First, it compares the Current Privilege Level (CPL), which is stored in the two least significant bits of the cs register, with the Descriptor Privilege Level (DPL ) of the Segment Descriptor included in the GDT. Raises a "General protection" exception if the CPL is lower than the DPL, because the interrupt handler cannot have a lower privilege than the program that caused the interrupt. For programmed exceptions, makes a further security check: compares the CPL with the DPL of the gate descriptor included in the IDT and raises a "General protection" exception if the DPL is lower than the CPL. This last check makes it possible to prevent access by user applications to specific trap or interrupt gates.

Note:

  1. In the description above, CPL and DPL are compared as numeric values;
  2. Only programmed exceptions, i.e. interrupts raised by an int xx instruction, get the second check; hardware-generated interrupts do not, as the Intel manual quoted below states.

  1. The first check guarantees that the kernel never calls user code;
  2. The second check guarantees that user programs cannot reach specific interrupt and trap gates via software interrupts (a toy model of both checks follows below).
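
To make the two comparisons concrete, here is a toy model in plain C (not kernel code; the function and the three scenarios are made up purely for illustration, with 0 = most privileged and 3 = least privileged):

#include <stdio.h>

/* Check 1: the handler's code-segment DPL must not be less privileged
 * (numerically greater) than the CPL.
 * Check 2: only for INT n / INT3 / INTO, the gate DPL must be >= CPL. */
static int gate_access_ok(int cpl, int handler_seg_dpl,
                          int gate_dpl, int software_int)
{
        if (cpl < handler_seg_dpl)
                return 0;               /* #GP */
        if (software_int && gate_dpl < cpl)
                return 0;               /* #GP */
        return 1;
}

int main(void)
{
        /* int $0x80 from ring 3: the system-call gate has DPL 3 -> allowed */
        printf("int $0x80 from ring 3: %s\n",
               gate_access_ok(3, 0, 3, 1) ? "allowed" : "#GP");
        /* int $14 (page-fault vector) from ring 3: gate DPL is 0 -> #GP */
        printf("int $14 from ring 3:   %s\n",
               gate_access_ok(3, 0, 0, 1) ? "allowed" : "#GP");
        /* hardware interrupt arriving in ring 3: gate DPL ignored -> allowed */
        printf("hw irq in ring 3:      %s\n",
               gate_access_ok(3, 0, 0, 0) ? "allowed" : "#GP");
        return 0;
}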

6.12.1.1 Protection of Exception- and Interrupt-Handler Procedures

The privilege-level protection for exception- and interrupt-handler procedures is similar to that used for ordinary procedure calls when called through a call gate (see Section 5.8.4, “Accessing a Code Segment Through a Call Gate”). The processor does not permit transfer of execution to an exception- or interrupt-handler procedure in a less privileged code segment (numerically greater privilege level) than the CPL.
An attempt to violate this rule results in a general-protection exception (#GP). The protection mechanism for exception- and interrupt-handler procedures is different in the following ways:
  • Because interrupt and exception vectors have no RPL, the RPL is not checked on implicit calls to exception and interrupt handlers.
  • The processor checks the DPL of the interrupt or trap gate only if an exception or interrupt is generated with an INT n, INT 3, or INTO instruction. Here, the CPL must be less than or equal to the DPL of the gate. This restriction prevents application programs or procedures running at privilege level 3 from using a software interrupt to access critical exception handlers, such as the page-fault handler, providing that those handlers are placed in more privileged code segments (numerically lower privilege level). For hardware-generated interrupts and processor-detected exceptions, the processor ignores the DPL of interrupt and trap gates.

64-ia-32-architectures-software-developer-vol-1-manual.pdf
64-ia-32-architectures-software-developer-vol-2a-2b-instruction-set-a-z-manual.pdf
64-ia-32-architectures-software-developer-vol-3a-3b-system-programming-manual.pdf


The GDT is per-CPU

In uniprocessor systems there is only one GDT, while in multiprocessor systems there is one GDT for every CPU in the system. All GDTs are stored in the cpu_gdt_table array, while the addresses and sizes of the GDTs (used when initializing the gdtr registers) are stored in the cpu_gdt_descr array.


real-mode address space

Intel(R) 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: 17.1.1 Address Translation in Real-Address Mode:

When using 8086-style address translation, it is possible to specify addresses larger than 1 MByte. For example, with a segment selector value of FFFFH and an offset of FFFFH, the linear (and physical) address would be 10FFEFH (1 megabyte plus 64 KBytes).
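
As a quick sanity check of that arithmetic, a trivial standalone C snippet (just the segment:offset calculation, nothing kernel-specific):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* 8086-style translation: linear = (segment << 4) + offset */
        uint32_t segment = 0xFFFF, offset = 0xFFFF;
        uint32_t linear  = (segment << 4) + offset;

        printf("linear = 0x%05X\n", linear);    /* prints 0x10FFEF */
        return 0;
}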

setup_idt();
setup_gdt();
protected_mode_jump(boot_params.hdr.code32_start,
                    (u32)&boot_params + (ds() << 4));

Why set up a temporary IDT and GDT before jumping into protected mode?
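
My reading (an assumption on my part, not something stated in the sources quoted here): the ljmpl at the end of protected_mode_jump loads __BOOT_CS, so a GDT containing a valid flat code-segment descriptor must already be in place the moment CR0.PE is set, and setup_idt(), as far as I recall, just loads a null (zero-length) IDT so that IDTR holds something sane until the kernel proper builds the real table. A rough sketch of that temporary GDT, reconstructed from memory (the exact values live in arch/x86/boot/pm.c and may differ in detail):

/* sketch of the boot GDT loaded by setup_gdt(); not a verbatim quote */
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
        /* flat 4 GB code segment, base 0 */
        [GDT_ENTRY_BOOT_CS]  = GDT_ENTRY(0xc09b, 0, 0xfffff),
        /* flat 4 GB data segment, base 0 */
        [GDT_ENTRY_BOOT_DS]  = GDT_ENTRY(0xc093, 0, 0xfffff),
        /* dummy TSS, reportedly only there to keep VT happy */
        [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};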

address translation

Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A & 3B):System Programming Guide:

3.1 MEMORY MANAGEMENT OVERVIEW


If paging is not used, the linear address space of the processor is mapped directly into the physical address space of processor. The physical address space is defined as the range of addresses that the processor can generate on its address bus.

 

Jump into Protected Mode 

commit 2ee2394b682c0ee99b0f083abe6c57727e6edb69
Author: H. Peter Anvin <hpa@zytor.com>
Date:   Mon Jun 30 15:42:47 2008 -0700

    x86: fix regression: boot failure on AMD Elan TS-5500

    Jeremy Fitzhardinge wrote:
    >
    > Maybe it really does require the far jump immediately after setting PE
    > in cr0...
    >
    > Hm, I don't remember this paragraph being in vol 3a, section 8.9.1
    > before.  Is it a recent addition?
    >
    >    Random failures can occur if other instructions exist between steps
    >    3 and 4 above.  Failures will be readily seen in some situations,
    >    such as when instructions that reference memory are inserted between
    >    steps 3 and 4 while in system management mode.
    >

    I don't remember that, either.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S
index ab049d4..141b6e2 100644
--- a/arch/x86/boot/pmjump.S
+++ b/arch/x86/boot/pmjump.S
@@ -33,6 +33,8 @@ protected_mode_jump:
        movw    %cs, %bx
        shll    $4, %ebx
        addl    %ebx, 2f
+       jmp     1f                      # Short jump to serialize on 386/486
+1:

        movw    $__BOOT_DS, %cx
        movw    $__BOOT_TSS, %di
@@ -40,8 +42,6 @@ protected_mode_jump:
        movl    %cr0, %edx
        orb     $X86_CR0_PE, %dl        # Protected mode
        movl    %edx, %cr0
-       jmp     1f                      # Short jump to serialize on 386/486
-1:

        # Transition to 32-bit mode
        .byte   0x66, 0xea              # ljmpl opcode

 

 

9.9.1 Switching to Protected Mode

Before switching to protected mode from real mode, a minimum set of system data structures and code modules must be loaded into memory, as described in Section 9.8, "Software Initialization for Protected-Mode Operation." Once these tables are created, software initialization code can switch into protected mode.

Protected mode is entered by executing a MOV CR0 instruction that sets the PE flag in the CR0 register. (In the same instruction, the PG flag in register CR0 can be set to enable paging.) Execution in protected mode begins with a CPL of 0.

Intel 64 and IA-32 processors have slightly different requirements for switching to protected mode. To insure upwards and downwards code compatibility with Intel 64 and IA-32 processors, we recommend that you follow these steps:

1. Disable interrupts. A CLI instruction disables maskable hardware interrupts. NMI interrupts can be disabled with external circuitry. (Software must guarantee that no exceptions or interrupts are generated during the mode switching operation.)

2. Execute the LGDT instruction to load the GDTR register with the base address of the GDT.

3. Execute a MOV CR0 instruction that sets the PE flag (and optionally the PG flag) in control register CR0.

4. Immediately following the MOV CR0 instruction, execute a far JMP or far CALL instruction. (This operation is typically a far jump or call to the next instruction in the instruction stream.)

5. The JMP or CALL instruction immediately after the MOV CR0 instruction changes the flow of execution and serializes the processor.

6. If paging is enabled, the code for the MOV CR0 instruction and the JMP or CALL instruction must come from a page that is identity mapped (that is, the linear address before the jump is the same as the physical address after paging and protected mode is enabled). The target instruction for the JMP or CALL instruction does not need to be identity mapped.

7. If a local descriptor table is going to be used, execute the LLDT instruction to load the segment selector for the LDT in the LDTR register.

8. Execute the LTR instruction to load the task register with a segment selector to the initial protected-mode task or to a writable area of memory that can be used to store TSS information on a task switch.

9. After entering protected mode, the segment registers continue to hold the contents they had in real-address mode. The JMP or CALL instruction in step 4 resets the CS register. Perform one of the following operations to update the contents of the remaining segment registers.

  • Reload segment registers DS, SS, ES, FS, and GS. If the ES, FS, and/or GS registers are not going to be used, load them with a null selector.
  • Perform a JMP or CALL instruction to a new task, which automatically resets the values of the segment registers and branches to a new code segment.

10. Execute the LIDT instruction to load the IDTR register with the address and limit of the protected-mode IDT.

11. Execute the STI instruction to enable maskable hardware interrupts and perform the necessary hardware operation to enable NMI interrupts.

Random failures can occur if other instructions exist between steps 3 and 4 above. Failures will be readily seen in some situations, such as when instructions that reference memory are inserted between steps 3 and 4 while in system management mode.

Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A & 3B):System Programming Guide:

9.8 SOFTWARE INITIALIZATION FOR PROTECTED-MODE OPERATION

 

Where does irq_desc->handle_irq get initialized?

struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
        [0 ... NR_IRQS-1] = {
                .handle_irq     = handle_bad_irq,
                .depth          = 1,
                .lock           = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
        }
};

 

Documentation/DocBook/genericirq.tmpl:

The interrupt flow handlers (either predefined or architecture specific) are assigned to specific interrupts by the architecture either during bootup or during device initialization.

 

Assigned during bootup:

init_IRQ()->XXX->init_ISA_irqs():

for (i = 0; i < legacy_pic->nr_legacy_irqs; i++) {
        struct irq_desc *desc = irq_to_desc(i);

        desc->status = IRQ_DISABLED;
        desc->action = NULL;
        desc->depth = 1;

        set_irq_chip_and_handler_name(i, &i8259A_chip,
                                      handle_level_irq, "XT");
}

 

drivers/gpio/pca953x.c:pca953x_irq_setup():

for (lvl = 0; lvl < chip->gpio_chip.ngpio; lvl++) {
        int irq = lvl + chip->irq_base;

        set_irq_chip_data(irq, chip);
        set_irq_chip_and_handler(irq, &pca953x_irq_chip,
                                 handle_edge_irq);
        set_irq_nested_thread(irq, 1);
#ifdef CONFIG_ARM
        set_irq_flags(irq, IRQF_VALID);
#else
        set_irq_noprobe(irq);
#endif
}

arch/mips/lasat/interrupt.c:arch_init_irq():

for (i = LASAT_IRQ_BASE; i <= LASAT_IRQ_END; i++)
        set_irq_chip_and_handler(i, &lasat_irq_type, handle_level_irq);

 

 

system_call is registered in trap_init() and is defined in arch/x86/kernel/entry_32.S.

The interrupt[] array is stored in an rodata section. After the IDT has been initialized from it, does that memory serve any further purpose? Is it reclaimed?

The interrupt array is defined in the .init.rodata section, in entry_32.S:

.section .init.rodata,"a"
ENTRY(interrupt)
.text
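
For what it's worth, the answer seems to be that native_init_IRQ() (arch/x86/kernel/irqinit.c) walks the interrupt[] array exactly once at boot to install the stubs into the IDT, roughly like the sketch below (paraphrased from memory, so the details may differ from 2.6.35). After that the IDT itself holds the stub addresses, the array is never consulted again, and it can be discarded along with the rest of the .init sections:

/* sketch, not a verbatim quote of arch/x86/kernel/irqinit.c */
void __init native_init_IRQ(void)
{
        int i;

        /* ... */
        for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++) {
                /* vectors already claimed elsewhere (e.g. the syscall vector) are skipped */
                if (!test_bit(i, used_vectors))
                        set_intr_gate(i, interrupt[i - FIRST_EXTERNAL_VECTOR]);
        }
        /* ... */
}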

Data in the .init sections is freed once init has completed:

        /* Init code and data - will be freed after init */
        . = ALIGN(PAGE_SIZE);
        .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
                __init_begin = .; /* paired with __init_end */
        }

#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
        /*
         * percpu offsets are zero-based on SMP.  PERCPU_VADDR() changes the
         * output PHDR, so the next output section - .init.text - should
         * start another segment - init.
         */
        PERCPU_VADDR(0, :percpu)
#endif

        INIT_TEXT_SECTION(PAGE_SIZE)
#ifdef CONFIG_X86_64
        :init
#endif

        INIT_DATA_SECTION(16)

 

The INIT_DATA_SECTION macro is defined in include/asm-generic/vmlinux.lds.h:

#define INIT_DATA_SECTION(initsetup_align)                      \
        .init.data : AT(ADDR(.init.data) - LOAD_OFFSET) {       \
                INIT_DATA                                       \
                INIT_SETUP(initsetup_align)                     \
                INIT_CALLS                                      \
                CON_INITCALL                                    \
                SECURITY_INITCALL                               \
                INIT_RAM_FS                                     \
        }

INIT_DATA is defined in the same file:

/* init and exit section handling */
#define INIT_DATA                               \
        *(.init.data)                           \
        DEV_DISCARD(init.data)                  \
        CPU_DISCARD(init.data)                  \
        MEM_DISCARD(init.data)                  \
        KERNEL_CTORS()                          \
        *(.init.rodata)                         \
        MCOUNT_REC()                            \
        DEV_DISCARD(init.rodata)                \
        CPU_DISCARD(init.rodata)                \
        MEM_DISCARD(init.rodata)

 

The call path that frees the init memory:

 

start_kernel()->rest_init()->new kernel thread: kernel_init()->init_post()->free_initmem();

 

softirq, tasklet, workqueue
  1. Before a softirq runs (in do_softirq), in_interrupt() is checked; if it is set, do_softirq bails out;
  2. Interrupts are disabled;
  3. do_softirq calls local_bh_disable (which increases preempt_count), disabling softirqs on the local CPU.

This has the following effects:

  1. On a given CPU, all deferrable functions execute serially;
  2. No process switch happens while a softirq is executing;
  3. Because interrupts are disabled while a softirq executes, it must never sleep;

Because do_softirq only disables softirqs on the local CPU, softirqs on other CPUs can still run. Moreover, since a softirq_action contains just one reentrant function and no data structure that needs cross-CPU protection, even softirqs of the same type can run concurrently on different CPUs. But, as stated above, on any one CPU all kinds of deferrable functions execute serially.

Tasklets are built on top of softirqs, so they share most of the properties above. The difference is that a tasklet_struct carries data that does need cross-CPU protection, so when tasklet_action runs a tasklet it first checks the corresponding state flag; if another CPU is already executing it, the tasklet is re-queued onto this CPU's tasklet_head list and run the next time around.

This gives tasklets a property that plain softirqs lack: a given tasklet executes on only one CPU at a time. Different tasklets can, of course, run concurrently on different CPUs.

Workqueues execute in process context; their execution makes no assumptions about interrupt context, so they are allowed to sleep.
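
A usage-level illustration of the constraints above (a minimal sketch against the 2.6.35-era APIs; my_dev_isr, my_tasklet_fn and my_work_fn are made-up names, and the module boilerplate and request_irq call are omitted):

#include <linux/interrupt.h>
#include <linux/workqueue.h>

/* tasklet: runs in softirq context, must not sleep */
static void my_tasklet_fn(unsigned long data)
{
        /* fast, non-blocking follow-up work */
}
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* work item: runs in process context, may sleep */
static void my_work_fn(struct work_struct *work)
{
        /* may call things that sleep, e.g. mutex_lock(), kmalloc(GFP_KERNEL) */
}
static DECLARE_WORK(my_work, my_work_fn);

/* interrupt handler: defer everything that can wait */
static irqreturn_t my_dev_isr(int irq, void *dev_id)
{
        tasklet_schedule(&my_tasklet);  /* deferred, still atomic context */
        schedule_work(&my_work);        /* deferred, process context */
        return IRQ_HANDLED;
}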

TODO: TSS…
  1. What the TSS is, and how the TSS is switched on an interrupt;
  2. When an interrupt can occur:
    • while a process is running in the system; this case we already understand well;
    • while the system is handling an interrupt and the handler has already disabled interrupts:
      • A. a maskable interrupt arrives;
      • B. a non-maskable interrupt arrives;
TSS

The processor transfers execution to another task in one of four cases:

  • The current program, task, or procedure executes a JMP or CALL instruction to a TSS descriptor in the GDT.
  • The current program, task, or procedure executes a JMP or CALL instruction to a task-gate descriptor in the GDT or the current LDT.
  • An interrupt or exception vector points to a task-gate descriptor in the IDT.
  • The current task executes an IRET when the NT flag in the EFLAGS register is set.

Note that not every jmp/call causes a task switch, and likewise not every interrupt/exception/iret does:

  • jmp/call causes a task switch only when its operand is a TSS descriptor or a task gate;
  • an interrupt/exception causes one only when the corresponding IDT entry is a task gate; Linux's IDT contains exactly one task gate, which handles the double fault (see the snippet after this list);
  • iret switches back to the previous task only when the NT (nested task) flag is set.
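
For reference, that one task gate is installed in trap_init(); roughly like this (quoted from memory, so treat it as a sketch rather than the exact 2.6.35 line):

/* arch/x86/kernel/traps.c, trap_init(), 32-bit only: vector 8 is the double fault */
set_task_gate(8, GDT_ENTRY_DOUBLEFAULT_TSS);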

All of these methods for dispatching a task identify the task to be dispatched with a segment selector that points to a task gate or the TSS for the task. When dispatching a task with a CALL or JMP instruction, the selector in the instruction may select the TSS directly or a task gate that holds the selector for the TSS. __When dispatching a task to handle an interrupt or exception, the IDT entry for the interrupt or exception must contain a task gate that holds the selector for the interrupt- or exception-handler TSS.__

The above is quoted from Intel Manual 3A, Chapter 7.

TODO: TSS

Points of interest about the TSS:

  1. Where is it stored? In the GDT;
  2. When is it used? On a task switch: jmp/call, exception|interrupt, iret;
  3. How is it operated on?

How many TSSs are there?

If the TSS descriptors are kept in the GDT, where are the TSSs themselves located?

Since the GDT is per-CPU in SMP systems, the GDT layout shows that each CPU has one general-purpose TSS descriptor and one dedicated double-fault TSS descriptor.

FROM ULK3: 3.3.2. Task State Segment

The 80x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts.

Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system.

This is done for two main reasons:

  • When an 80x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see the sections "Hardware Handling of Interrupts and Exceptions" in Chapter 4 and "Issuing a System Call via the sysenter Instruction" in Chapter 10).
  • When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

In other words, Linux does not use hardware context switches!

But it does not disable them either (I don't currently know of a way to disable task switching on Intel CPUs), so all tasks (kernel paths and user processes) share one and the same TSS (and TSS descriptor); don't get hung up on the double-fault special case. :)

 

Softirqs exist precisely to avoid keeping interrupts disabled for too long, so why would interrupts need to be disabled while they execute?

refer to: Intel Manual 3a: 6.8 ENABLING AND DISABLING INTERRUPTS

Disabling interrupts does not disable non-maskable interrupts and exceptions, which is how nested interrupts come about.

 

When the IF flag is set, interrupts delivered to the INTR pin or through the local APIC are processed as normal external interrupts.

While interrupts are disabled, the interrupt is not acked and the INTR state is not cleared. So once the IF flag is set again, will the earlier INTR state still be handled?

To answer that, one needs to understand:

  • how the CPU handles external interrupts arriving on INTR;
  • how the APIC works (reading Linux's APIC driver if necessary).

Softirqs do deferrable, time-consuming work, so of course that work cannot run inside the interrupt handler itself.

The do_softirq code is shown below; the first thing it does before executing the deferrable functions is re-enable interrupts. Before that, however, it disables bottom halves (local_bh_disable), so even if it gets interrupted, it will not be preempted on return to the kernel; execution resumes right here, and it will not be scheduled away either.

The consequence is that the code running in softirq context keeps going until it reaches the iteration limit, and then wakes up the softirq daemon.

Going back over the do_softirq() code confirms this: right before it actually runs the softirq_actions it does enable interrupts, so while the deferred functions themselves execute, interrupts are not disabled.

The analysis above leaves out one effect: local_bh_disable is what serializes softirqs on the local CPU, because do_softirq checks in_interrupt() at its very start.
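
An abridged paraphrase of do_softirq()/__do_softirq() from kernel/softirq.c (2.6.35), written out from memory to show the ordering discussed above (the in_interrupt() bail-out, the bh-disable, and interrupts being re-enabled around the actual handlers); it is a sketch, not a verbatim quote:

asmlinkage void do_softirq(void)
{
        unsigned long flags;

        if (in_interrupt())             /* already in irq or softirq context: bail out */
                return;

        local_irq_save(flags);
        if (local_softirq_pending())
                __do_softirq();
        local_irq_restore(flags);
}

asmlinkage void __do_softirq(void)
{
        struct softirq_action *h;
        __u32 pending = local_softirq_pending();
        int max_restart = MAX_SOFTIRQ_RESTART;

        /* raises the SOFTIRQ count in preempt_count: makes in_interrupt() true,
         * so no softirq reentry and no preemption on this CPU */
        __local_bh_disable((unsigned long)__builtin_return_address(0));
restart:
        set_softirq_pending(0);
        local_irq_enable();             /* the handlers themselves run with irqs enabled */

        h = softirq_vec;
        do {
                if (pending & 1)
                        h->action(h);
                h++;
                pending >>= 1;
        } while (pending);

        local_irq_disable();
        pending = local_softirq_pending();
        if (pending && --max_restart)
                goto restart;
        if (pending)
                wakeup_softirqd();      /* too much work: hand the rest to ksoftirqd */

        _local_bh_enable();             /* the variant that does not recurse into do_softirq */
}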

Actually, I have another half-baked idea:

The reason interrupt/exception handlers must be kept as short as possible is that the IRQ line is only acked, clearing its state so that a new IRQ on that line can be recognized, after the handler has finished.

The interrupt state can be observed from two sides:

  • the IRQ line, outside the CPU;
  • inside the CPU.

Inside the CPU, clearing the IF flag stops the CPU from responding to external interrupts, but external interrupts can still occur; handled or not, the IRQ line's state is simply there, and it will be seen once IF is set again. If, on the other hand, the IRQ line's state is not cleared right after an interrupt, then even if another interrupt of the same kind arrives, the change on the line cannot be recognized.

This idea really is half-baked; the fuzzy part is exactly when the IRQ line gets acked, and do_IRQ can tell us:

do_IRQ()->handle_irq()-> e.g. handle_level_irq():

void
handle_level_irq(unsigned int irq, struct irq_desc *desc)
{
        struct irqaction *action;
        irqreturn_t action_ret;

        raw_spin_lock(&desc->lock);
        mask_ack_irq(desc, irq);
        /* ****** (lines elided) ****** */
        action = desc->action;
        if (unlikely(!action || (desc->status & IRQ_DISABLED)))
                goto out_unlock;

        desc->status |= IRQ_INPROGRESS;
        raw_spin_unlock(&desc->lock);

        action_ret = handle_IRQ_event(irq, action);
        /* ****** (lines elided) ****** */
        if (!(desc->status & (IRQ_DISABLED | IRQ_ONESHOT)))
                unmask_irq(desc, irq);
out_unlock:
        raw_spin_unlock(&desc->lock);
}

Notice that the very first thing the handler does is call mask_ack_irq() to ack the interrupt. But there is also a mask: even though the interrupt is acked right away, it is kept masked, which disables this kind of interrupt at the external APIC, so if the APIC sees another one it does not need to change the IRQ line's state again.

static inline void mask_ack_irq(struct irq_desc *desc, int irq)
{
        if (desc->chip->mask_ack)
                desc->chip->mask_ack(irq);
        else {
                desc->chip->mask(irq);
                if (desc->chip->ack)
                        desc->chip->ack(irq);
        }
        desc->status |= IRQ_MASKED;
}

 

from ULK3:

Each IRQ line can be selectively disabled. Thus, the PIC can be programmed to disable IRQs. That is, the PIC can be told to stop issuing interrupts that refer to a given IRQ line, or to resume issuing them. Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again. This feature is used by most interrupt handlers, because it allows them to process IRQs of the same type serially.

Selective enabling/disabling of IRQs is not the same as global masking/unmasking of maskable interrupts. When the IF flag of the eflags register is clear, each maskable interrupt issued by the PIC is temporarily ignored by the CPU.
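
The two mechanisms ULK3 distinguishes here map onto two different kernel APIs; a small illustrative sketch (the function and its use are made up, only the APIs are real):

#include <linux/interrupt.h>
#include <linux/irqflags.h>

static void example(unsigned int irq)
{
        unsigned long flags;

        /* selective, per-line masking at the PIC/IO-APIC: other IRQ lines still
         * reach the CPU; interrupts on this line are held back and delivered
         * once the line is re-enabled */
        disable_irq(irq);
        /* ... touch data shared with this IRQ's handler ... */
        enable_irq(irq);

        /* global masking on the local CPU: clears IF, so all maskable
         * interrupts are ignored by this CPU until flags are restored */
        local_irq_save(flags);
        /* ... short critical section ... */
        local_irq_restore(flags);
}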

 

How does kernel code end up above 1 MB while the CPU is still in real mode?

At boot, the second part of the kernel is placed starting at 0x100000, i.e. above 1 MB.

How is that done while the CPU is still in real mode?

The answer is simple: the bootloader put the kernel there. Reading the u-boot code shows that u-boot is in protected mode while it loads the kernel image, and only at the moment it is about to hand control to the Linux kernel does it bring the CPU back to real mode.

Granted, a real-mode CPU can address a little beyond 1 MB, but only an extra 64 KB or so, which is clearly not enough to hold the second part of the kernel image.

 

Addendum: the kernel's interrupt-handling architecture:

There are three main levels of abstraction in the interrupt code:

  • Highlevel driver API
  • Highlevel IRQ flow handlers
  • Chiplevel hardware encapsulation

Each interrupt is described by an interrupt descriptor structure irq_desc. The interrupt is referenced by an 'unsigned int' numeric value which selects the corresponding interrupt description structure in the descriptor structures array. The descriptor structure contains status information and pointers to the interrupt flow method and the interrupt chip structure which are assigned to this interrupt.

Whenever an interrupt triggers, the lowlevel arch code calls into the generic interrupt code by calling desc->handle_irq(). This highlevel IRQ handling function only uses desc->chip primitives referenced by the assigned chip descriptor structure.
  • Chiplevel hardware encapsulation: wraps the interrupt-controller driver, i.e. operations such as ack(), mask();
  • Highlevel IRQ flow handlers: a set of pre-defined irq-flow methods, e.g. the handling flows for edge- and level-triggered interrupts;
  • Highlevel driver API: how the interrupting device's driver responds to the interrupt, e.g. a NIC sending and receiving packets (see the sketch below);
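
The examples earlier in this post cover the chip level and the flow-handler level; at the top level a driver simply registers its handler against an IRQ number and never touches the PIC/APIC itself. A minimal sketch against the 2.6.35-era API (my_card_isr and its surroundings are made up):

#include <linux/interrupt.h>

static irqreturn_t my_card_isr(int irq, void *dev_id)
{
        /* acknowledge the device, read its status, schedule deferred work ... */
        return IRQ_HANDLED;
}

static int my_card_setup(unsigned int irq, void *priv)
{
        /* Highlevel driver API: the flow handler and chip operations installed
         * earlier take care of masking/acking the interrupt controller */
        return request_irq(irq, my_card_isr, IRQF_SHARED, "my_card", priv);
}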

Reference: Documentation/DocBook/genericirq.tmpl


