Kernel initialization. Part 5.

Continuing architecture-specific initialization

In the previous part, we stopped at the initialization of architecture-specific stuff in the setup_arch function, and now we will continue with it. After memory for the initrd has been reserved, the next step is olpc_ofw_detect, which detects One Laptop Per Child support. We will not consider platform-related stuff in this book and will skip the functions related to it. So let's go ahead. The next step is the early_trap_init function. This function initializes the debug (#DB, raised when the TF flag of rflags is set) and int3 (#BP) interrupt gates. If you don't know anything about interrupts, you can read about them in the Early interrupt and exception handling part. In the x86 architecture INT, INTO and INT3 are special instructions which allow a task to explicitly call an interrupt handler. The INT3 instruction calls the breakpoint (#BP) handler. You may remember that we already saw it in the part about interrupts and exceptions:

----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                   |
----------------------------------------------------------------------------------------------
|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                    |
----------------------------------------------------------------------------------------------

The debug interrupt #DB is the primary method of invoking debuggers. early_trap_init is defined in arch/x86/kernel/traps.c. This function sets the #DB and #BP handlers and reloads the IDT:

void __init early_trap_init(void)
{
        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
        load_idt(&idt_descr);
}

We already saw the implementation of set_intr_gate in the previous part about interrupts. Here are two similar functions: set_intr_gate_ist and set_system_intr_gate_ist. Both of these functions take three parameters:

number of the interrupt;
base address of the interrupt/exception handler;
index in the Interrupt Stack Table (IST). IST is a new mechanism in x86_64 and is part of the TSS. Every active thread in kernel mode has its own kernel stack, which is 16 kilobytes. While a thread is in user space, this kernel stack is empty.

In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the linux kernel documentation - Kernel stacks. x86_64 provides a feature which allows switching to a new special stack during events such as a non-maskable interrupt. The name of this feature is Interrupt Stack Table. There can be up to 7 IST entries per CPU and every entry points to a dedicated stack. In our case this is DEBUG_STACK.

set_intr_gate_ist and set_system_intr_gate_ist work on the same principle as set_intr_gate, with only one difference. Both of these functions check the interrupt number and call _set_gate inside:

BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);

just as set_intr_gate does. But set_intr_gate calls _set_gate with dpl 0 and ist 0, while set_intr_gate_ist and set_system_intr_gate_ist set ist to DEBUG_STACK, and set_system_intr_gate_ist also sets dpl to 0x3, which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler. All of the special kernel stacks will be set up in the cpu_init function (we will see it later).
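To make the roles of dpl and ist more concrete, here is a minimal user-space sketch of how a 64-bit IDT gate descriptor could be packed. It is not the kernel's _set_gate: the demo_ names are hypothetical, and DEMO_KERNEL_CS and DEMO_DEBUG_STACK are assumed placeholder values chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of the 16-byte x86_64 IDT gate descriptor. */
struct demo_gate {
    uint16_t offset_low;    /* handler address, bits 0..15 */
    uint16_t segment;       /* code segment selector */
    uint8_t  ist;           /* bits 0..2: IST index (0 = no stack switch) */
    uint8_t  type_attr;     /* bits 0..3: type, 5..6: DPL, 7: present */
    uint16_t offset_middle; /* handler address, bits 16..31 */
    uint32_t offset_high;   /* handler address, bits 32..63 */
    uint32_t reserved;
};

#define DEMO_GATE_INTERRUPT 0xE  /* 64-bit interrupt gate type */
#define DEMO_KERNEL_CS      0x10 /* assumed selector value */
#define DEMO_DEBUG_STACK    4    /* hypothetical IST index */

/* Pack a present interrupt gate for the handler at addr. */
struct demo_gate demo_pack_gate(uint64_t addr, unsigned dpl, unsigned ist,
                                uint16_t seg)
{
    struct demo_gate g;

    assert(dpl <= 3 && ist <= 7);
    g.offset_low    = (uint16_t)(addr & 0xFFFF);
    g.segment       = seg;
    g.ist           = (uint8_t)ist;
    g.type_attr     = (uint8_t)(DEMO_GATE_INTERRUPT | (dpl << 5) | 0x80);
    g.offset_middle = (uint16_t)((addr >> 16) & 0xFFFF);
    g.offset_high   = (uint32_t)(addr >> 32);
    g.reserved      = 0;
    return g;
}

/* Extract the descriptor privilege level from a packed gate. */
unsigned demo_gate_dpl(const struct demo_gate *g)
{
    return (g->type_attr >> 5) & 3;
}

/* Reassemble the 64-bit handler address from its three parts. */
uint64_t demo_gate_handler(const struct demo_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

In this model a #DB-like gate would be packed with dpl 0 and a non-zero ist, and a #BP-like gate with dpl 3, so that the int3 instruction can be executed from user space.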

Now that the #DB and #BP gates have been written to the table described by idt_descr, we reload the IDT with load_idt, which just executes the lidt instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book and I do not see the point in this. It is very interesting to delve into the linux kernel source code, so we will see how the debug handler is implemented in this part, and understanding how the other interrupt handlers are implemented will be your task.

DB handler

As you can read above, we passed the address of the #DB handler as &debug to set_intr_gate_ist. lxr.free-electrons.com is a great resource for searching identifiers in the linux kernel source code, but unfortunately you will not find the debug handler with it. All you can find is the debug declaration in arch/x86/include/asm/traps.h:

asmlinkage void debug(void);

We can see the asmlinkage attribute, which tells us that debug is a function written in assembly. Yeah, again and again assembly :). The implementation of the #DB handler, like the other handlers, is in arch/x86/entry/entry_64.S and is defined with the idtentry assembly macro:

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

idtentry is a macro which defines an interrupt/exception entry point. As you can see it takes five arguments:

name of the interrupt entry point;
name of the interrupt handler;
whether the interrupt has an error code or not;
paranoid - if this parameter is 1, switch to the special stack (read above);
shift_ist - the stack to switch to during the interrupt.

Now let's look at the idtentry macro implementation. This macro is defined in the same assembly file and defines the debug function with the ENTRY macro. First, the idtentry macro checks that the given parameters are correct in case a switch to the special stack is needed. In the next step it checks whether the given interrupt has an error code. If the interrupt has no error code (in our case #DB has none), it calls INTR_FRAME; otherwise it calls XCPT_FRAME. Both of these macros, XCPT_FRAME and INTR_FRAME, generate no instructions and are needed only for building the initial frame state for interrupts. They use CFI directives and are used for debugging. You can find more info in the CFI directives documentation. As the comment from arch/x86/kernel/entry_64.S says: "CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code." So we will ignore them.

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    /* Sanity check */
    .if \shift_ist != -1 && \paranoid == 0
    .error "using shift_ist requires paranoid=1"
    .endif

    .if \has_error_code
    XCPT_FRAME
    .else
    INTR_FRAME
    .endif
    ...
    ...
    ...

You may remember from the previous part about early interrupt/exception handling that after an interrupt occurs, the current stack has the following format:

    +-----------------------+
    |                       |
+40 |         SS            |
+32 |         RSP           |
+24 |        RFLAGS         |
+16 |         CS            |
+8  |         RIP           |
 0  |       Error Code      | <---- rsp
    |                       |
    +-----------------------+

The next two macros from the idtentry implementation are:

    ASM_CLAC
    PARAVIRT_ADJUST_EXCEPTION_FRAME

The first, ASM_CLAC, depends on the CONFIG_X86_SMAP configuration option and is needed for security reasons; you can read more about it here. The second, PARAVIRT_ADJUST_EXCEPTION_FRAME, is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here).

The next piece of code checks whether the interrupt has an error code and, if not, pushes $-1 (which is 0xffffffffffffffff on x86_64) onto the stack:

    .ifeq \has_error_code
    pushq_cfi $-1
    .endif

We need it as a dummy error code, so that the stack layout is consistent for all interrupts. In the next step we subtract $ORIG_RAX-R15 from the stack pointer:

    subq $ORIG_RAX-R15, %rsp

where ORIG_RAX, R15 and the other macros are defined in arch/x86/include/asm/calling.h, and ORIG_RAX-R15 is 120 bytes. The general purpose registers will occupy these 120 bytes (15 registers of 8 bytes each), because we need to store all registers on the stack during interrupt handling. After we have set up the stack for the general purpose registers, the next step is checking whether the interrupt came from userspace with:

testl $3, CS(%rsp)
jnz 1f

Here we check the two lowest bits of CS. You may remember that the CS register contains a segment selector whose two lowest bits are the RPL. All privilege levels are integers in the range 0–3, where the lowest number corresponds to the highest privilege. So if the interrupt came from kernel mode we call save_paranoid, and otherwise we jump to label 1. In save_paranoid we store all general purpose registers on the stack and switch the user gs to the kernel gs if needed:

    movl $1,%ebx
    movl $MSR_GS_BASE,%ecx
    rdmsr
    testl %edx,%edx
    js 1f
    SWAPGS
    xorl %ebx,%ebx
1:    ret
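The two checks above can be modelled in plain C. This is only a sketch of the logic, not kernel code: the demo_ functions are hypothetical names, and the rdmsr result is passed in as an ordinary 64-bit value instead of being read from MSR_GS_BASE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models "testl $3, CS(%rsp)": the two lowest bits of a selector are
 * the RPL, and any non-zero RPL means the CPU was not in ring 0. */
bool demo_came_from_user(uint16_t cs)
{
    return (cs & 3) != 0;
}

/* Models "testl %edx,%edx; js 1f": rdmsr leaves the upper half of the
 * GS base in edx, and a set sign bit means a kernel-space address. */
bool demo_gs_base_is_kernel(uint64_t gs_base)
{
    return (gs_base >> 63) != 0;
}

/* Returns the value that would be left in ebx:
 * 1 = GS was already the kernel one, no swapgs needed on exit,
 * 0 = swapgs was executed and must be undone on exit. */
int demo_save_paranoid_ebx(uint64_t gs_base)
{
    return demo_gs_base_is_kernel(gs_base) ? 1 : 0;
}
```

For example, a selector value such as 0x33 has RPL 3 and is treated as user mode, while 0x10 has RPL 0 and is treated as kernel mode.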

In the next steps we put the pt_regs pointer into rdi, save the error code in rsi if there is one, and call the interrupt handler, which is do_debug in our case, from arch/x86/kernel/traps.c. do_debug, like other handlers, takes two parameters:

pt_regs - a structure which represents the set of CPU registers that are saved in the process' memory region;
error code - the error code of the interrupt.
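To see where the 120 bytes of ORIG_RAX-R15 come from, here is a user-space mock of the saved register frame. The field order is an assumption that mirrors the stack picture above (hardware frame on top, general purpose registers below); struct demo_pt_regs and the DEMO_ macros are hypothetical names, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of the register frame the entry code builds on the stack.
 * Lower addresses come first, so r15 sits where rsp points after
 * "subq $ORIG_RAX-R15, %rsp". */
struct demo_pt_regs {
    /* 15 general purpose registers saved by software */
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
    /* error code pushed by the CPU, or the dummy -1 */
    uint64_t orig_ax;
    /* frame pushed by the CPU on interrupt entry */
    uint64_t ip, cs, flags, sp, ss;
};

/* Byte offsets in the spirit of the calling.h macros. */
#define DEMO_R15      offsetof(struct demo_pt_regs, r15)
#define DEMO_ORIG_RAX offsetof(struct demo_pt_regs, orig_ax)
#define DEMO_CS       offsetof(struct demo_pt_regs, cs)

/* Size of the area reserved for the general purpose registers. */
size_t demo_gpr_area_size(void)
{
    return DEMO_ORIG_RAX - DEMO_R15;
}
```

With this layout the general purpose area is 15 * 8 = 120 bytes, and the saved CS used by the privilege check sits 136 bytes above the new stack pointer.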

After the interrupt handler finishes its work, paranoid_exit is called; it restores the stack, switches back to userspace if the interrupt came from there, and executes iret. That's all. Of course it is not all :), but we will look at it more deeply in the separate chapter about interrupts.

This is a general view of the idtentry macro for the #DB interrupt. All interrupts are similar to this implementation and are defined with idtentry too. After early_trap_init finishes its work, the next function is early_cpu_init. This function is defined in arch/x86/kernel/cpu/common.c and collects information about the CPU and its vendor.

Early ioremap initialization

The next step is initialization of early ioremap. In general there are two ways to communicate with devices:

I/O Ports;
Device memory.

We already saw the first method (outb/inb instructions) in the part about the linux kernel booting process. The second method is to map I/O physical addresses to virtual addresses. When the CPU accesses a physical address, it may refer to a portion of physical RAM, but it may also refer to the memory of an I/O device. So ioremap is used to map device memory into the kernel address space.

As I wrote above, the next function is early_ioremap_init, which re-maps I/O memory into the kernel address space so that the kernel can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like ioremap are available. The implementation of this function is in arch/x86/mm/ioremap.c. At the start of early_ioremap_init we can see the definition of the pmd pointer of type pmd_t (which represents a page middle directory entry: typedef struct { pmdval_t pmd; } pmd_t; where pmdval_t is unsigned long) and a check that the fixmap is aligned correctly:

pmd_t *pmd;
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));

fixmap is a set of fixed virtual address mappings which extends from FIXADDR_START to FIXADDR_TOP. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check, early_ioremap_init calls the early_ioremap_setup function from mm/early_ioremap.c. early_ioremap_setup fills the slot_virt array of unsigned long with the virtual addresses of the slots of the 512 temporary boot-time fix-mappings:

for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
    slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
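The loop above can be tried out in user space with a small model of fix_to_virt. This is a sketch under stated assumptions: DEMO_FIXADDR_TOP and DEMO_FIX_BTMAP_BEGIN are placeholder values, and the slot count (8) and pages per slot (64) are assumed so that they multiply to the 512 mappings mentioned in the text.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT       12
#define DEMO_FIXADDR_TOP      0xffffffffff5ff000ULL /* assumed value */
#define DEMO_NR_FIX_BTMAPS    64     /* pages per slot (assumed) */
#define DEMO_FIX_BTMAPS_SLOTS 8      /* number of slots (assumed) */
#define DEMO_FIX_BTMAP_BEGIN  1023UL /* hypothetical fixmap index */

/* Fixmap indexes grow downwards from the top of the fixmap area:
 * a larger index gives a lower virtual address. */
uint64_t demo_fix_to_virt(unsigned long idx)
{
    return DEMO_FIXADDR_TOP - ((uint64_t)idx << DEMO_PAGE_SHIFT);
}

/* Same loop as in early_ioremap_setup, on a caller-supplied array. */
void demo_early_ioremap_setup(uint64_t slot_virt[DEMO_FIX_BTMAPS_SLOTS])
{
    for (int i = 0; i < DEMO_FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = demo_fix_to_virt(DEMO_FIX_BTMAP_BEGIN -
                                        (unsigned long)DEMO_NR_FIX_BTMAPS * i);
}
```

Because each step lowers the index by 64, consecutive slots in this model are 64 pages apart, and the eight slots together span 512 pages of virtual address space.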

After this we get the page middle directory entry for FIX_BTMAP_BEGIN and put it into the pmd variable, fill bm_pte (the boot-time page tables) with zeros, and call the pmd_populate_kernel function to set the given page table in the given page middle directory entry:

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);

That's all for this. If you are feeling puzzled, don't worry. There is a special part about ioremap and fixmaps in the Linux Kernel Memory Management. Part 2 chapter.

Obtaining major and minor numbers for the root device

After early ioremap was initialized, you can see the following code:

ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);

This code obtains the major and minor numbers of the root device where the initrd will be mounted later in the do_mount_root function. The major number of a device identifies the driver associated with the device. The minor number refers to the device controlled by that driver. Note that old_decode_dev takes one parameter from the boot_params structure. As we can read in the x86 linux kernel boot protocol:

Field name:    root_dev
Type:        modify (optional)
Offset/size:    0x1fc/2
Protocol:    ALL

  The default root device device number.  The use of this field is
  deprecated, use the "root=" option on the command line instead.

Now let's try to understand what old_decode_dev does. Actually it just calls MKDEV, which generates a dev_t from the given major and minor numbers. Its implementation is pretty simple:

static inline dev_t old_decode_dev(u16 val)
{
         return MKDEV((val >> 8) & 255, val & 255);
}

where dev_t is a kernel data type representing a major/minor number pair. But what's the strange old_ prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way, major and minor numbers occupy 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem: only 256 major numbers and 256 minor numbers are possible. So the 16-bit integer was replaced by a 32-bit integer, where 12 bits are reserved for the major number and 20 bits for the minor number. You can see this in the new_decode_dev implementation:

static inline dev_t new_decode_dev(u32 dev)
{
         unsigned major = (dev & 0xfff00) >> 8;
         unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
         return MKDEV(major, minor);
}

If dev is 0xffffffff, the calculation gives 0xfff (12 bits) for the major number and 0xfffff (20 bits) for the minor number. So at the end of the execution of old_decode_dev we will have the major and minor numbers of the root device in ROOT_DEV.
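Both decoders are easy to try out in user space. The sketch below copies their bit manipulation; the DEMO_ macros are stand-ins for the kernel's MKDEV, MAJOR and MINOR, assuming a 32-bit dev_t with 20 minor bits as described above.

```c
#include <stdint.h>

typedef uint32_t demo_dev_t;

#define DEMO_MINORBITS 20
#define DEMO_MINORMASK ((1U << DEMO_MINORBITS) - 1)

/* Stand-ins for MKDEV, MAJOR and MINOR (12 major bits, 20 minor bits). */
#define DEMO_MKDEV(ma, mi) (((demo_dev_t)(ma) << DEMO_MINORBITS) | (mi))
#define DEMO_MAJOR(dev)    ((unsigned)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev)    ((unsigned)((dev) & DEMO_MINORMASK))

/* Old 16-bit encoding: high byte is the major, low byte the minor. */
demo_dev_t demo_old_decode_dev(uint16_t val)
{
    return DEMO_MKDEV((val >> 8) & 255, val & 255);
}

/* New 32-bit encoding: the major lives in bits 8..19, the minor is
 * split between bits 0..7 and bits 20..31. */
demo_dev_t demo_new_decode_dev(uint32_t dev)
{
    unsigned major = (dev & 0xfff00) >> 8;
    unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);

    return DEMO_MKDEV(major, minor);
}
```

For instance, the value 0x0801 decodes to major 8 and minor 1 with either function, which shows that the new encoding stays compatible with small old-style numbers.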

Memory map setup

The next point is the setup of the memory map with a call to the setup_memory_map function. But before this we set up different parameters, such as information about the screen (current row and column, video page, etc. (you can read about it in Video mode initialization and transition to protected mode)), Extended Display Identification Data, video mode, bootloader_type, etc.:

    screen_info = boot_params.screen_info;
    edid_info = boot_params.edid_info;
    saved_video_mode = boot_params.hdr.vid_mode;
    bootloader_type = boot_params.hdr.type_of_loader;
    if ((bootloader_type >> 4) == 0xe) {
        bootloader_type &= 0xf;
        bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
    }
    bootloader_version  = bootloader_type & 0xf;
    bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
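The bootloader type and version arithmetic above is compact, so here is a hedged user-space sketch of the same steps. struct demo_loader and demo_decode_loader are hypothetical names; the three parameters stand for the type_of_loader, ext_loader_type and ext_loader_ver header fields.

```c
#include <stdint.h>

struct demo_loader {
    int type;    /* loader ID in the upper bits, version nibble below */
    int version; /* low nibble plus the extended version bits */
};

/* Repeats the setup_arch arithmetic on plain integers. */
struct demo_loader demo_decode_loader(uint8_t type_of_loader,
                                      uint8_t ext_loader_type,
                                      uint8_t ext_loader_ver)
{
    struct demo_loader l;

    l.type = type_of_loader;
    /* A loader ID of 0xE means "look at the extended ID field". */
    if ((l.type >> 4) == 0xe) {
        l.type &= 0xf;
        l.type |= (ext_loader_type + 0x10) << 4;
    }
    l.version  = l.type & 0xf;
    l.version |= ext_loader_ver << 4;
    return l;
}
```

With type_of_loader equal to 0x72 and no extended fields, the result keeps type 0x72 and version 2. With 0xE1, an extended type of 0x01 and an extended version of 0x02, the type becomes 0x111 and the version 0x21.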

We got all of these parameters during boot time and stored them in the boot_params structure. After this we need to set up the end of the I/O memory. As you know, one of the main purposes of the kernel is resource management, and one of the resources is memory. As we already know, there are two ways to communicate with devices: I/O ports and device memory. All information about registered resources is available through:

/proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
/proc/iomem - provides current map of the system's memory for each physical device.

At the moment we are interested in /proc/iomem:

cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000f0000-000fffff : System ROM

As you can see, each range of addresses is shown in hexadecimal notation together with its owner. The Linux kernel provides an API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets relating to any hardware bus slot. The main structure, resource:

struct resource {
        resource_size_t start;
        resource_size_t end;
        const char *name;
        unsigned long flags;
        struct resource *parent, *sibling, *child;
};

represents an abstraction for a tree-like subset of system resources. This structure provides the range of addresses from start to end which a resource covers (resource_size_t is phys_addr_t, or u64 on x86_64), the name of the resource (you see these names in the /proc/iomem output) and the flags of the resource (all resource flags are defined in include/linux/ioport.h). The last members are three pointers to resource structures. These pointers enable a tree-like structure:

+-------------+      +-------------+
|             |      |             |
|    parent   |------|    sibling  |
|             |      |             |
+-------------+      +-------------+
       |
       |
+-------------+
|             |
|    child    | 
|             |
+-------------+
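The parent/sibling/child pointers can be exercised with a tiny user-space model. This is only a sketch: struct demo_resource drops the flags field, and demo_add_child and demo_find are hypothetical helpers which, unlike the kernel's resource API, perform no overlap or ordering checks.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_resource {
    uint64_t start, end; /* inclusive address range */
    const char *name;
    struct demo_resource *parent, *sibling, *child;
};

/* Append child to the end of parent's child list. */
void demo_add_child(struct demo_resource *parent,
                    struct demo_resource *child)
{
    struct demo_resource **p = &parent->child;

    while (*p)
        p = &(*p)->sibling;
    *p = child;
    child->parent = parent;
    child->sibling = NULL;
}

/* Return the deepest resource that contains addr, or NULL. */
const struct demo_resource *demo_find(const struct demo_resource *root,
                                      uint64_t addr)
{
    const struct demo_resource *c;

    if (addr < root->start || addr > root->end)
        return NULL;
    for (c = root->child; c; c = c->sibling) {
        const struct demo_resource *hit = demo_find(c, addr);
        if (hit)
            return hit;
    }
    return root;
}
```

Using ranges from the /proc/iomem output above, a "reserved" node covering 000e0000-000fffff with a "System ROM" child covering 000f0000-000fffff would resolve address 0xf1234 to the child and 0xe0010 to the parent.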

Every subset of resources has a root range resource. For iomem it is iomem_resource, which is defined as:

struct resource iomem_resource = {
        .name   = "PCI mem",
        .start  = 0,
        .end    = -1,
        .flags  = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);

iomem_resource defines the root address range for I/O memory, with the name PCI mem and IORESOURCE_MEM (0x00000200) as flags. As I wrote above, our current goal is to set the end address of iomem. We do it with:

iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;

Here we shift 1 left by boot_cpu_data.x86_phys_bits. boot_cpu_data is the cpuinfo_x86 structure which we filled during the execution of early_cpu_init. As you can understand from the name of the x86_phys_bits field, it represents the number of bits in the maximum physical address of the system. Note also that iomem_resource is passed to the EXPORT_SYMBOL macro. This macro exports the given symbol (iomem_resource in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.
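As a quick worked example of the shift, the following sketch computes the end address for a given number of physical address bits. demo_iomem_end is a hypothetical helper, and the bit counts mentioned below are just sample inputs, not values read from a real CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Highest physical address representable with phys_bits bits. */
uint64_t demo_iomem_end(unsigned phys_bits)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    assert(phys_bits < 64);
    return (1ULL << phys_bits) - 1;
}
```

With 36 physical address bits this gives 0xfffffffff (64 GB minus one byte), and with 46 bits it gives 0x3fffffffffff.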

After we set the end address of the root iomem resource address range, the next step, as I wrote above, is the setup of the memory map. It is done with a call to the setup_memory_map function:

void __init setup_memory_map(void)
{
        char *who;

        who = x86_init.resources.memory_setup();
        memcpy(&e820_saved, &e820, sizeof(struct e820map));
        printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
        e820_print_map(who);
}

First of all, look at the call to x86_init.resources.memory_setup. x86_init is an x86_init_ops structure which represents platform-specific setup functions such as resources initialization, PCI initialization, etc. The initialization of x86_init is in arch/x86/kernel/x86_init.c. I will not give the full description here because it is very long, but only the one part which interests us for now:

struct x86_init_ops x86_init __initdata = {
    .resources = {
            .probe_roms             = probe_roms,
            .reserve_resources      = reserve_standard_io_resources,
            .memory_setup           = default_machine_specific_memory_setup,
    },
    ...
    ...
    ...
};

As we can see here, the memory_setup field is default_machine_specific_memory_setup, where we get the number of e820 entries which we collected at boot time, sanitize the BIOS e820 map and fill the e820map structure with the memory regions. Once all regions are collected, each of them is printed with printk. You can find this output if you execute the dmesg command, where you will see something like this:

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
...
...
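To get a feeling for what such a map contains, here is a user-space sketch that models e820 entries and sums up the usable memory. The demo_ names are hypothetical; the entry layout (address, size, type) and the numeric type codes are assumptions that follow the common e820 convention rather than a copy of the kernel's structures.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_E820_RAM      = 1, /* printed as "usable" */
    DEMO_E820_RESERVED = 2,
    DEMO_E820_ACPI     = 3,
    DEMO_E820_NVS      = 4,
    DEMO_E820_UNUSABLE = 5
};

struct demo_e820entry {
    uint64_t addr; /* start of the region */
    uint64_t size; /* size of the region in bytes */
    uint32_t type; /* one of the DEMO_E820_* codes */
};

/* Human-readable name for a region type, as seen in dmesg. */
const char *demo_e820_type_name(uint32_t type)
{
    switch (type) {
    case DEMO_E820_RAM:      return "usable";
    case DEMO_E820_RESERVED: return "reserved";
    case DEMO_E820_ACPI:     return "ACPI data";
    case DEMO_E820_NVS:      return "ACPI NVS";
    case DEMO_E820_UNUSABLE: return "unusable";
    default:                 return "unknown";
    }
}

/* Sum of the sizes of all usable regions in the map. */
uint64_t demo_usable_bytes(const struct demo_e820entry *map, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (map[i].type == DEMO_E820_RAM)
            total += map[i].size;
    return total;
}
```

Note that dmesg prints inclusive end addresses, so a line such as [mem 0x0000000000000000-0x000000000009d7ff] corresponds to an entry with addr 0 and size 0x9d800.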

Copying of the BIOS Enhanced Disk Device information

The next two steps are parsing the setup_data with the parse_setup_data function and copying the BIOS EDD to a safe place. setup_data is a field from the kernel boot header, and as we can read in the x86 boot protocol:

Field name:    setup_data
Type:        write (special)
Offset/size:    0x250/8
Protocol:    2.09+

  The 64-bit physical pointer to NULL terminated single linked list of
  struct setup_data. This is used to define a more extensible boot
  parameters passing mechanism.

It is used for storing setup information of different types, such as a device tree blob, EFI setup data, etc. In the second step we copy the BIOS EDD information that we collected in arch/x86/boot/edd.c from the boot_params structure to the edd structure:

static inline void __init copy_edd(void)
{
     memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
            sizeof(edd.mbr_signature));
     memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
     edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
     edd.edd_info_nr = boot_params.eddbuf_entries;
}
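Returning to setup_data for a moment: the NULL-terminated singly linked list from the boot protocol can be modelled in user space as follows. This is a sketch, not parse_setup_data itself. struct demo_setup_data keeps only the header fields, the type codes are assumptions for illustration, and the "physical" next pointer is treated as an ordinary pointer, whereas the kernel has to map each node before it can read it.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_SETUP_NONE     = 0,
    DEMO_SETUP_E820_EXT = 1,
    DEMO_SETUP_DTB      = 2,
    DEMO_SETUP_PCI      = 3,
    DEMO_SETUP_EFI      = 4
};

/* Header of one list node; len bytes of payload would follow it. */
struct demo_setup_data {
    uint64_t next; /* address of the next node, 0 terminates the list */
    uint32_t type; /* one of the DEMO_SETUP_* codes */
    uint32_t len;  /* payload length in bytes */
};

/* Walk the list starting at head and count the nodes of a given type. */
size_t demo_count_setup_data(uint64_t head, uint32_t type)
{
    size_t n = 0;

    while (head) {
        const struct demo_setup_data *d =
            (const struct demo_setup_data *)(uintptr_t)head;

        if (d->type == type)
            n++;
        head = d->next;
    }
    return n;
}
```

A real parser would dispatch on the type of each node instead of counting, but the traversal pattern stays the same: read the header, remember next, handle the payload, move on.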

Memory descriptor initialization

The next step is the initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is represented by a special data structure called the memory descriptor. In the linux kernel source code the memory descriptor is represented by the mm_struct structure. mm_struct contains many different fields related to the process address space, such as the start/end addresses of the code/data, the start/end of the brk, the number of memory areas, the list of memory areas, etc. This structure is defined in include/linux/mm_types.h. As every process has its own memory descriptor, the task_struct structure contains it in the mm and active_mm fields. And our first init process has it too. You may remember that we saw part of the initialization of the init task_struct with the INIT_TASK macro in the previous part:

#define INIT_TASK(tsk)  \
{
    ...
    ...
    ...
    .mm = NULL,         \
    .active_mm  = &init_mm, \
    ...
}

mm points to the process address space, and active_mm points to the active address space when the process has no address space of its own, as is the case for kernel threads (you can read more about it in the documentation). Now we fill the memory descriptor of the initial process:

    init_mm.start_code = (unsigned long) _text;
    init_mm.end_code = (unsigned long) _etext;
    init_mm.end_data = (unsigned long) _edata;
    init_mm.brk = _brk_end;

with the kernel's text, data and brk. init_mm is the memory descriptor of the initial process and is defined as:

struct mm_struct init_mm = {
    .mm_rb          = RB_ROOT,
    .pgd            = swapper_pg_dir,
    .mm_users       = ATOMIC_INIT(2),
    .mm_count       = ATOMIC_INIT(1),
    .mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
    .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
    INIT_MM_CONTEXT(init_mm)
};

where mm_rb is a red-black tree of the virtual memory areas, pgd is a pointer to the page global directory, mm_users is the number of address space users, mm_count is the primary usage counter and mmap_sem is the memory area semaphore. After we set up the memory descriptor of the initial process, the next step is the initialization of the Intel Memory Protection Extensions with mpx_mm_init. The step after that is the initialization of the code/data/bss resources with:

    code_resource.start = __pa_symbol(_text);
    code_resource.end = __pa_symbol(_etext)-1;
    data_resource.start = __pa_symbol(_etext);
    data_resource.end = __pa_symbol(_edata)-1;
    bss_resource.start = __pa_symbol(__bss_start);
    bss_resource.end = __pa_symbol(__bss_stop)-1;

We already know a little about the resource structure (read above). Here we fill the code/data/bss resources with their physical addresses. You can see them in /proc/iomem:

00100000-be825fff : System RAM
  01000000-015bb392 : Kernel code
  015bb393-01930c3f : Kernel data
  01a11000-01ac3fff : Kernel bss

All of these structures are defined in arch/x86/kernel/setup.c and look like a typical resource initialization:

static struct resource code_resource = {
    .name    = "Kernel code",
    .start    = 0,
    .end    = 0,
    .flags    = IORESOURCE_BUSY | IORESOURCE_MEM
};

The last step which we will cover in this part is the NX configuration. The NX bit, or no-execute bit, is bit 63 of a paging-structure entry (such as a page directory entry), which controls the ability to execute code from all physical pages mapped by that entry. This bit can only be used/set when the no-execute page-protection mechanism is enabled by setting EFER.NXE to 1. In the x86_configure_nx function we check that the CPU supports the NX bit and that it is not disabled. After the check we fill __supported_pte_mask depending on the result:

void x86_configure_nx(void)
{
        if (cpu_has_nx && !disable_nx)
                __supported_pte_mask |= _PAGE_NX;
        else
                __supported_pte_mask &= ~_PAGE_NX;
}
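The effect of the mask can be shown with a small user-space model. The demo_ functions are hypothetical, and the CPU capability and the disable flag are passed in as plain booleans instead of being detected.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_NX (1ULL << 63) /* no-execute bit of an entry */

/* Same decision as x86_configure_nx, on a caller-supplied mask. */
uint64_t demo_configure_nx(uint64_t mask, bool cpu_has_nx, bool disable_nx)
{
    if (cpu_has_nx && !disable_nx)
        mask |= DEMO_PAGE_NX;
    else
        mask &= ~DEMO_PAGE_NX;
    return mask;
}

/* Flags requested for an entry are filtered through the mask, so the
 * NX bit silently disappears when it is not supported. */
uint64_t demo_filter_pte(uint64_t pte, uint64_t mask)
{
    return pte & mask;
}
```

So code that always asks for a non-executable mapping does not need to know whether NX is available: when it is not, the bit is simply dropped by the mask.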

Conclusion

It is the end of the fifth part about the linux kernel initialization process. In this part we continued to dive into the setup_arch function, which initializes architecture-specific stuff. It was a long part, but we have not finished with it. As I already wrote, setup_arch is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part, like fix-mapped addresses, ioremap, etc. Don't worry if they are unclear to you. There is a special part about these concepts - Linux kernel memory management Part 2. In the next part we will continue with the initialization of architecture-specific stuff and will see the parsing of the early kernel parameters, the early dump of the PCI devices, Desktop Management Interface scanning and many, many more.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
