Kernel initialization. Part 1.

First steps in the kernel code

The previous post was the last part of the Linux kernel booting process chapter, and now we are starting to dive into the initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in the correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code, which prepares everything before the first bytes of the Linux kernel code are executed. From now on we are in the kernel, and all parts of this chapter will be devoted to the initialization process of the kernel before it launches the process with pid 1. There are many things to do before the kernel starts the first init process. We hope to see all of these preparations in this big chapter. We will start from the kernel entry point, which is located in arch/x86/kernel/head_64.S, and will move further and further. We will see the first preparations like early page table initialization, the switch to a new descriptor in kernel space and many more, before we see the start_kernel function from init/main.c being called.

In the last part of the previous chapter we stopped at the jmp instruction from the arch/x86/boot/compressed/head_64.S assembly source code file:

jmp    *%rax

At this moment the rax register contains the address of the Linux kernel entry point, which was obtained as a result of the call to the decompress_kernel function from the arch/x86/boot/compressed/misc.c source code file. So, the last instruction in the kernel setup code is a jump to the kernel entry point. We already know where the entry point of the Linux kernel is defined, so we are able to start learning what the Linux kernel does after it starts.

First steps in the kernel

Okay, we got the address of the decompressed kernel image from the decompress_kernel function into the rax register and just jumped there. As we already know, the entry point of the decompressed kernel image is in the arch/x86/kernel/head_64.S assembly source code file, and at the beginning of it we can see the following definitions:

    .globl startup_64

We can see the definition of the startup_64 routine, which is placed in the __HEAD section. __HEAD is just a macro which expands to the definition of the executable .head.text section:

#define __HEAD        .section    ".head.text","ax"

We can see definition of this section in the arch/x86/kernel/ linker script:

.text : AT(ADDR(.text) - LOAD_OFFSET) {
    _text = .;
} :text = 0x9090

Besides the definition of the .text section, we can understand default virtual and physical addresses from the linker script. Note that address of the _text is location counter which is defined as:


for x86_64. The definition of the __START_KERNEL macro is located in the arch/x86/include/asm/page_types.h header file and represented by the sum of the base virtual address of the kernel mapping and physical start:



Or in other words:

  • Base physical address of the Linux kernel - 0x1000000;
  • Base virtual address of the Linux kernel - 0xffffffff81000000.
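To make the arithmetic concrete, here is a minimal C sketch. It assumes that the base of the kernel mapping is 0xffffffff80000000 (the difference between the two addresses listed above) and that the physical start is the default 0x1000000. The constant and function names are illustrative and are not the kernel's own macros.

```c
#include <stdint.h>

/* Assumed values for x86_64; the names are illustrative, not kernel macros. */
#define KERNEL_MAP_BASE  0xffffffff80000000ULL /* base virtual address of the kernel mapping */
#define PHYSICAL_START   0x1000000ULL          /* default physical load address (16 MB) */

/* Virtual address of _text: mapping base plus physical start. */
static uint64_t kernel_text_vaddr(void)
{
    return KERNEL_MAP_BASE + PHYSICAL_START;
}
```

Calling kernel_text_vaddr() reproduces the base virtual address from the list above.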

After we sanitized CPU configuration, we call __startup_64 function which is defined in arch/x86/kernel/head64.c:

    leaq    _text(%rip), %rdi
    pushq    %rsi
    call    __startup_64
    popq    %rsi

unsigned long __head __startup_64(unsigned long physaddr,
                 struct boot_params *bp)
{
    unsigned long load_delta, *p;
    unsigned long pgtable_flags;
    pgdval_t *pgd;
    p4dval_t *p4d;
    pudval_t *pud;
    pmdval_t *pmd, pmd_entry;
    pteval_t *mask_ptr;
    bool la57;
    int i;
    unsigned int *next_pgt_ptr;
    ...
}

Since kASLR may be enabled, the address where the startup_64 routine was loaded may differ from the address it was compiled to run at, so we need to calculate the delta with the following code:

    load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);

As a result, load_delta contains the delta between the address compiled to run at and the address actually loaded.

After we got the delta, we check whether the _text address is correctly aligned to 2 megabytes. We do it with the following code:

    if (load_delta & ~PMD_PAGE_MASK)
        for (;;);

If the _text address is not aligned to 2 megabytes, we enter an infinite loop. PMD_PAGE_MASK is the mask for the page middle directory (see the Paging part) and is defined as:

#define PMD_PAGE_MASK           (~(PMD_PAGE_SIZE-1))

where PMD_PAGE_SIZE macro is defined as:

#define PMD_PAGE_SIZE           (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT        21

As we can easily calculate, PMD_PAGE_SIZE is 2 megabytes.
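The delta calculation and the alignment check can be tried out in user space. The following is only a sketch under stated assumptions: the helper names compute_load_delta and delta_is_pmd_aligned are hypothetical, and the addresses passed to them are made-up examples, not values taken from a real boot.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMD_SHIFT      21
#define PMD_PAGE_SIZE  (1ULL << PMD_SHIFT)    /* 2 megabytes */
#define PMD_PAGE_MASK  (~(PMD_PAGE_SIZE - 1))

/* Difference between where the kernel was loaded and where it was linked to run. */
static uint64_t compute_load_delta(uint64_t loaded_phys, uint64_t linked_phys)
{
    return loaded_phys - linked_phys;
}

/* True when none of the low 21 bits are set, i.e. the delta is a multiple of 2 MB. */
static bool delta_is_pmd_aligned(uint64_t load_delta)
{
    return (load_delta & ~PMD_PAGE_MASK) == 0;
}
```

For example, a kernel linked for 0x1000000 but loaded at 0x3000000 gives a delta of 0x2000000, which passes the check, while a delta of 0x2001000 would not.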

If SME is supported and enabled, we activate it and include the SME encryption mask in load_delta:

    load_delta += sme_get_me_mask();

Okay, we did some early checks and now we can move on.

Fix base addresses of page tables

In the next step we fixup the physical addresses in the page table:

    pgd = fixup_pointer(&early_top_pgt, physaddr);
    pud = fixup_pointer(&level3_kernel_pgt, physaddr);
    pmd = fixup_pointer(level2_fixmap_pgt, physaddr);

So, let's look at the definition of fixup_pointer function which returns physical address of the passed argument:

static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
    return ptr - (void *)_text + (void *)physaddr;
}
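The same translation can be written with plain integers, which makes it easy to experiment with. This is a sketch, not the kernel function: fixup_address is a hypothetical name, and the link-time and actual addresses are passed in explicitly instead of being read from the _text symbol.

```c
#include <stdint.h>

/*
 * Translate a link-time address into the address where the object really
 * lives, given the link-time and the actual address of the image start.
 */
static uint64_t fixup_address(uint64_t linked_addr,
                              uint64_t linked_text,
                              uint64_t actual_text)
{
    return linked_addr - linked_text + actual_text;
}
```

An object linked 0x2000 bytes after _text stays 0x2000 bytes after the place where _text actually ended up.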

Next we'll focus on early_top_pgt and the other page table symbols which we saw above. Let's try to understand what these symbols mean. First of all let's look at their definition:

    # early_top_pgt
    .fill    512,8,0
    .fill    PTI_USER_PGD_FILL,8,0

    # level3_kernel_pgt
    .fill    L3_START_KERNEL,8,0
    .quad    level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
    .quad    level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC

    # level2_kernel_pgt: filled by the PMDS macro

    # level2_fixmap_pgt
    .fill    506,8,0
    .quad    level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
    .fill    5,8,0

    # level1_fixmap_pgt
    .fill    512,8,0

Looks hard, but it isn't. First of all let's look at the early_top_pgt. It starts with the 4096 bytes of zeros (or 8192 bytes if CONFIG_PAGE_TABLE_ISOLATION is enabled), it means that we don't use the first 512 entries. And after this we can see level3_kernel_pgt entry. At the start of its definition, we can see that it is filled with the 4080 bytes of zeros (L3_START_KERNEL equals 510). Subsequently, it stores two entries which map kernel space. Note that we subtract __START_KERNEL_map from level2_kernel_pgt and level2_fixmap_pgt. As we know __START_KERNEL_map is a base virtual address of the kernel text, so if we subtract __START_KERNEL_map, we will get physical addresses of the level2_kernel_pgt and level2_fixmap_pgt.

Next let's look at _KERNPG_TABLE_NOENC and _PAGE_TABLE_NOENC, these are just page entry access rights:

                   _PAGE_ACCESSED | _PAGE_DIRTY)

The level2_kernel_pgt is the page middle directory which maps kernel space. It uses the PMDS macro, which maps 512 megabytes starting from __START_KERNEL_map for the kernel .text (the module memory space follows these 512 megabytes).

The level2_fixmap_pgt holds virtual addresses which can refer to any physical addresses, even under kernel space. It consists of 4048 bytes of zeros, the level1_fixmap_pgt entry, 8 megabytes reserved for the vsyscalls mapping and a 2 megabyte hole.

You can read more about it in the Paging part.

Now, after we saw the definitions of these symbols, let's get back to the code. Next we initialize last entry of pgd with level3_kernel_pgt:

    pgd[pgd_index(__START_KERNEL_map)] = level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC;

All of the p*d addresses may be wrong if startup_64 is not at the default 0x1000000 address. Remember that load_delta contains the delta between the address of the startup_64 symbol assigned during kernel linking and the actual address. So we add the delta to certain entries of the p*d:

    pgd[pgd_index(__START_KERNEL_map)] += load_delta;
    pud[510] += load_delta;
    pud[511] += load_delta;
    pmd[506] += load_delta;

After all of this we will have:

early_top_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[506] -> level1_fixmap_pgt
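The indices 511, 510 and 0 in this picture follow directly from the bits of __START_KERNEL_map. The sketch below is a user-space illustration which assumes the 4-level shift values 39, 30 and 21 and the mapping base 0xffffffff80000000; the helper names are hypothetical.

```c
#include <stdint.h>

#define PGDIR_SHIFT     39
#define PUD_SHIFT       30
#define PMD_SHIFT       21
#define PTRS_PER_TABLE  512

/* Each level uses 9 bits of the virtual address as an index into its table. */
static unsigned int pgd_slot(uint64_t vaddr)
{
    return (unsigned int)((vaddr >> PGDIR_SHIFT) % PTRS_PER_TABLE);
}

static unsigned int pud_slot(uint64_t vaddr)
{
    return (unsigned int)((vaddr >> PUD_SHIFT) % PTRS_PER_TABLE);
}

static unsigned int pmd_slot(uint64_t vaddr)
{
    return (unsigned int)((vaddr >> PMD_SHIFT) % PTRS_PER_TABLE);
}
```

For 0xffffffff80000000 these helpers give slot 511 in the top-level table, slot 510 in level3_kernel_pgt and slot 0 in level2_kernel_pgt, which matches the chain above.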

Note that we didn't fix up the base address of early_top_pgt and some of the other page table directories, because we will deal with them when building and filling these page tables. Now that we have corrected the base addresses of the page tables, we can start to build them.

Identity mapping setup

Now we can look at the setup of the identity mapping in the early page tables. In identity-mapped paging, virtual addresses are mapped to identical physical addresses. Let's look at it in detail. First of all we replace pud and pmd with pointers to the first and second entries of early_dynamic_pgts:

    next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
    pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
    pmd = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);

Let's look at the early_dynamic_pgts definition:

    .fill    512*EARLY_DYNAMIC_PAGE_TABLES,8,0

which will store temporary page tables for early kernel.

Next we initialize pgtable_flags which will be used when initializing p*d entries later:

    pgtable_flags = _KERNPG_TABLE_NOENC + sme_get_me_mask();

sme_get_me_mask function returns sme_me_mask which was initialized in sme_enable function.

Next we fill two entries of pgd with pud plus pgtable_flags which we initialized above:

    i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
    pgd[i + 0] = (pgdval_t)pud + pgtable_flags;
    pgd[i + 1] = (pgdval_t)pud + pgtable_flags;

PGDIR_SHIFT is the position of the page global directory bits in a virtual address. We take the result modulo PTRS_PER_PGD (which expands to 512) so that the index never reaches 512 or more. There are macros for all types of page directories:

#define PGDIR_SHIFT     39
#define PTRS_PER_PGD    512
#define PUD_SHIFT       30
#define PTRS_PER_PUD    512
#define PMD_SHIFT       21
#define PTRS_PER_PMD    512

We do almost the same thing for the pud:

    i = (physaddr >> PUD_SHIFT) % PTRS_PER_PUD;
    pud[i + 0] = (pudval_t)pmd + pgtable_flags;
    pud[i + 1] = (pudval_t)pmd + pgtable_flags;

Next we initialize pmd_entry and filter out unsupported __PAGE_KERNEL_* bits:

    mask_ptr = fixup_pointer(&__supported_pte_mask, physaddr);
    pmd_entry &= *mask_ptr;
    pmd_entry += sme_get_me_mask();
    pmd_entry += physaddr;

Next we fill all pmd entries to cover full size of the kernel:

    for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
        int idx = i + (physaddr >> PMD_SHIFT) % PTRS_PER_PMD;
        pmd[idx] = pmd_entry + i * PMD_SIZE;
    }
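To see how many entries this loop writes and where they land, here is a small user-space sketch. The function fill_identity_pmds and its parameters are hypothetical. Unlike the excerpt above, the sketch also wraps the index so that it always stays inside the 512-entry array; that is a choice made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

#define PMD_SHIFT     21
#define PMD_SIZE      (1ULL << PMD_SHIFT)   /* 2 megabytes per entry */
#define PTRS_PER_PMD  512
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Cover image_size bytes starting at physaddr with 2 MB entries.
 * first_entry already holds physaddr plus the flag bits.
 * Returns the number of entries written.
 */
static size_t fill_identity_pmds(uint64_t pmd[PTRS_PER_PMD], uint64_t physaddr,
                                 uint64_t image_size, uint64_t first_entry)
{
    size_t count = (size_t)DIV_ROUND_UP(image_size, PMD_SIZE);
    size_t base = (size_t)((physaddr >> PMD_SHIFT) % PTRS_PER_PMD);

    for (size_t i = 0; i < count; i++)
        pmd[(base + i) % PTRS_PER_PMD] = first_entry + i * PMD_SIZE;

    return count;
}
```

With a made-up 5 megabyte image at 0x1000000, three entries are written, starting at index 8, and each one points 2 megabytes further than the previous one.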

Next we fixup the kernel text+data virtual addresses. Note that we might write invalid pmds, when the kernel is relocated (cleanup_highmap function fixes this up along with the mappings beyond _end).

    pmd = fixup_pointer(level2_kernel_pgt, physaddr);
    for (i = 0; i < PTRS_PER_PMD; i++) {
        if (pmd[i] & _PAGE_PRESENT)
            pmd[i] += load_delta;
    }

Next we remove the memory encryption mask to obtain the true physical address (remember that load_delta includes the mask):

    *fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();

phys_base must match the first entry in level2_kernel_pgt.

As final step of __startup_64 function, we encrypt the kernel (if SME is active) and return the SME encryption mask to be used as a modifier for the initial page directory entry programmed into cr3 register:

    return sme_get_me_mask();

Now let's get back to assembly code. We prepare for next paragraph with following code:

    addq    $(early_top_pgt - __START_KERNEL_map), %rax
    jmp 1f

which adds physical address of early_top_pgt to rax register so that rax register contains sum of the address and the SME encryption mask.

That's all for now. Our early paging is prepared and we just need to finish last preparation before we will jump into kernel entry point.

Last preparation before jump at the kernel entry point

After we jump to label 1, we enable PAE and PGE (Paging Global Extension), add the content of phys_base (see above) to the rax register and load the cr3 register with it:

    movl    $(X86_CR4_PAE | X86_CR4_PGE), %ecx
    movq    %rcx, %cr4

    addq    phys_base(%rip), %rax
    movq    %rax, %cr3

In the next step we check that CPU supports NX bit with:

    movl    $0x80000001, %eax
    cpuid
    movl    %edx,%edi

We put 0x80000001 value to the eax and execute cpuid instruction for getting the extended processor info and feature bits. The result will be in the edx register which we put to the edi.

Now we put 0xc0000080 or MSR_EFER to the ecx and execute rdmsr instruction for the reading model specific register.

    movl    $MSR_EFER, %ecx
    rdmsr

The result will be in the edx:eax. General view of the EFER is following:

63                                                                              32
|                                                                               |
|                                Reserved MBZ                                   |
|                                                                               |
31                            16  15      14      13   12  11   10  9  8 7  1   0
|                              | T |       |       |    |   |   |   |   |   |   |
| Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
|                              | E |       |       |    |   |   |   |   |   |   |

We will not look at all the fields in detail here; we will learn about this and other MSRs in a dedicated part. After reading EFER into edx:eax, we set _EFER_SCE (bit zero, System Call Extensions) to one with the btsl instruction. Setting the SCE bit enables the SYSCALL and SYSRET instructions. In the next step we test bit 20 of edi; remember that this register holds the result of cpuid (see above). If bit 20 (the NX bit) is not set, we jump straight to label 1 and write EFER with only SCE added to the model specific register.

    btsl    $_EFER_SCE, %eax
    btl    $20,%edi
    jnc     1f
    btsl    $_EFER_NX, %eax
    btsq    $_PAGE_BIT_NX,early_pmd_flags(%rip)
1:    wrmsr
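The bit manipulation in this fragment can be modelled in C. The sketch below is only an illustration: early_efer_low is a hypothetical helper, and the bit positions (SCE at bit 0, NXE at bit 11, the NX flag at bit 20 of the cpuid result in edx) are taken from the EFER layout and the text above.

```c
#include <stdint.h>

#define EFER_SCE_BIT      0   /* System Call Extensions */
#define EFER_NX_BIT       11  /* No-Execute Enable */
#define CPUID_EDX_NX_BIT  20  /* NX flag in the edx result of cpuid 0x80000001 */

/* Compute the low half of EFER that the early boot code would write back. */
static uint32_t early_efer_low(uint32_t efer_low, uint32_t cpuid_edx)
{
    efer_low |= 1u << EFER_SCE_BIT;             /* always enable SYSCALL/SYSRET */
    if (cpuid_edx & (1u << CPUID_EDX_NX_BIT))   /* set NXE only if the CPU supports NX */
        efer_low |= 1u << EFER_NX_BIT;
    return efer_low;
}
```

Starting from an EFER value with LME and LMA set (0x500), the result is 0x501 on a CPU without NX and 0xd01 on a CPU with NX.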

If the NX bit is supported we enable _EFER_NX and write it too, with the wrmsr instruction. After the NX bit is set, we set some bits in the cr0 control register with following assembly code:

    movl    $CR0_STATE, %eax
    movq    %rax, %cr0

specifically the following bits:

  • X86_CR0_PE - system is in protected mode;
  • X86_CR0_MP - controls interaction of WAIT/FWAIT instructions with TS flag in CR0;
  • X86_CR0_ET - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387;
  • X86_CR0_NE - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection;
  • X86_CR0_WP - when set, the CPU can't write to read-only pages when privilege level is 0;
  • X86_CR0_AM - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
  • X86_CR0_PG - enable paging.
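As a quick check of what ends up in cr0, the bits from the list can be combined in C. This is a sketch: the CR0_* names are local stand-ins for the kernel's X86_CR0_* macros, and the bit positions are the architectural ones (PE 0, MP 1, ET 4, NE 5, WP 16, AM 18, PG 31).

```c
#include <stdint.h>

/* Local stand-ins for the X86_CR0_* macros, using the architectural bit positions. */
#define CR0_PE (1u << 0)   /* protected mode enable */
#define CR0_MP (1u << 1)   /* monitor coprocessor */
#define CR0_ET (1u << 4)   /* extension type */
#define CR0_NE (1u << 5)   /* numeric error */
#define CR0_WP (1u << 16)  /* write protect */
#define CR0_AM (1u << 18)  /* alignment mask */
#define CR0_PG (1u << 31)  /* paging */

/* Value loaded into cr0 when all of the bits listed above are set. */
static uint32_t cr0_state(void)
{
    return CR0_PE | CR0_MP | CR0_ET | CR0_NE | CR0_WP | CR0_AM | CR0_PG;
}
```

The combined value is 0x80050033.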

We already know that to run any code, and even more C code from assembly, we need to setup a stack. As always, we are doing it by the setting of stack pointer to a correct place in memory and resetting flags register after this:

    movq initial_stack(%rip), %rsp
    pushq $0
    popfq

The most interesting thing here is the initial_stack. This symbol is defined in the source code file and looks like:

    .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS

The THREAD_SIZE macro is defined in the arch/x86/include/asm/page_64_types.h header file and depends on value of the KASAN_STACK_ORDER macro:



We consider the case when KASAN is disabled and PAGE_SIZE is 4096 bytes. So THREAD_SIZE expands to 16 kilobytes and represents the size of a thread's stack. Why a thread? You may already know that each process may have parent processes and child processes. A parent process and a child process differ in their stacks: a new kernel stack is allocated for each new process. In the Linux kernel this stack is represented by a union with the thread_info structure.
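The 16 kilobyte figure can be reproduced with a short calculation. The sketch assumes that the stack size is PAGE_SIZE shifted left by an order of 2 plus the KASAN order, and that the KASAN order is 0 when KASAN is disabled; this assumption is consistent with the numbers in the text, and the macro names below are local to the example.

```c
/* Assumed values: 4 KB pages, KASAN disabled. Names are local to this sketch. */
#define EXAMPLE_PAGE_SIZE     4096UL
#define EXAMPLE_KASAN_ORDER   0
#define EXAMPLE_THREAD_ORDER  (2 + EXAMPLE_KASAN_ORDER)

/* Stack size: the page size multiplied by 2^order. */
static unsigned long example_thread_size(void)
{
    return EXAMPLE_PAGE_SIZE << EXAMPLE_THREAD_ORDER;
}
```

With these values the function returns 16384 bytes, that is four pages.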

The init_thread_union is represented by the thread_union. And the thread_union is defined in the include/linux/sched.h file like the following:

union thread_union {
#ifndef CONFIG_ARCH_TASK_STRUCT_ON_STACK
    struct task_struct task;
#endif
#ifndef CONFIG_THREAD_INFO_IN_TASK
    struct thread_info thread_info;
#endif
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

The CONFIG_ARCH_TASK_STRUCT_ON_STACK kernel configuration option is only enabled for ia64 architecture, and the CONFIG_THREAD_INFO_IN_TASK kernel configuration option is enabled for x86_64 architecture. Thus the thread_info structure will be placed in task_struct structure instead of the thread_union union.

The init_thread_union is placed in the include/asm-generic/ file as part of the INIT_TASK_DATA macro like the following:

#define INIT_TASK_DATA(align)  \
    . = ALIGN(align);      \
    ...                    \
    init_thread_union = .; \

This macro is used in the arch/x86/kernel/ file like the following:

.data : AT(ADDR(.data) - LOAD_OFFSET) {
    ...
    INIT_TASK_DATA(THREAD_SIZE)
    ...
} :data

That is, init_thread_union is initialized with the address which is aligned to THREAD_SIZE which is 16 kilobytes.

Now we may understand this expression:

    .quad  init_thread_union + THREAD_SIZE - SIZEOF_PTREGS

that is, the initial_stack symbol points to the start of the thread_union.stack array plus THREAD_SIZE, which is 16 kilobytes, minus SIZEOF_PTREGS, a convention which helps the in-kernel unwinder reliably detect the end of the stack.
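The expression is plain pointer arithmetic, which the following sketch reproduces. The value 168 for SIZEOF_PTREGS is an assumption (21 saved registers of 8 bytes each on x86_64), the base address in the usage note is made up, and initial_stack_top is a hypothetical helper name.

```c
#include <stdint.h>

#define EXAMPLE_THREAD_SIZE    16384ULL  /* 16 kilobytes, as in the text */
#define EXAMPLE_SIZEOF_PTREGS  168ULL    /* assumed: 21 registers * 8 bytes */

/* Top of the early stack: end of the stack area minus room for one pt_regs. */
static uint64_t initial_stack_top(uint64_t thread_union_base)
{
    return thread_union_base + EXAMPLE_THREAD_SIZE - EXAMPLE_SIZEOF_PTREGS;
}
```

For a union placed at the made-up address 0x4000, the stack pointer would start at 0x7f58 and grow down towards 0x4000.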

After the early boot stack is set, we update the Global Descriptor Table with the lgdt instruction:

lgdt    early_gdt_descr(%rip)

where the early_gdt_descr is defined as:

    .word    GDT_ENTRIES*8-1
    .quad    INIT_PER_CPU_VAR(gdt_page)

We need to reload the Global Descriptor Table because the kernel currently works in the low userspace addresses, but soon it will work in its own space.

Now let's look at the definition of early_gdt_descr. GDT_ENTRIES expands to 32, so the Global Descriptor Table contains 32 entries for kernel code, data, thread local storage segments and so on.
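The first field of early_gdt_descr is the table limit, which is the size of the table in bytes minus one. A minimal sketch of that calculation, assuming 32 entries of 8 bytes each (the helper name is hypothetical):

```c
#include <stdint.h>

#define EXAMPLE_GDT_ENTRIES     32
#define EXAMPLE_GDT_ENTRY_SIZE  8   /* each descriptor is 8 bytes */

/* Limit field of the GDT descriptor: offset of the last valid byte. */
static uint16_t gdt_limit(void)
{
    return EXAMPLE_GDT_ENTRIES * EXAMPLE_GDT_ENTRY_SIZE - 1;
}
```

This gives a limit of 255, so the table spans bytes 0 to 255.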

Now let's look at the base address stored in early_gdt_descr. The gdt_page structure is defined in arch/x86/include/asm/desc.h as:

struct gdt_page {
    struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));

It contains one field gdt which is array of the desc_struct structure which is defined as:

struct desc_struct {
         union {
                 struct {
                         unsigned int a;
                         unsigned int b;
                 };
                 struct {
                         u16 limit0;
                         u16 base0;
                         unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                         unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
                 };
         };
 } __attribute__((packed));

which looks like a familiar GDT descriptor. Note that the gdt_page structure is aligned to PAGE_SIZE, which is 4096 bytes, which means that gdt will occupy one page.
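The field layout of desc_struct can also be read as shifts and masks on a single 64-bit value, which avoids relying on compiler-specific bit-field ordering. The helpers below are a sketch with hypothetical names; the descriptor values in the usage note are examples and are not taken from the kernel sources.

```c
#include <stdint.h>

/* Reassemble the 32-bit base from base0 (bits 16-31), base1 (32-39), base2 (56-63). */
static uint32_t desc_base(uint64_t d)
{
    uint32_t base0 = (uint32_t)((d >> 16) & 0xffff);
    uint32_t base1 = (uint32_t)((d >> 32) & 0xff);
    uint32_t base2 = (uint32_t)((d >> 56) & 0xff);
    return base0 | (base1 << 16) | (base2 << 24);
}

/* Reassemble the 20-bit limit from limit0 (bits 0-15) and limit (bits 48-51). */
static uint32_t desc_limit(uint64_t d)
{
    uint32_t limit0 = (uint32_t)(d & 0xffff);
    uint32_t limit_hi = (uint32_t)((d >> 48) & 0xf);
    return limit0 | (limit_hi << 16);
}

/* The l flag (bit 53) marks a 64-bit code segment. */
static int desc_long_mode(uint64_t d)
{
    return (int)((d >> 53) & 1);
}
```

For the example value 0x00af9b000000ffff, a flat 64-bit code segment, the base is 0, the limit is 0xfffff and the l flag is set.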

Now let's try to understand what INIT_PER_CPU_VAR is. INIT_PER_CPU_VAR is a macro which is defined in the arch/x86/include/asm/percpu.h and just concatenates init_per_cpu__ with the given parameter:

#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

After the INIT_PER_CPU_VAR macro is expanded, we will have init_per_cpu__gdt_page. We can see the initialization of init_per_cpu__gdt_page in the linker script:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load

INIT_PER_CPU_VAR gives us the name init_per_cpu__gdt_page, and the INIT_PER_CPU macro from the linker script defines that symbol as the offset of gdt_page plus __per_cpu_load. After these calculations, we will have the correct base address of the new GDT.
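The ## operator in INIT_PER_CPU_VAR is ordinary C token pasting, so its effect can be shown in a small stand-alone example. In this sketch the variable init_per_cpu__gdt_page is just a local integer standing in for the symbol that the linker script defines, and the value 42 is arbitrary.

```c
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

/* Stand-in for the symbol that the linker script would normally define. */
static int init_per_cpu__gdt_page = 42;

static int read_gdt_page_symbol(void)
{
    /* The preprocessor turns this into: return init_per_cpu__gdt_page; */
    return INIT_PER_CPU_VAR(gdt_page);
}
```

The macro only builds a name; which address that name refers to is decided elsewhere, in the kernel's case by the linker script.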

Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a per-CPU variable, each CPU will have its own copy of this variable. Here we are creating the gdt_page per-CPU variable. There are many advantages to variables of this type; for example, no locks are needed, because each CPU works with its own copy of the variable. So every core on a multiprocessor system will have its own GDT, and every entry in the table will represent a memory segment which can be accessed from the thread running on that core. You can read about per-CPU variables in detail in the Concepts/per-cpu post.

As we loaded new Global Descriptor Table, we reload segments as we did it every time:

    xorl %eax,%eax
    movl %eax,%ds
    movl %eax,%ss
    movl %eax,%es
    movl %eax,%fs
    movl %eax,%gs

After all of these steps we set up the gs register so that it points to the irqstack, a special stack on which interrupts will be handled:

    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx
    wrmsr

where MSR_GS_BASE is:

#define MSR_GS_BASE             0xc0000101

We need to put MSR_GS_BASE into the ecx register and write the data held in eax and edx (the low and high halves of initial_gs) with the wrmsr instruction. We don't use the cs, ds, es and ss segment registers for addressing in 64-bit mode, but the fs and gs registers can be used. fs and gs have a hidden part (as we saw in real mode for cs), and this part contains a descriptor which is mapped to Model Specific Registers. So 0xc0000101, which we can see above, is the gs.base MSR address. When a system call or interrupt occurs, there is no kernel stack at the entry point, so the value of MSR_GS_BASE will store the address of the interrupt stack.
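The wrmsr instruction takes the 64-bit value split across two 32-bit registers, which is why the assembly loads initial_gs and initial_gs+4 separately. A sketch of that split in C follows; the struct and function names are hypothetical and the address in the usage note is made up.

```c
#include <stdint.h>

/* The two halves that wrmsr expects: low 32 bits in eax, high 32 bits in edx. */
struct msr_halves {
    uint32_t eax;
    uint32_t edx;
};

static struct msr_halves split_msr_value(uint64_t value)
{
    struct msr_halves halves;

    halves.eax = (uint32_t)(value & 0xffffffffu);  /* low half */
    halves.edx = (uint32_t)(value >> 32);          /* high half */
    return halves;
}
```

For the example value 0xffff88001fc00000, eax would hold 0x1fc00000 and edx would hold 0xffff8800.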

In the next step we put the address of the real mode bootparam structure to the rdi (remember rsi holds pointer to this structure from the start) and jump to the C code with:

    pushq    $.Lafter_lret    # put return address on stack for unwinder
    xorq    %rbp, %rbp    # clear frame pointer
    movq    initial_code(%rip), %rax
    pushq    $__KERNEL_CS    # set correct cs
    pushq    %rax        # target address in negative space
    lretq

Here we put the address stored in initial_code into rax and push the return address, __KERNEL_CS and the address from initial_code onto the stack. After this we can see the lretq instruction, which pops the target address from the stack (now the address from initial_code) and jumps there. initial_code is defined in the same source code file and looks like this:

    .balign    8
    .quad    x86_64_start_kernel

As we can see, initial_code contains the address of x86_64_start_kernel, which is defined in arch/x86/kernel/head64.c and looks like this:

asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)

It has one argument, real_mode_data (remember that we passed the address of the real mode data in the rdi register previously).

Next to start_kernel

We need to see the last preparations before we reach the "kernel entry point", the start_kernel function from init/main.c.

First of all we can see some checks in the x86_64_start_kernel function:

BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

There are checks for different things, for example that the virtual address of the module space is not lower than the base address of the kernel text, __START_KERNEL_map, that the kernel text together with modules is not smaller than the kernel image, and so on. BUILD_BUG_ON is a macro which looks like this:

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

Let's try to understand how this trick works. Let's take the condition MODULES_VADDR < __START_KERNEL_map as an example. !!(condition) is the same as condition != 0. So if MODULES_VADDR < __START_KERNEL_map is true, we will get 1 from !!(condition), or zero if not. After 2*!!(condition) we will get either 2 or 0. At the end of the calculation we can get two different behaviors:

  • A compilation error, because we try to take the size of a char array with a negative length (this cannot happen in our case, because MODULES_VADDR is not less than __START_KERNEL_map);
  • No compilation errors.

That's all. It is an interesting C trick for getting a compile error which depends on some constants.
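The trick is easy to try outside the kernel. The sketch below copies the macro as shown above and adds two hypothetical helpers: array_length_for computes the array length the macro asks the compiler for, and compile_time_checks contains one check that is known to pass.

```c
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

/* Array length requested by the macro: 1 when the condition is false, -1 when true. */
static int array_length_for(int condition)
{
    return 1 - 2 * !!condition;
}

static void compile_time_checks(void)
{
    /* The condition is false, so this becomes sizeof(char[1]) and compiles. */
    BUILD_BUG_ON(sizeof(char) != 1);

    /* BUILD_BUG_ON(sizeof(char) == 1); would ask for char[-1] and fail to compile. */
}
```

Uncommenting the second check makes the compiler reject the file, which is exactly the behavior the kernel relies on.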

In the next step we can see the call of the cr4_init_shadow function, which stores a shadow copy of cr4 per CPU. Context switches can change bits in cr4, so we need to store cr4 for each CPU. After this we can see the call of the reset_early_page_tables function, where we reset the page global directory entries and write a new pointer to the PGT in cr3:

    memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
    next_early_pgt = 0;

Soon we will build new page tables. Here we can see that we zero all Page Global Directory entries except the last one (note the PTRS_PER_PGD-1 in the size). After this we set next_early_pgt to zero (we will see details about it in the next post) and write the physical address of early_top_pgt to cr3.
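The effect of the PTRS_PER_PGD-1 in the memset size can be checked with an ordinary array. This is a user-space sketch: reset_all_but_last is a hypothetical name and uint64_t stands in for pgd_t.

```c
#include <stdint.h>
#include <string.h>

#define PTRS_PER_PGD 512

/* Zero the first 511 entries and leave the last entry untouched. */
static void reset_all_but_last(uint64_t pgt[PTRS_PER_PGD])
{
    memset(pgt, 0, sizeof(uint64_t) * (PTRS_PER_PGD - 1));
}
```

If the table is filled with non-zero values first, only entry 511 keeps its value after the call. In the early page tables this is the entry that was pointed at level3_kernel_pgt earlier in this part.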

After this we clear the .bss section from __bss_start to __bss_stop and also clear init_top_pgt. init_top_pgt is defined in arch/x86/kernel/head_64.S like the following:

    .fill    512,8,0
    .fill    PTI_USER_PGD_FILL,8,0

This is exactly the same definition as early_top_pgt.

The next step will be the setup of the early IDT handlers, but it's a big concept, so we will see it in the next post.


This is the end of the first part about Linux kernel initialization.

If you have questions or suggestions, feel free to ping me on Twitter 0xAX, drop me an email or just create an issue.

In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and a lot more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
