Kernel initialization. Part 1.

First steps in the kernel code

The previous post was a last part of the Linux kernel booting process chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with pid 1. There are many things to do before the kernel will start first init process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the arch/x86/kernel/head_64.S and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the start_kernel function from the init/main.c will be called.

In the last part of the previous chapter we stopped at the jmp instruction from the arch/x86/boot/compressed/head_64.S assembly source code file:

jmp    *%rax

At this moment the rax register contains address of the Linux kernel entry point which that was obtained as a result of the call of the decompress_kernel function from the arch/x86/boot/compressed/misc.c source code file. So, our last instruction in the kernel setup code is a jump on the kernel entry point. We already know where is defined the entry point of the linux kernel, so we are able to start to learn what does the Linux kernel does after the start.

First steps in the kernel

Okay, we got the address of the decompressed kernel image from the decompress_kernel function into rax register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the arch/x86/kernel/head_64.S assembly source code file and at the beginning of it, we can see following definitions:

    .globl startup_64

We can see definition of the startup_64 routine that is defined in the __HEAD section, which is just a macro which expands to the definition of executable .head.text section:

#define __HEAD        .section    ".head.text","ax"

We can see definition of this section in the arch/x86/kernel/ linker script:

.text : AT(ADDR(.text) - LOAD_OFFSET) {
    _text = .;
} :text = 0x9090

Besides the definition of the .text section, we can understand default virtual and physical addresses from the linker script. Note that address of the _text is location counter which is defined as:


for the x86_64. The definition of the __START_KERNEL macro is located in the arch/x86/include/asm/page_types.h header file and represented by the sum of the base virtual address of the kernel mapping and physical start:



Or in other words:

  • Base physical address of the Linux kernel - 0x1000000;
  • Base virtual address of the Linux kernel - 0xffffffff81000000.

Now we know default physical and virtual addresses of the startup_64 routine, but to know actual addresses we must to calculate it with the following code:

    leaq    _text(%rip), %rbp
    subq    $_text - __START_KERNEL_map, %rbp

Yes, it defined as 0x1000000, but it may be different, for example if kASLR is enabled. So our current goal is to calculate delta between 0x1000000 and where we actually loaded. Here we just put the rip-relative address to the rbp register and then subtract $_text - __START_KERNEL_map from it. We know that compiled virtual address of the _text is 0xffffffff81000000 and the physical address of it is 0x1000000. The __START_KERNEL_map macro expands to the 0xffffffff80000000 address, so at the second line of the assembly code, we will get following expression:

rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)

So, after the calculation, the rbp will contain 0 which represents difference between addresses where we actually loaded and where the code was compiled. In our case zero means that the Linux kernel was loaded by default address and the kASLR was disabled.

After we got the address of the startup_64, we need to do a check that this address is correctly aligned. We will do it with the following code:

    testl    $~PMD_PAGE_MASK, %ebp
    jnz    bad_address

Here we just compare low part of the rbp register with the complemented value of the PMD_PAGE_MASK. The PMD_PAGE_MASK indicates the mask for Page middle directory (read paging about it) and defined as:

#define PMD_PAGE_MASK           (~(PMD_PAGE_SIZE-1))

where PMD_PAGE_SIZE macro defined as:

#define PMD_PAGE_SIZE           (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT       21

As we can easily calculate, PMD_PAGE_SIZE is 2 megabytes. Here we use standard formula for checking alignment and if text address is not aligned for 2 megabytes, we jump to bad_address label.

After this we check address that it is not too large by the checking of highest 18 bits:

    leaq    _text(%rip), %rax
    shrq    $MAX_PHYSMEM_BITS, %rax
    jnz    bad_address

The address must not be greater than 46-bits:

#define MAX_PHYSMEM_BITS       46

Okay, we did some early checks and now we can move on.

Fix base addresses of page tables

The first step before we start to setup identity paging is to fixup following addresses:

    addq    %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
    addq    %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq    %rbp, level3_kernel_pgt + (511*8)(%rip)
    addq    %rbp, level2_fixmap_pgt + (506*8)(%rip)

All of early_level4_pgt, level3_kernel_pgt and other address may be wrong if the startup_64 is not equal to default 0x1000000 address. The rbp register contains the delta address so we add to the certain entries of the early_level4_pgt, the level3_kernel_pgt and the level2_fixmap_pgt. Let's try to understand what these labels mean. First of all let's look at their definition:

    .fill    511,8,0
    .quad    level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

    .fill    L3_START_KERNEL,8,0
    .quad    level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
    .quad    level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE


    .fill    506,8,0
    .quad    level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
    .fill    5,8,0

    .fill    512,8,0

Looks hard, but it isn't. First of all let's look at the early_level4_pgt. It starts with the (4096 - 8) bytes of zeros, it means that we don't use the first 511 entries. And after this we can see one level3_kernel_pgt entry. Note that we subtract __START_KERNEL_map + _PAGE_TABLE from it. As we know __START_KERNEL_map is a base virtual address of the kernel text, so if we subtract __START_KERNEL_map, we will get physical address of the level3_kernel_pgt. Now let's look at _PAGE_TABLE, it is just page entry access rights:

                         _PAGE_ACCESSED | _PAGE_DIRTY)

You can read more about it in the paging part.

The level3_kernel_pgt - stores two entries which map kernel space. At the start of it's definition, we can see that it is filled with zeros L3_START_KERNEL or 510 times. Here the L3_START_KERNEL is the index in the page upper directory which contains __START_KERNEL_map address and it equals 510. After this, we can see the definition of the two level3_kernel_pgt entries: level2_kernel_pgt and level2_fixmap_pgt. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has:


access rights. The second - level2_fixmap_pgt is a virtual addresses which can refer to any physical addresses even under kernel space. They represented by the one level2_fixmap_pgt entry and 10 megabytes hole for the vsyscalls mapping. The next level2_kernel_pgt calls the PDMS macro which creates 512 megabytes from the __START_KERNEL_map for kernel .text (after these 512 megabytes will be modules memory space).

Now, after we saw definitions of these symbols, let's get back to the code which is described at the beginning of the section. Remember that the rbp register contains delta between the address of the startup_64 symbol which was got during kernel linking and the actual address. So, for this moment, we just need to add this delta to the base address of some page table entries, that they'll have correct addresses. In our case these entries are:

    addq    %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
    addq    %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq    %rbp, level3_kernel_pgt + (511*8)(%rip)
    addq    %rbp, level2_fixmap_pgt + (506*8)(%rip)

or the last entry of the early_level4_pgt which is the level3_kernel_pgt, last two entries of the level3_kernel_pgt which are the level2_kernel_pgt and the level2_fixmap_pgt and five hundreds seventh entry of the level2_fixmap_pgt which is level1_fixmap_pgt page directory.

After all of this we will have:

early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[507] -> level1_fixmap_pgt

Note that we didn't fixup base address of the early_level4_pgt and some of other page table directories, because we will see this during of building/filling of structures for these page tables. As we corrected base addresses of the page tables, we can start to build it.

Identity mapping setup

Now we can see the set up of identity mapping of early page tables. In Identity Mapped Paging, virtual addresses are mapped to physical addresses that have the same value, 1 : 1. Let's look at it in detail. First of all we get the rip-relative address of the _text and _early_level4_pgt and put they into rdi and rbx registers:

    leaq    _text(%rip), %rdi
    leaq    early_level4_pgt(%rip), %rbx

After this we store address of the _text in the rax and get the index of the page global directory entry which stores _text address, by shifting _text address on the PGDIR_SHIFT:

    movq    %rdi, %rax
    shrq    $PGDIR_SHIFT, %rax

where PGDIR_SHIFT is 39. PGDIR_SHFT indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories:

#define PGDIR_SHIFT     39
#define PUD_SHIFT       30
#define PMD_SHIFT       21

After this we put the address of the first entry of the early_dynamic_pgts page table to the rdx register with the _KERNPG_TABLE access rights (see above) and fill the early_level4_pgt with the 2 early_dynamic_pgts entries:

    leaq    (4096 + _KERNPG_TABLE)(%rbx), %rdx
    movq    %rdx, 0(%rbx,%rax,8)
    movq    %rdx, 8(%rbx,%rax,8)

The rbx register contains address of the early_level4_pgt and %rax * 8 here is index of a page global directory occupied by the _text address. So here we fill two entries of the early_level4_pgt with address of two entries of the early_dynamic_pgts which is related to _text. The early_dynamic_pgts is array of arrays:

extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];

which will store temporary page tables for early kernel which we will not move to the init_level4_pgt.

After this we add 4096 (size of the early_level4_pgt) to the rdx (it now contains the address of the first entry of the early_dynamic_pgts) and put rdi (it now contains physical address of the _text) to the rax. Now we shift address of the _text ot PUD_SHIFT to get index of an entry from page upper directory which contains this address and clears high bits to get only pud related part:

    addq    $4096, %rdx
    movq    %rdi, %rax
    shrq    $PUD_SHIFT, %rax
    andl    $(PTRS_PER_PUD-1), %eax

As we have index of a page upper directory we write two addresses of the second entry of the early_dynamic_pgts array to the first entry of this temporary page directory:

    movq    %rdx, 4096(%rbx,%rax,8)
    incl    %eax
    andl    $(PTRS_PER_PUD-1), %eax
    movq    %rdx, 4096(%rbx,%rax,8)

In the next step we do the same operation for last page table directory, but filling not two entries, but all entries to cover full size of the kernel.

After our early page table directories filled, we put physical address of the early_level4_pgt to the rax register and jump to label 1:

    movq    $(early_level4_pgt - __START_KERNEL_map), %rax
    jmp 1f

That's all for now. Our early paging is prepared and we just need to finish last preparation before we will jump into C code and kernel entry point later.

Last preparation before jump at the kernel entry point

After that we jump to the label 1 we enable PAE, PGE (Paging Global Extension) and put the physical address of the phys_base (see above) to the rax register and fill cr3 register with it:

    movl    $(X86_CR4_PAE | X86_CR4_PGE), %ecx
    movq    %rcx, %cr4

    addq    phys_base(%rip), %rax
    movq    %rax, %cr3

In the next step we check that CPU supports NX bit with:

    movl    $0x80000001, %eax
    movl    %edx,%edi

We put 0x80000001 value to the eax and execute cpuid instruction for getting the extended processor info and feature bits. The result will be in the edx register which we put to the edi.

Now we put 0xc0000080 or MSR_EFER to the ecx and call rdmsr instruction for the reading model specific register.

    movl    $MSR_EFER, %ecx

The result will be in the edx:eax. General view of the EFER is following:

63                                                                              32
|                                                                               |
|                                Reserved MBZ                                   |
|                                                                               |
31                            16  15      14      13   12  11   10  9  8 7  1   0
|                              | T |       |       |    |   |   |   |   |   |   |
| Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
|                              | E |       |       |    |   |   |   |   |   |   |

We will not see all fields in details here, but we will learn about this and other MSRs in a special part about it. As we read EFER to the edx:eax, we check _EFER_SCE or zero bit which is System Call Extensions with btsl instruction and set it to one. By the setting SCE bit we enable SYSCALL and SYSRET instructions. In the next step we check 20th bit in the edi, remember that this register stores result of the cpuid (see above). If 20 bit is set (NX bit) we just write EFER_SCE to the model specific register.

    btsl    $_EFER_SCE, %eax
    btl        $20,%edi
    jnc     1f
    btsl    $_EFER_NX, %eax
    btsq    $_PAGE_BIT_NX,early_pmd_flags(%rip)
1:    wrmsr

If the NX bit is supported we enable _EFER_NX and write it too, with the wrmsr instruction. After the NX bit is set, we set some bits in the cr0 control register, namely:

  • X86_CR0_PE - system is in protected mode;
  • X86_CR0_MP - controls interaction of WAIT/FWAIT instructions with TS flag in CR0;
  • X86_CR0_ET - on the 386, it allowed to specify whether the external math coprocessor was an 80287 or 80387;
  • X86_CR0_NE - enable internal x87 floating point error reporting when set, else enables PC style x87 error detection;
  • X86_CR0_WP - when set, the CPU can't write to read-only pages when privilege level is 0;
  • X86_CR0_AM - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
  • X86_CR0_PG - enable paging.

by the execution following assembly code:

#define CR0_STATE    (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
             X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
movl    $CR0_STATE, %eax
movq    %rax, %cr0

We already know that to run any code, and even more C code from assembly, we need to setup a stack. As always, we are doing it by the setting of stack pointer to a correct place in memory and resetting flags register after this:

movq initial_stack(%rip), %rsp
pushq $0

The most interesting thing here is the initial_stack. This symbol is defined in the source code file and looks like:

    .quad  init_thread_union+THREAD_SIZE-8

The GLOBAL is already familiar to us from. It defined in the arch/x86/include/asm/linkage.h header file expands to the global symbol definition:

#define GLOBAL(name)    \
         .globl name;           \

The THREAD_SIZE macro is defined in the arch/x86/include/asm/page_64_types.h header file and depends on value of the KASAN_STACK_ORDER macro:


We consider when the kasan is disabled and the PAGE_SIZE is 4096 bytes. So the THREAD_SIZE will expands to 16 kilobytes and represents size of the stack of a thread. Why is thread? You may already know that each process may have parent processes and child processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the union with the thread_info structure.

And as we can see the init_thread_union is represented by the thread_union union. Earlier this union looked like:

union thread_union {
         struct thread_info thread_info;
         unsigned long stack[THREAD_SIZE/sizeof(long)];

but from the Linux kernel 4.9-rc1 release, thread_info was moved to the task_struct structure which represents a thread. So, for now thread_union looks like:

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];

where the CONFIG_THREAD_INFO_IN_TASK kernel configuration option is enabled for x86_64 architecture. So, as we consider only x86_64 architecture in this book, an instance of thread_union will contain only stack and thread_info structure will be placed in the task_struct.

The init_thread_union looks like:

union thread_union init_thread_union __init_task_data = {

which represents just thread stack. Now we may understand this expression:

    .quad  init_thread_union+THREAD_SIZE-8

that initial_stack symbol points to the start of the thread_union.stack array + THREAD_SIZE which is 16 killobytes and - 8 bytes. Here we need to subtract 8 bytes at the to of stack. This is necessary to guarantee illegal access of the next page memory.

After the early boot stack is set, to update the Global Descriptor Table with the lgdt instruction:

lgdt    early_gdt_descr(%rip)

where the early_gdt_descr is defined as:

    .word    GDT_ENTRIES*8-1
    .quad    INIT_PER_CPU_VAR(gdt_page)

We need to reload Global Descriptor Table because now kernel works in the low userspace addresses, but soon kernel will work in its own space. Now let's look at the definition of early_gdt_descr. Global Descriptor Table contains 32 entries:

#define GDT_ENTRIES 32

for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the definition of the early_gdt_descr_base.

First of gdt_page defined as:

struct gdt_page {
    struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));

in the arch/x86/include/asm/desc.h. It contains one field gdt which is array of the desc_struct structure which is defined as:

struct desc_struct {
         union {
                 struct {
                         unsigned int a;
                         unsigned int b;
                 struct {
                         u16 limit0;
                         u16 base0;
                         unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                         unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
 } __attribute__((packed));

and presents familiar to us GDT descriptor. Also we can note that gdt_page structure aligned to PAGE_SIZE which is 4096 bytes. It means that gdt will occupy one page. Now let's try to understand what is INIT_PER_CPU_VAR. INIT_PER_CPU_VAR is a macro which defined in the arch/x86/include/asm/percpu.h and just concats init_per_cpu__ with the given parameter:

#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

After the INIT_PER_CPU_VAR macro will be expanded, we will have init_per_cpu__gdt_page. We can see in the linker script:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load

As we got init_per_cpu__gdt_page in INIT_PER_CPU_VAR and INIT_PER_CPU macro from linker script will be expanded we will get offset from the __per_cpu_load. After this calculations, we will have correct base address of the new GDT.

Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create per-CPU variable, each CPU will have will have its own copy of this variable. Here we creating gdt_page per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own GDT table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about per-CPU variables in the Theory/per-cpu post.

As we loaded new Global Descriptor Table, we reload segments as we did it every time:

    xorl %eax,%eax
    movl %eax,%ds
    movl %eax,%ss
    movl %eax,%es
    movl %eax,%fs
    movl %eax,%gs

After all of these steps we set up gs register that it post to the irqstack which represents special stack where interrupts will be handled on:

    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx

where MSR_GS_BASE is:

#define MSR_GS_BASE             0xc0000101

We need to put MSR_GS_BASE to the ecx register and load data from the eax and edx (which are point to the initial_gs) with wrmsr instruction. We don't use cs, fs, ds and ss segment registers for addressing in the 64-bit mode, but fs and gs registers can be used. fs and gs have a hidden part (as we saw it in the real mode for cs) and this part contains descriptor which mapped to Model Specific Registers. So we can see above 0xc0000101 is a gs.base MSR address. When a system call or interrupt occurred, there is no kernel stack at the entry point, so the value of the MSR_GS_BASE will store address of the interrupt stack.

In the next step we put the address of the real mode bootparam structure to the rdi (remember rsi holds pointer to this structure from the start) and jump to the C code with:

    movq    initial_code(%rip), %rax
    pushq    $__KERNEL_CS    # set correct cs
    pushq    %rax        # target address in negative space

Here we put the address of the initial_code to the rax and push fake address, __KERNEL_CS and the address of the initial_code to the stack. After this we can see lretq instruction which means that after it return address will be extracted from stack (now there is address of the initial_code) and jump there. initial_code is defined in the same source code file and looks:

    .balign    8
    .quad    x86_64_start_kernel

As we can see initial_code contains address of the x86_64_start_kernel, which is defined in the arch/x86/kerne/head64.c and looks like this:

asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {

It has one argument is a real_mode_data (remember that we passed address of the real mode data to the rdi register previously).

This is first C code in the kernel!

Next to start_kernel

We need to see last preparations before we can see "kernel entry point" - start_kernel function from the init/main.c.

First of all we can see some checks in the x86_64_start_kernel function:

BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - __STAT_KERNEL_map, that kernel text with modules is not less than image of the kernel and etc... BUILD_BUG_ON is a macro which looks as:

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

Let's try to understand how this trick works. Let's take for example first condition: MODULES_VADDR < __START_KERNEL_map. !!conditions is the same that condition != 0. So it means if MODULES_VADDR < __START_KERNEL_map is true, we will get 1 in the !!(condition) or zero if not. After 2*!!(condition) we will get or 2 or 0. In the end of calculations we can get two different behaviors:

  • We will have compilation error, because try to get size of the char array with negative index (as can be in our case, because MODULES_VADDR can't be less than __START_KERNEL_map will be in our case);
  • No compilation errors.

That's all. So interesting C trick for getting compile error which depends on some constants.

In the next step we can see call of the cr4_init_shadow function which stores shadow copy of the cr4 per cpu. Context switches can change bits in the cr4 so we need to store cr4 for each CPU. And after this we can see call of the reset_early_page_tables function where we resets all page global directory entries and write new pointer to the PGT in cr3:

for (i = 0; i < PTRS_PER_PGD-1; i++)
    early_level4_pgt[i].pgd = 0;

next_early_pgt = 0;


Soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (PTRS_PER_PGD is 512) in the loop and make it zero. After this we set next_early_pgt to zero (we will see details about it in the next post) and write physical address of the early_level4_pgt to the cr3. __pa_nodebug is a macro which will be expanded to:

((unsigned long)(x) - __START_KERNEL_map + phys_base)

After this we clear _bss from the __bss_stop to __bss_start and the next step will be setup of the early IDT handlers, but it's big concept so we will see it in the next part.


This is the end of the first part about linux kernel initialization.

If you have questions or suggestions, feel free to ping me in twitter 0xAX, drop me email or just create issue.

In the next part we will see initialization of the early interruption handlers, kernel space memory mapping and a lot more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to linux-insides.

results matching ""

    No results matching ""