
don't leave parts of the bootloader in the kernel's address space #239

Open
Freax13 opened this issue Jun 26, 2022 · 7 comments · May be fixed by #240

Comments

@Freax13
Contributor

Freax13 commented Jun 26, 2022

While implementing finer-grained ASLR, I came across this comment:

```rust
used.entry_state[0] = true; // TODO: Can we do this dynamically?
```

We mark the first 512GiB of the address space as unusable for dynamically generated addresses. I think we do this because we identity map the context switch code into kernel memory and this code most likely resides within the first 512GiB of the address space:
```rust
// identity-map context switch function, so that we don't get an immediate pagefault
// after switching the active page table
let context_switch_function = PhysAddr::new(context_switch as *const () as u64);
let context_switch_function_start_frame: PhysFrame =
    PhysFrame::containing_address(context_switch_function);
for frame in PhysFrame::range_inclusive(
    context_switch_function_start_frame,
    context_switch_function_start_frame + 1,
) {
    match unsafe {
        kernel_page_table.identity_map(frame, PageTableFlags::PRESENT, frame_allocator)
    } {
        Ok(tlb) => tlb.flush(),
        Err(err) => panic!("failed to identity map frame {:?}: {:?}", frame, err),
    }
}
```
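For context on why `entry_state[0]` blocks 512 GiB: each level-4 (PML4) entry spans 512 level-3 entries, each spanning 512 level-2 entries, each spanning 512 pages of 4 KiB. A quick sanity check of that arithmetic (a standalone sketch, not bootloader code):

```rust
// One PML4 entry covers 512 * 512 * 512 pages of 4 KiB each.
const BYTES_PER_PML4_ENTRY: u64 = 512 * 512 * 512 * 4096;

fn main() {
    let gib = 1u64 << 30;
    // Marking the first PML4 entry as used therefore reserves 512 GiB.
    assert_eq!(BYTES_PER_PML4_ENTRY / gib, 512);
    println!("one PML4 entry spans {} GiB", BYTES_PER_PML4_ENTRY / gib);
}
```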

This causes a number of (admittedly small and unlikely) problems:

  • The identity mapped pages could overlap with the kernel or other mappings
  • We don't expose the identity mapped addresses to the kernel in Mappings
  • An attacker could make use of the identity mapped pages to defeat ASLR
  • We mark a lot of usable memory as unusable, and because of that we can't check for overlaps: there would be too many false positives. We currently just ignore overlaps.

We could probably work around those problems while still mapping parts of the bootloader into the kernel's address space, but I'd like to propose another solution: we use another, very short-lived page table to do the context switch. This page table would only map a few pages containing code that switches to the kernel's page table. Importantly, we would set the page table up in such a way that the kernel's entry point is just after the page-table switch instruction, so we don't need any code to jump to the kernel; it would simply be the next instruction.
I don't think we could reliably map such code into the bootloader's address space, because we'd have to map the code just before the kernel's entry point, which could be close to the bootloader's code. That's why I want to use a short-lived page table.
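To make the layout concrete, here is a sketch of where the switch instruction would have to live (the helper name is illustrative, not a bootloader API; `mov cr3, rax` encodes as the 3 bytes `0F 22 D8`):

```rust
// Opcode bytes of `mov cr3, rax` (3 bytes).
const MOV_CR3_RAX: [u8; 3] = [0x0f, 0x22, 0xd8];
const PAGE_SIZE: u64 = 4096;

/// Returns (virtual page base, offset within that page) where the
/// page-table switch instruction must start so that the kernel's entry
/// point is the very next instruction. If the offset is greater than
/// PAGE_SIZE - 3, the instruction straddles a page boundary and the
/// preceding page must also be mapped in the trampoline page table.
fn switch_instruction_location(entry_point: u64) -> (u64, u64) {
    let start = entry_point.wrapping_sub(MOV_CR3_RAX.len() as u64);
    (start & !(PAGE_SIZE - 1), start % PAGE_SIZE)
}

fn main() {
    // Entry point example used later in this thread: 0x2ec060.
    let (page, offset) = switch_instruction_location(0x2ec060);
    assert_eq!(page, 0x2ec000);
    assert_eq!(offset, 0x5d); // 0x60 - 3
}
```

The trampoline page table would map that one page (or two, in the straddling case) at the computed location, write the opcode bytes there, and let execution fall through into the kernel.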

We also identity map a GDT into the kernel's address space:

```rust
// create, load, and identity-map GDT (required for working `iretq`)
let gdt_frame = frame_allocator
    .allocate_frame()
    .expect("failed to allocate GDT frame");
gdt::create_and_load(gdt_frame);
match unsafe {
    kernel_page_table.identity_map(gdt_frame, PageTableFlags::PRESENT, frame_allocator)
} {
    Ok(tlb) => tlb.flush(),
    Err(err) => panic!("failed to identity map frame {:?}: {:?}", gdt_frame, err),
}
```

We should probably make the GDT's location configurable and expose it in Mappings.

I'd be happy to work on a PR for this.

@bjorn3
Contributor

bjorn3 commented Jun 26, 2022

Can't the kernel make a page table from scratch and simply not map this memory range to the bootloader? I would expect any kernel implementing KASLR or a userspace to build their page tables from scratch and not identity map anything. AFAIK only the physical memory map needs to be respected. The virtual memory mapping can vary freely as a kernel wishes.

@phil-opp
Member

> We use another very short lived page table to do the context switch. This page table would only map a few pages containing code that switches to the kernel's page table. Importantly, we would set the page table up in such a way that the kernel's entrypoint is just after the page table switch instruction, so we don't have to use any code to jump to the kernel, it would simply be the next instruction.

Interesting idea! However, AFAIK the kernel's entry point address can be at an arbitrary offset, e.g. in the middle of the .text section. So the memory before the entry point might already be used by other kernel code.

@Freax13
Contributor Author

Freax13 commented Jun 27, 2022

> Can't the kernel make a page table from scratch and simply not map this memory range to the bootloader? I would expect any kernel implementing KASLR or a userspace to build their page tables from scratch [...]

Well, in theory a kernel could do anything that we do in stage 4, so yeah, they could totally just create their own page tables, but I'd argue that we shouldn't expect kernels to do that. Personally, in my kernel, I copy and update the page table created by the bootloader but never create a new page table completely from scratch, and it's been working great.

> [...] and not identity map anything.

That's exactly my point: none of the pages in the page table created by the bootloader are identity mapped except for the context switch code and the GDT.

@Freax13
Contributor Author

Freax13 commented Jun 27, 2022

> We use another very short lived page table to do the context switch. This page table would only map a few pages containing code that switches to the kernel's page table. Importantly, we would set the page table up in such a way that the kernel's entrypoint is just after the page table switch instruction, so we don't have to use any code to jump to the kernel, it would simply be the next instruction.

> Interesting idea! However, AFAIK the kernel's entry point address can be an arbitrary offset, e.g. in the middle of the .text section. So the memory before the entry point might already be used by other kernel code.

The short-lived context switch page table wouldn't contain any entries from the kernel's page table; it'd just contain a few entries needed to switch to the kernel's page table, so there's no way the two could overlap.

@phil-opp
Member

Ah, I think I understand what you mean now. Assuming the kernel's entry point address is 0x2ec060. We would then map the context switch function in the temp page table in a way that it lives on the same virtual page as the entry point? We also offset it within the page so that the page table reload happens exactly at the instruction before 0x2ec060? Does this always work without violating any alignment requirements?

@Freax13
Contributor Author

Freax13 commented Jun 27, 2022

> Ah, I think I understand what you mean now. Assuming the kernel's entry point address is 0x2ec060. We would then map the context switch function in the temp page table in a way that it lives on the same virtual page as the entry point? We also offset it within the page so that the page table reload happens exactly at the instruction before 0x2ec060?

Yes, except that instead of mapping the context switch function, we might just write the opcodes manually. I don't think we'll need many, and it's probably easier/more reliable than making the function work when placed at a different address.

> Does this always work without violating any alignment requirements?

Almost. I'm not aware of any alignment requirements that could cause problems, but there's another problem: this won't work if the entry point is placed right after the non-canonical address space gap. The instruction pointer will not automatically jump over the gap, so this would cause a general protection fault. `mov cr3, rax` is a 3-byte instruction, so if the entry point is at 0xffff_8000_0000_0000, 0xffff_8000_0000_0001, or 0xffff_8000_0000_0002, this won't work. All other locations (including 0) should work fine though.
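That corner case is easy to check up front. A minimal sketch, assuming 48-bit virtual addresses (so the higher half starts at 0xffff_8000_0000_0000); the helper name is illustrative:

```rust
// First canonical address of the higher half with 48-bit virtual
// addresses. RIP cannot cross the non-canonical gap, so the 3-byte
// `mov cr3, rax` cannot end exactly at an entry point that lies
// within the first 3 bytes of the higher half.
const HIGHER_HALF_START: u64 = 0xffff_8000_0000_0000;
const SWITCH_INSN_LEN: u64 = 3;

fn entry_point_supported(entry_point: u64) -> bool {
    !(HIGHER_HALF_START..HIGHER_HALF_START + SWITCH_INSN_LEN).contains(&entry_point)
}

fn main() {
    assert!(!entry_point_supported(0xffff_8000_0000_0000));
    assert!(!entry_point_supported(0xffff_8000_0000_0002));
    assert!(entry_point_supported(0xffff_8000_0000_0003));
    assert!(entry_point_supported(0)); // 0 works, per the comment above
}
```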

@phil-opp
Member

> if the entrypoint is at 0xffff_8000_0000_0000, 0xffff_8000_0000_0001 or 0xffff_8000_0000_0002, this won't work

I don't think that there are kernels that link their .text section right at the lower/upper half boundary. So that should not be a problem.

> Yes, except that instead of mapping the context switch function, we might just write the opcodes manually. I don't think we'll need many, and it's probably easier/more reliable than making the function work when placed at a different address.

Sounds like it would be worth a try! So feel free to open a PR if you like, preferably against the next branch (I'm trying my best to finish the rewrite soon).
