Page tables enable each process to have its own private memory space.
Memory is sliced into small pages (PGSIZE, 4 KiB) so it can be
distributed to many processes without much fragmentation. Page tables also
enable a few xv6 tricks:

- mapping the same memory (a trampoline page) in several address spaces
- guarding kernel and user stacks with an unmapped page

3.1 Paging hardware


RISC-V instructions (both user and kernel) manipulate virtual addresses.
Physical memory (RAM) = physical addresses (phy@ in the notes).
Virtual memory = fake memory addresses (virt@ in the notes).
RISC-V page table hardware maps each virt@ to a phy@.

XV6 = Sv39 RISC-V = only bottom 39 bits for virt@ (top 25 bits are not used).
2^39 @ / 2^12 bytes per page = 2^27 page table entries (PTEs)
a PTE = 44 bits of physical page number (PPN) + flags

vocabulary, acronyms, et cetera:
- physical/virtual addresses = phy@/virt@
- physical address (what the computer can address) = 2^56 bytes = 65536 TiB
- virtual address (what a single process can address) = 2^39 bytes = 512 GiB
- Page Table Entry = PTE
- Physical Page Number = PPN
- Page Directory = the page table is split into 3 levels of "small page tables" called page directories
- Translation Look-aside Buffer = TLB
- Supervisor Address Translation and Protection = satp (it's a register)
- address space = set of valid virtual addresses in a given page table
                = the kernel also has its own address space
- User memory = its address space + physical memory allowed by the page table
- Virtual memory = ideas and techniques associated with managing page tables
                   used to achieve isolation & such
- direct mapping = virt@ == phy@
- KVM/UVM = Kernel/User virtual memory
- a page-table page = 512 PTEs of 8 bytes each = fits in 1 memory page (4096 bytes)

virt@ = [ 25-bit EXT ; 27-bit index ; 12-bit offset ]
        ↑64          ↑39            ↑12             ↑0
                              index = index to the PPN in the page table
page table = 2^27 entries
page table entry = [ 44-bit PPN ; 10-bit flags ]
                   ↑54          ↑10            ↑0
phy@ = [ 44-bit PPN (indexed by virt@ index) ; 12-bit virt@ offset ]
       ↑56                                   ↑12                   ↑0

virt@ = 39 (usable) bits, phy@ = 56 bits

The paging hardware translates a virt@ by using the top 27 of its 39 bits to find a PTE.
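The bit layouts above can be checked with a few lines of C. This is a minimal sketch, not the kernel's code: the three macros follow the PA2PTE/PTE2PA/PTE_FLAGS helpers of xv6's kernel/riscv.h as recalled (treat the exact spelling as an assumption), and `pte_to_phys` is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096u /* bytes per page, so 12 bits of offset */

/* phy@ of a page -> PTE bits: drop the 12 offset bits, leave room for 10 flag bits */
#define PA2PTE(pa)     ((((uint64_t)(pa)) >> 12) << 10)
/* PTE -> phy@ of the page it points to */
#define PTE2PA(pte)    ((((uint64_t)(pte)) >> 10) << 12)
/* the low 10 bits of a PTE are the flags */
#define PTE_FLAGS(pte) (((uint64_t)(pte)) & 0x3FF)

/* Sizes quoted in the notes, checked at compile time. */
static_assert((1ULL << 39) == (512ULL << 30), "2^39 bytes = 512 GiB");
static_assert((1ULL << 56) == (65536ULL << 40), "2^56 bytes = 65536 TiB");
static_assert((1ULL << 39) / PGSIZE == (1ULL << 27), "2^27 pages of 4 KiB");

/* Hypothetical helper: phy@ = PPN taken from the PTE + offset taken from the virt@. */
static uint64_t pte_to_phys(uint64_t pte, uint64_t va)
{
    return PTE2PA(pte) | (va & (PGSIZE - 1));
}
```

With a PTE pointing at the page 0x80001000, `pte_to_phys(pte, 0x3abc)` keeps only the offset 0xabc of the virt@ and yields 0x80001abc.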
IN REAL LIFE the page table is a tree made of 3 levels of small tables, and the
27-bit index is split into 3*9 bits used as indexes into these 3 levels.
These 3 levels can be called "directories":
1. a "root" (a 4096-byte page table)
   contains 512 PTEs which contain the phy@ of the next-level directory
2. a "middle" (idem)
3. a "final" (idem, but its PTEs contain the PPN of the target page)
In Sv48, there is an extra page table before "root" which takes bits 39
through 47 of a virt@.
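The 3*9-bit split can be sketched as follows. The PX/PXSHIFT/PXMASK names follow xv6's kernel/riscv.h as recalled (an assumption), while `PGOFFSET` is a hypothetical name used here only to extract the 12-bit offset.

```c
#include <assert.h>
#include <stdint.h>

#define PGSHIFT 12    /* bits of offset within a page */
#define PXMASK  0x1FF /* 9 bits per level */

/* shift needed to reach the 9-bit index of a level (2 = root, 1 = middle, 0 = final) */
#define PXSHIFT(level) (PGSHIFT + 9 * (level))
/* index into the directory of that level for a given virt@ */
#define PX(level, va)  ((((uint64_t)(va)) >> PXSHIFT(level)) & PXMASK)
/* hypothetical helper: offset inside the page */
#define PGOFFSET(va)   (((uint64_t)(va)) & ((1u << PGSHIFT) - 1))
```

For example, a virt@ built from root index 1, middle index 2, final index 3 and offset 4 decodes back to those four values with PX(2, va), PX(1, va), PX(0, va) and PGOFFSET(va).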

If any of the 3 PTEs needed to translate an address is invalid, the paging
hardware raises a "page-fault exception" (exceptions are explained in chapter 4).

A RISC-V CPU caches page table entries in a Translation Look-aside Buffer
(TLB) to avoid costly loads of PTEs from physical memory.

PTE flags:
- PTE_V valid (is the PTE present or not)
- PTE_U user mode (if not set, the page can only be used in supervisor mode)
- PTE_R read: can instructions read the page?
- PTE_W write: can instructions write the page?
- PTE_X execute: can the CPU interpret the page's content as instructions and execute them?
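The flags can be read as a small permission check. The bit positions below are the ones of the RISC-V PTE format. `enum access` and `pte_allows` are hypothetical names, and the sketch deliberately models only the rules listed above (supervisor-mode subtleties are left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_V (1ULL << 0) /* valid */
#define PTE_R (1ULL << 1) /* readable */
#define PTE_W (1ULL << 2) /* writable */
#define PTE_X (1ULL << 3) /* executable */
#define PTE_U (1ULL << 4) /* usable from user mode */

/* hypothetical: the kind of access being attempted */
enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

/* hypothetical helper: would this PTE let the access through? */
static bool pte_allows(uint64_t pte, enum access acc, bool user_mode)
{
    if (!(pte & PTE_V))
        return false; /* PTE not present: page fault */
    if (user_mode && !(pte & PTE_U))
        return false; /* supervisor-only page */
    switch (acc) {
    case ACC_READ:  return (pte & PTE_R) != 0;
    case ACC_WRITE: return (pte & PTE_W) != 0;
    case ACC_EXEC:  return (pte & PTE_X) != 0;
    }
    return false;
}
```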
satp register = where to put the phy@ of the root page table to be used by the CPU.
Once satp is set, subsequent instructions are interpreted with the provided page table.
Before satp is set, instructions use phy@.
Each CPU has its own satp, so each CPU can run user code with a different page table.
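What actually goes into satp is not the raw phy@ but a mode field plus the PPN of the root page table. A minimal sketch, assuming the Sv39 layout (mode 8 in bits 63:60, PPN in bits 43:0); SATP_SV39 and MAKE_SATP follow xv6's kernel/riscv.h as recalled, and `satp_root` and `satp_mode` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_SV39 (8ULL << 60) /* mode field: 8 selects Sv39 */
/* satp value for a root page table located at phy@ `pagetable` */
#define MAKE_SATP(pagetable) (SATP_SV39 | (((uint64_t)(pagetable)) >> 12))

/* hypothetical helper: recover the phy@ of the root page table from a satp value */
static uint64_t satp_root(uint64_t satp)
{
    return (satp & ((1ULL << 44) - 1)) << 12;
}

/* hypothetical helper: extract the 4-bit mode field */
static unsigned satp_mode(uint64_t satp)
{
    return (unsigned)(satp >> 60);
}
```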

XV6 = 1 page table per process and 1 page table for the kernel
XV6 kernel page table:
- direct mapping for most pages
- no direct mapping for the trampoline page
- no direct mapping for the kernel stack pages
  => these pages are related to the processes (`kstack` in the `proc` structure)
  => each kstack has an invalid guard page (PTE_V not set) below it
     so that a stack overflow faults instead of corrupting memory

Virtual memory functions:
  - walk: find the PTE for a virt@ (can allocate page-table pages on the way)
  - mappages: install PTEs for new mappings
  - kvm* = kernel virtual memory functions (kernel page table)
  - uvm* = same but for a user process
  - copyin = copy data from user space to kernel space
  - copyout = copy data from kernel space to user space
  - copyinstr = copy a null-terminated string from user to kernel
              = used for paths given in syscalls for example
  - kvminit sets the root kernel page table created by kvmmake
  - kvmmake creates a direct-map page table for the kernel
    1. create the root kernel page table with a call to `kalloc`
       (kalloc provides a pointer to a page, which is the type `pagetable_t`)
    2. call kvmmap (a wrapper around `mappages` that handles errors) multiple times to set a few direct-map pages
       kvmmap adds mappings to the kernel page table (when booting only), doesn't flush the TLB or enable paging
       mapped stuff:
         uart registers, virtio mmio disk interface, PLIC, kernel text and data
         and the trampoline (for trap entry/exit), which is mapped to the highest virtual address in the kernel
    3. proc_mapstacks allocates a kernel stack for each process in the `proc` (static) array of processes in `proc.c`
       each kernel stack page is mapped below the TRAMPOLINE page, with an invalid guard page next to it

       side note: the function looks more complex than it is, it uses pointer arithmetic (p - proc) just to get
       the index of the process in the array (0 to NPROC-1)

       TRAMPOLINE is a macro to get the virt@ of the trampoline page (MAXVA - PGSIZE)
       KSTACK(p) is a macro to place a kernel stack page below the trampoline page with a guard page
       KSTACK(p) = (TRAMPOLINE - ((p)+1)* 2*PGSIZE)

       TRAMPOLINE & KSTACK are macros in memlayout.h and are used in proc_mapstacks
  - main calls kvminithart which sets satp to the kernel root page table so the CPU can start using it
  - each CPU caches PTEs in a Translation Look-aside Buffer (TLB)
    => when xv6 changes a page table it must tell the CPU to invalidate the cached entries in the TLB
    => RISC-V has an `sfence.vma` instruction to flush the current CPU's TLB
       => `kvminithart` executes it after setting satp
       => `sfence.vma` is also executed before setting satp
           "to ensure that preceding updates to the page table have completed,
           and ensures that preceding loads and stores use the old page table,
           not the new one"
       => the trampoline code executes it when switching to a user page table, before entering user space
       => RISC-V CPUs may support address space identifiers (ASIDs) to tag TLB entries per address space
          => avoids flushing the entire TLB
          => xv6 doesn't use this feature
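The kernel stack layout given by TRAMPOLINE and KSTACK can be worked out numerically. A minimal sketch: the MAXVA definition and NPROC = 64 are taken from xv6's riscv.h and param.h as recalled (assumptions), `struct proc` is reduced to the single field used here, and `assign_kstacks` is a hypothetical stand-in for the loop that uses `p - proc` as an index.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE     4096ULL
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1)) /* one bit less than the 39 usable bits */
#define TRAMPOLINE (MAXVA - PGSIZE)               /* highest page of the address space */
#define KSTACK(p)  (TRAMPOLINE - ((p) + 1) * 2 * PGSIZE)

#define NPROC 64 /* assumed size of the static process table */

struct proc { uint64_t kstack; }; /* only the field used in this sketch */
static struct proc proc[NPROC];

/* hypothetical helper: give every process slot its kernel stack virt@ */
static void assign_kstacks(void)
{
    for (struct proc *p = proc; p < &proc[NPROC]; p++)
        p->kstack = KSTACK((int)(p - proc)); /* pointer difference = array index */
}
```

Each slot consumes 2 pages of virtual address space: the stack page at KSTACK(p) and the unmapped page just below it. KSTACK(0) - PGSIZE equals KSTACK(1) + PGSIZE, so the guard below stack 0 is also the hole above stack 1.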

Physical memory management consists of handling memory pages from the `kmem.freelist` linked list.
Allocating a page removes an entry from this list,
freeing a page adds it back to the list.
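The free list idea can be sketched in user space. This is a simplified model, not xv6's kalloc.c: there is no locking, the "physical memory" is a small static arena of NPAGES pages (a hypothetical size), and `kinit` simply frees every page of that arena. The `struct run` / `kmem.freelist` names follow the notes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PGSIZE 4096
#define NPAGES 8 /* hypothetical arena size for the sketch */

/* a free page stores the link to the next free page in its own first bytes */
struct run { struct run *next; };

union page { struct run run; unsigned char bytes[PGSIZE]; };

static struct { struct run *freelist; } kmem;
static union page arena[NPAGES];

/* free a page: push it on the front of the list */
static void kfree(void *pa)
{
    union page *pg = pa;
    memset(pg->bytes, 1, PGSIZE); /* junk fill to expose dangling references */
    pg->run.next = kmem.freelist;
    kmem.freelist = &pg->run;
}

/* allocate a page: pop the front of the list, NULL when out of memory */
static void *kalloc(void)
{
    struct run *r = kmem.freelist;
    if (r) {
        kmem.freelist = r->next;
        memset(r, 5, PGSIZE); /* junk fill so stale contents are not relied on */
    }
    return r;
}

/* hand every page of the arena to the allocator */
static void kinit(void)
{
    for (int i = 0; i < NPAGES; i++)
        kfree(&arena[i]);
}
```

Because pages are pushed and popped at the head, the most recently freed page is the next one handed out.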

sbrk is implemented with the function `growproc` (in proc.c) which calls either uvmalloc or uvmdealloc.

Function signatures (for reference):
  void kvminit(void); // set the kernel root page table
  pagetable_t kvmmake(void); // create the kernel page table
  void kvmmap(pagetable_t, uint64 virt@, uint64 phy@, uint64 sz, int perm); // add PTEs to the kernel page table
   => `kvmmap` is a simple wrapper around `mappages` (automatically calls `panic` if an error occurs)
  int mappages(pagetable_t, uint64 virt@, uint64 size, uint64 phy@, int perm); // create PTEs
   => uses walk to find a PTE based on a virt@ (`walk` can also allocate page-table pages)
  pte_t * walk(pagetable_t pagetable, uint64 va, int alloc); // virt@ -> PTE (can allocate it if 'alloc' is set)
   => returns the address of the PTE in the lowest layer of the tree
  uint64 walkaddr(pagetable_t pagetable, uint64 va); // virt@ -> phy@
  void proc_mapstacks(pagetable_t kpgtbl); // allocate a kernel stack for each process
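The behaviour of `walk` can be imitated in user space to see the 3-level descent at work. This is a sketch under assumptions, not the kernel's function: page-table pages come from `aligned_alloc` through a hypothetical `alloc_table` (standing in for `kalloc`), host pointers play the role of phy@ (so a 64-bit host is assumed), and the MAXVA check is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t pte_t;
typedef uint64_t *pagetable_t; /* one page holding 512 PTEs */

#define PGSIZE        4096
#define PTE_V         (1ULL << 0)
#define PX(level, va) ((((uint64_t)(va)) >> (12 + 9 * (level))) & 0x1FF)
#define PA2PTE(pa)    ((((uint64_t)(pa)) >> 12) << 10)
#define PTE2PA(pte)   ((((uint64_t)(pte)) >> 10) << 12)

/* hypothetical stand-in for kalloc: one zeroed, page-aligned page */
static pagetable_t alloc_table(void)
{
    void *page = aligned_alloc(PGSIZE, PGSIZE);
    if (page)
        memset(page, 0, PGSIZE);
    return page;
}

/* return the address of the final-level PTE for va, or NULL if it is missing */
static pte_t *walk(pagetable_t pagetable, uint64_t va, int alloc)
{
    for (int level = 2; level > 0; level--) {
        pte_t *pte = &pagetable[PX(level, va)];
        if (*pte & PTE_V) {
            /* follow the PTE to the next-level directory */
            pagetable = (pagetable_t)(uintptr_t)PTE2PA(*pte);
        } else {
            /* missing directory: allocate it only if the caller asked for it */
            if (!alloc || (pagetable = alloc_table()) == NULL)
                return NULL;
            *pte = PA2PTE((uintptr_t)pagetable) | PTE_V;
        }
    }
    return &pagetable[PX(0, va)];
}
```

Calling `walk(root, va, 0)` on an empty root returns NULL, while `walk(root, va, 1)` creates the middle and final directories and returns a PTE slot the caller can then fill, which is how `mappages` is described as using it.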


3.8 Code: exec


exec = syscall replacing a process's user address space with data read from a file
     = related files: kernel/{elf.h,exec.c}

  1. open a file with `namei`
  2. read the ELF header (see kernel/elf.h for more info) matching the structure `elfhdr`
  3. read the subsequent ELF program section headers corresponding to the `proghdr` structure
     each of them describes a part of the application that must be loaded into memory
     xv6 programs only have two program section headers: instructions and data

How exec works

  1. check if the file actually is an ELF binary
  2. allocate a new page table with no user mapping with `proc_pagetable` (kernel/exec.c:49)
  3. allocate memory for each ELF segment with `uvmalloc` (kernel/exec.c:65)
  4. load each segment with `loadseg` (kernel/exec.c:10)
     `loadseg` uses `readi` to read from the file
     `loadseg` uses `walkaddr` to find the phy@ of the allocated memory at which to write ELF segments
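Step 1 of "How exec works" boils down to comparing the first 4 bytes of the file with the ELF magic number. A minimal sketch: ELF_MAGIC has the value used in xv6's kernel/elf.h as recalled (an assumption), and `is_elf` is a hypothetical helper working on a byte buffer rather than on an inode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ELF_MAGIC 0x464C457FU /* "\x7F" "ELF" read as a little-endian 32-bit integer */

/* hypothetical helper: does this buffer start like an ELF file? */
static int is_elf(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return 0; /* too short to hold the magic number */
    /* decode the first 4 bytes as little-endian, independent of the host byte order */
    uint32_t magic = (uint32_t)buf[0]
                   | ((uint32_t)buf[1] << 8)
                   | ((uint32_t)buf[2] << 16)
                   | ((uint32_t)buf[3] << 24);
    return magic == ELF_MAGIC;
}
```

A shell script starting with "#!" fails this check, which is why exec rejects it before allocating any page table.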

To read the program headers of a binary: objdump -p file

DEVICE TREE (devicetree, DT or DTB for device tree blob or even Flattened Devicetree Blob)
  => file representing the hardware so the kernel can initialize devices and provide them to the users.

Before DTs:
  hardware was either hardcoded (with each system-on-chip being listed and their devices enumerated) or
  detected with A LOT OF platform-specific features such as BIOS, EFI, ACPI, etc.

DTs are an improvement over ACPI since ACPI isn't properly standardized and implementations are often buggy.

IN QEMU: this file can be created by QEMU and loaded by a bootloader somewhere in RAM, or provided to QEMU.
IN REAL LIFE: this file may be generated statically then loaded by the
bootloader (or EFI?) in RAM, then its address is provided to the kernel.
Real kernels will get their configuration from different sources anyway.