xv6-riscv/notes/chapter3-page-tables

Page tables give each process its own private memory space.
Memory is sliced into small pages (PGSIZE, 4KiB) so it can be
distributed to many processes without much fragmentation. Page tables
also enable a few tricks in xv6:

- mapping the same memory (a trampoline page) in several address spaces
- guarding kernel and user stacks with an unmapped page

3.1 Paging hardware


RISC-V instructions (both user and kernel) manipulate virtual addresses.
Physical memory (RAM) = physical addresses (phy@ in the notes).
Virtual memory = fake memory addresses (virt@ in the notes).
RISC-V page table hardware maps each virt@ to a phy@.

XV6 = Sv39 RISC-V = only bottom 39 bits for virt@ (top 25 bits are not used).
2^39 @ / 2^12 bytes per page = 2^27 pages = 2^27 page table entries (PTEs)
a PTE = 44 bits of physical page number (PPN) + flags

vocabulary, acronyms, et caetera:
- physical/virtual addresses = phy@/virt@
- physical address (what the computer can address) = 2^56 = 65536 TiB
- virtual address (what a single process can address) = 2^39 = 512 GiB
- Page Table Entry = PTE
- Physical Page Number = PPN
- Page Directory = the page table is a tree with 3 levels of "small page tables" called page directories
- Translation Look-aside Buffer = TLB
- Supervisor Address Translation and Protection = satp (it's a register)
- address space = set of valid virtual addresses in a given page table
                = the kernel also has its own address space
- User memory = its address space + physical memory allowed by the page table
- Virtual memory = ideas and techniques associated with managing page tables
                   used to achieve isolation & such
- direct mapping = virt@ == phy@
- KVM/UVM = Kernel/User virtual memory
- a page table = 512 PTEs of 8 bytes each = fits exactly in 1 memory page

virt@ = [ 25-bit EXT ; 27-bit index ; 12-bit offset ]
        ↑64          ↑39            ↑12             ↑0
                              index = index to the PPN in the page table
page table = 2^27 entries
page table entry = [ 44-bit PPN ; 10-bit flags ]
                   ↑54          ↑10            ↑0
phy@ = [ 44-bit PPN (indexed by virt@ index) ; 12-bit virt@ offset ]
       ↑56                                   ↑12                   ↑0

virt@ = 39 (usable) bits, phy@ = 56 bits

Paging hardware translates virt@ with its top 27 of the 39 bits to find a PTE
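
A minimal sketch of the PTE and phy@ layouts above, in plain user-space C.
The macro names follow the ones I remember from kernel/riscv.h (PGSHIFT,
PA2PTE, PTE2PA, PTE_FLAGS); `pte_to_pa` is a hypothetical helper that only
exists in these notes.

```c
#include <stdint.h>

#define PGSHIFT 12                     // 12-bit offset inside a page
#define PGSIZE  (1ULL << PGSHIFT)      // 4096 bytes

// PTE layout: [ 44-bit PPN ; 10-bit flags ]
#define PA2PTE(pa)     ((((uint64_t)(pa)) >> PGSHIFT) << 10)
#define PTE2PA(pte)    ((((uint64_t)(pte)) >> 10) << PGSHIFT)
#define PTE_FLAGS(pte) ((pte) & 0x3FF)

// Hypothetical helper: build a phy@ from a leaf PTE and a virt@.
// The PPN comes from the PTE, the 12-bit offset is copied from the virt@.
uint64_t pte_to_pa(uint64_t pte, uint64_t va)
{
  uint64_t offset = va & (PGSIZE - 1);
  return PTE2PA(pte) | offset;
}
```

For example, a PTE built from phy@ 0x80001000 combined with a virt@ whose
low 12 bits are 0xabc gives phy@ 0x80001abc.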

IN REAL LIFE the page table is a tree with 3 levels of small tables; the
27-bit index is split into 3*9 bits, one 9-bit index per level.
These 3 levels can be called "directories":
1. a "root" (a 4096-byte page table)
   contains 512 PTEs which contain phy@ for next level directory
2. a "middle" (idem)
3. a "final"
In Sv48, there is an extra page table before "root" which takes bits 39
through 47 of a virt@.
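
A small sketch of how the 27-bit index is cut into three 9-bit indexes.
The PX/PXSHIFT/PXMASK macros mirror kernel/riscv.h as far as I can tell;
`va_index` is a hypothetical wrapper added for these notes.

```c
#include <assert.h>
#include <stdint.h>

#define PGSHIFT 12
#define PXMASK  0x1FF                          // 9 bits
#define PXSHIFT(level) (PGSHIFT + 9 * (level))
#define PX(level, va) ((((uint64_t)(va)) >> PXSHIFT(level)) & PXMASK)

// Index used at a given level: 2 = "root", 1 = "middle", 0 = "final".
unsigned va_index(uint64_t va, int level)
{
  assert(level >= 0 && level <= 2);
  return (unsigned)PX(level, va);
}
```

Level 2 uses bits 30 to 38 of the virt@, level 1 uses bits 21 to 29 and
level 0 uses bits 12 to 20; the low 12 bits are the offset and are never
used as an index.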

If any of the 3 PTEs needed for a translation is not valid, the paging
hardware raises a "page-fault exception" (exceptions are explained in chapter 4).

A RISC-V CPU caches page table entries in a Translation Look-aside Buffer
(TLB) to avoid costly loads of PTEs from physical memory.

PTE flags:
- PTE_V valid (whether the PTE is present or not)
- PTE_U user mode (if not set, the page can only be used in supervisor mode)
- PTE_R read: can the instructions read the page?
- PTE_W write: can the instructions write the page?
- PTE_X execute: can the CPU interpret the page's content as instructions and execute them?
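
A sketch of the flag bits and of the kind of check the hardware performs
for user code. The bit positions are the ones from kernel/riscv.h as I
remember them; `user_can_access` is a hypothetical helper, not an xv6
function.

```c
#include <stdint.h>

#define PTE_V (1ULL << 0)   // valid
#define PTE_R (1ULL << 1)   // readable
#define PTE_W (1ULL << 2)   // writable
#define PTE_X (1ULL << 3)   // executable
#define PTE_U (1ULL << 4)   // usable from user mode

// Hypothetical helper: may user-mode code perform `want`
// (any mix of PTE_R / PTE_W / PTE_X) on the page described by pte?
int user_can_access(uint64_t pte, uint64_t want)
{
  if (!(pte & PTE_V))
    return 0;               // no mapping at all
  if (!(pte & PTE_U))
    return 0;               // supervisor-only page
  return (pte & want) == want;
}
```

A user text page would typically carry PTE_V | PTE_U | PTE_R | PTE_X, so a
read or an instruction fetch passes the check while a write does not.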

satp register = where to put the phy@ of the root page table to be used by the CPU.
Once satp is set, subsequent instructions are interpreted with the provided page table.
Before satp is set, instructions use phy@ directly.
Each CPU has its own satp, so different CPUs can run different processes, each with its own page table.
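
A sketch of the value that actually goes into satp. As far as I can tell
from kernel/riscv.h, xv6 does not write the raw phy@: MAKE_SATP stores the
Sv39 mode in the top bits and the root page's PPN (phy@ >> 12) in the low
bits. `make_satp` is a hypothetical function form of that macro.

```c
#include <stdint.h>

#define SATP_SV39 (8ULL << 60)   // mode field = 8 selects Sv39

// Hypothetical function form of xv6's MAKE_SATP macro:
// mode in the top 4 bits, physical page number of the root table below.
uint64_t make_satp(uint64_t root_pa)
{
  return SATP_SV39 | (root_pa >> 12);
}
```

With a root page table at phy@ 0x87fff000, the value written to satp would
be 0x8000000000087fff.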

XV6 = 1 page table per process and 1 page table for the kernel
XV6 kernel page table:
- direct mapping for most pages
- no direct mapping for the trampoline page
- no direct mapping for stacks' pages
  => these pages are related to the processes (`kstack` in the `proc` structure)
  => each kstack has an invalid guard page (PTE_V not set) just below it,
     so a stack overflow faults instead of silently corrupting memory

Virtual memory functions:
  - walk: find the PTE for a virt@ (can allocate the page-table pages needed to reach it)
  - mappages: install PTEs for new mappings
  - kvm_* = kernel virtual memory functions (kernel page table)
  - uvm_* = same but for a user process
  - copyin = copy data from user space to kernel space
  - copyout = copy data from kernel space to user space
  - copyinstr = copy a null-terminated string from user to kernel
              = used for paths given in syscalls for example
  - kvminit sets the root kernel page table created by kvmmake
  - kvmmake creates a direct-map page table for the kernel
    1. create the root kernel page table with a call to `kalloc`
       (kalloc provides a pointer to a page, which is the type `pagetable_t`)
    2. call kvmmap (a wrapper around `mappages` that panics on error) multiple times to set a few direct-map pages
       kvmmap adds mappings to the kernel page table (when booting only); it doesn't flush the TLB or enable paging
       mapped stuff:
         uart registers, virtio mmio disk interface, PLIC, kernel text and data
         and trampoline (for trap entry/exit) is mapped to the highest virtual address in the kernel
    3. proc_mapstacks allocates a kernel stack for each process in the `proc` (static) array of processes in `proc.c`
       each kernel stack page is placed below the TRAMPOLINE page, with an invalid guard page below each stack

       side note: the function is complex for no reason, it uses pointer arithmetic just to get the process index (0 to NPROC-1)

       TRAMPOLINE is a macro giving the virt@ of the trampoline page (MAXVA - PGSIZE)
       KSTACK(p) is a macro to place a kernel stack page below the trampoline page with a guard page
       KSTACK(p) = (TRAMPOLINE - ((p)+1)* 2*PGSIZE)

       TRAMPOLINE & KSTACK are macros in memlayout.h and are used in proc_mapstacks
  - main calls kvminithart which sets satp to the kernel root page table so the CPU can start using it
  - each CPU caches PTEs in a Translation Look-aside Buffer (TLB)
    => when xv6 changes a page table it must tell the CPU to invalidate the corresponding cached entries in the TLB
    => RISC-V has an `sfence.vma` instruction to flush the current CPU's TLB
       => `kvminithart` uses it after initializing satp
       => `sfence.vma` is also called before setting satp
           "to ensure that preceding updates to the page table have completed,
           and ensures that preceding loads and stores use the old page table,
           not the new one"
       => the trampoline code uses it when switching to a user page table, before returning to user space
       => RISC-V CPUs may support address space identifiers (ASIDs) to tag TLB entries per address space
          => avoids flushing the entire TLB
          => xv6 doesn't use this feature
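
A sketch of the kernel stack layout computed by the TRAMPOLINE and KSTACK
macros mentioned in the list above. MAXVA is written out here as 2^38,
which is what I believe kernel/riscv.h defines; `kstack_guard` is a
hypothetical helper naming the unmapped page below a stack.

```c
#include <stdint.h>

#define PGSIZE     4096ULL
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))   // one bit less than Sv39 allows
#define TRAMPOLINE (MAXVA - PGSIZE)                 // highest page of the address space
#define KSTACK(p)  (TRAMPOLINE - ((uint64_t)(p) + 1) * 2 * PGSIZE)

// Hypothetical helper: virt@ of the unmapped guard page just below stack p.
uint64_t kstack_guard(int p)
{
  return KSTACK(p) - PGSIZE;
}
```

Each process index consumes two pages of virtual address space: one mapped
stack page and one unmapped page. Stack 0 sits at 0x3fffffd000, its guard
at 0x3fffffc000, and stack 1 at 0x3fffffb000.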

Physical memory management consists of handling memory pages from the `kmem.freelist` linked list.
An allocation removes a page from this list;
freeing a page adds it back to the list.
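
A user-space sketch of the free-list idea, assuming a tiny static pool
instead of real RAM. `struct run` is the name xv6 uses for list nodes; the
pool, `pool_init`, `page_alloc` and `page_free` are hypothetical stand-ins
for kinit/kalloc/kfree, and the locking done with `kmem.lock` is left out.

```c
#include <stddef.h>
#include <string.h>

#define PGSIZE 4096
#define NPAGES 4                       // hypothetical tiny pool

struct run { struct run *next; };      // a free page stores the link in itself

union page { struct run run; unsigned char bytes[PGSIZE]; };

union page pool[NPAGES];
struct run *freelist;

// Free a page: fill it with junk, then push it on the list.
void page_free(void *pa)
{
  struct run *r = (struct run *)pa;
  memset(pa, 1, PGSIZE);               // junk helps catch dangling references
  r->next = freelist;
  freelist = r;
}

// Allocate a page: pop the head of the list, or NULL when empty.
void *page_alloc(void)
{
  struct run *r = freelist;
  if (r) {
    freelist = r->next;
    memset(r, 5, PGSIZE);              // junk again, so callers must initialize
  }
  return r;
}

// Put every page of the pool on the free list.
void pool_init(void)
{
  freelist = NULL;
  for (int i = 0; i < NPAGES; i++)
    page_free(&pool[i]);
}
```

Because pages are pushed and popped at the head, the most recently freed
page is the next one handed out.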

sbrk is implemented with the function `growproc` (in proc.c) which calls either uvmalloc or uvmdealloc.

Function signatures (for reference):
  void kvminit(void); // set the kernel root page table
  pagetable_t kvmmake(void); // create the kernel page table
  void kvmmap(pagetable_t, uint64 virt@, uint64 phy@, uint64 sz, int perm); // add PTEs to the kernel page table
   => `kvmmap` is a thin wrapper around `mappages` (calls `panic` if an error occurs)
  int mappages(pagetable_t, uint64 virt@, uint64 size, uint64 phy@, int perm); // create PTEs
   => `kvmmap` is a thin wrapper around this function (calls `panic` if an error occurs)
   => uses walk to find a PTE based on a virt@ (`walk` can also allocate a page in a page table)
  pte_t * walk(pagetable_t pagetable, uint64 va, int alloc); // virt@ -> PTE (can allocate it if 'alloc' is set)
   => returns the address of the PTE in the lowest level of the tree
  uint64 walkaddr(pagetable_t pagetable, uint64 va); // virt@ -> phy@
  void proc_mapstacks(pagetable_t kpgtbl); // allocate and map a kernel stack for each process
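
A simplified user-space sketch of what `walk` does, to make the 3-level
descent concrete. Assumptions: host heap pointers play the role of phy@
(fine on common 64-bit hosts where page-aligned pointers fit in 56 bits),
`alloc_table` stands in for `kalloc`, and the `va >= MAXVA` panic of the
real function is omitted. `walk_sketch` and `alloc_table` are hypothetical
names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PGSIZE  4096
#define PGSHIFT 12
#define PTE_V   (1ULL << 0)
#define PX(level, va) ((((uint64_t)(va)) >> (PGSHIFT + 9 * (level))) & 0x1FF)
#define PA2PTE(pa)  ((((uint64_t)(pa)) >> 12) << 10)
#define PTE2PA(pte) ((((uint64_t)(pte)) >> 10) << 12)

typedef uint64_t pte_t;
typedef uint64_t *pagetable_t;          // one page holding 512 PTEs

// Stand-in for kalloc: one zeroed, page-aligned page from the host heap.
pagetable_t alloc_table(void)
{
  void *p = aligned_alloc(PGSIZE, PGSIZE);
  if (p)
    memset(p, 0, PGSIZE);
  return p;
}

// Return the address of the level-0 PTE for va, creating the
// intermediate tables on the way down when alloc is non-zero.
pte_t *walk_sketch(pagetable_t pagetable, uint64_t va, int alloc)
{
  for (int level = 2; level > 0; level--) {
    pte_t *pte = &pagetable[PX(level, va)];
    if (*pte & PTE_V) {
      // Follow the PPN stored in the PTE to the next-level table.
      pagetable = (pagetable_t)(uintptr_t)PTE2PA(*pte);
    } else {
      if (!alloc || (pagetable = alloc_table()) == NULL)
        return NULL;
      *pte = PA2PTE((uintptr_t)pagetable) | PTE_V;
    }
  }
  return &pagetable[PX(0, va)];
}
```

With alloc set to 0 the function only looks up existing mappings and
returns NULL when an intermediate table is missing; with alloc set to 1 it
builds the missing levels, which is how a caller like `mappages` gets a
PTE slot to fill.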


3.8 Code: exec

exec = syscall replacing a process's user address space with data read from a file
     = related files: kernel/{elf.h,exec.c}

  1. open a file with `namei`
  2. read ELF header (see kernel/elf.h for more info) matching structure `elfhdr`
  3. read the subsequent ELF program section headers, each matching the `proghdr` structure
     each of them describes a part of the application that must be loaded into memory
     xv6 programs only have two program section headers: instructions and data

How exec works

  1. check if the file actually is an ELF binary
  2. allocate a new page table with no user mapping with `proc_pagetable` (kernel/exec.c:49)
  3. allocate memory for each ELF segment with `uvmalloc` (kernel/exec.c:65)
  4. load each segment with `loadseg` (kernel/exec.c:10)
     `loadseg` uses `readi` to read from the file
     `loadseg` uses `walkaddr` to find the phy@ of the allocated memory at which to write ELF segments
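
A sketch of step 1 of the list above (checking that the file is an ELF
binary). ELF_MAGIC has the value I remember from kernel/elf.h;
`looks_like_elf` is a hypothetical helper that works on a byte buffer
instead of an inode, and it assembles the little-endian word by hand so
the result does not depend on the host's byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define ELF_MAGIC 0x464C457FU   // "\x7FELF" read as a little-endian 32-bit word

// Returns 1 if buf starts with the ELF magic number, 0 otherwise.
int looks_like_elf(const unsigned char *buf, size_t n)
{
  if (n < 4)
    return 0;                   // too short to hold the magic
  uint32_t magic = (uint32_t)buf[0]
                 | (uint32_t)buf[1] << 8
                 | (uint32_t)buf[2] << 16
                 | (uint32_t)buf[3] << 24;
  return magic == ELF_MAGIC;
}
```

A file that fails this check (a shell script, for instance) is rejected
before any page table is allocated.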

To list the program headers of a binary: objdump -p file

DEVICE TREE (devicetree, DT or DTB for device tree blob or even Flattened Devicetree Blob)
  => file representing the hardware so the kernel can initialize devices and provide them to the users.

Before DTs:
  hardware was either hardcoded (with each system-on-chip being listed and their devices enumerated) or
  detected with A LOT OF platform-specific features such as BIOS, EFI, ACPI, etc.

DTs are an improvement over ACPI since ACPI isn't properly standardized and implementations are often buggy.

IN QEMU: this file can be created by QEMU and loaded by a bootloader somewhere in RAM, or provided to QEMU.
IN REAL LIFE: this file may be generated statically then loaded by the
bootloader (or EFI?) in RAM then the address is provided to the kernel.
Real kernels will get their configuration from different sources anyway.