Page tables enable each process to have its own private memory space.
Memory is sliced into small pages (PGSIZE, 4 KiB) so it can be
distributed to many processes without much fragmentation. Page tables also
enable a few xv6 tricks:

- mapping the same memory (a trampoline page) in several address spaces
- guarding kernel and user stacks with an unmapped page

3.1 Paging hardware


RISC-V instructions (both user and kernel) manipulate virtual addresses.
Physical memory (RAM) = physical addresses (phy@ in the notes).
Virtual memory = fake memory addresses (virt@ in the notes).
RISC-V page table hardware maps each virt@ to a phy@.

XV6 = Sv39 RISC-V = only the bottom 39 bits of a virt@ are used (the top 25 bits are not).
2^39 @ / 2^12 bytes per page = 2^27 page table entries (PTEs)
a PTE = 44 bits of physical page number (PPN) + 10 bits of flags

vocabulary, acronyms, et cetera:
- physical/virtual addresses = phy@/virt@
- physical address (what the computer can address) = 2^56 = 65536 TiB
- virtual address (what a single process can address) = 2^39 = 512 GiB
- Page Table Entry = PTE
- Physical Page Number = PPN
- Page Directory = one level of the page table; the page table is split into 3 levels of "small page tables" called page directories
- Translation Look-aside Buffer = TLB
- Supervisor Address Translation and Protection = satp (it's a register)
- address space = set of valid virtual addresses in a given page table
                = the kernel also has its own address space
- User memory = its address space + physical memory allowed by the page table
- Virtual memory = ideas and techniques associated with managing page tables
                   used to achieve isolation & such
- direct mapping = virt@ == phy@
- KVM/UVM = Kernel/User virtual memory
- a page table = 512 PTEs of 8 bytes each = fits exactly in 1 memory page (4096 bytes)

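The sizes quoted in the vocabulary can be double-checked with a few lines of C. This is only a sanity-check sketch: `addressable_bytes` and `page_count` are made-up helper names, not xv6 functions.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (1ULL << 10)
#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/* Number of bytes addressable with `bits` address bits (hypothetical helper). */
static uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);
    return 1ULL << bits;
}

/* Number of pages (hence leaf PTEs) needed to cover `bits` address bits. */
static uint64_t page_count(unsigned bits, uint64_t pgsize)
{
    assert(pgsize != 0);
    return addressable_bytes(bits) / pgsize;
}
```

With these, 39 bits give 512 GiB and 2^27 pages of 4 KiB, and 56 bits give 65536 TiB, matching the figures in the notes.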
virt@ = [ 25-bit EXT ; 27-bit index ; 12-bit offset ]
        ↑64          ↑39            ↑12             ↑0
index = index to the PPN in the page table
page table = 2^27 entries
page table entry = [ 44-bit PPN ; 10-bit flags ]
                   ↑54          ↑10            ↑0
phy@ = [ 44-bit PPN (indexed by virt@ index) ; 12-bit virt@ offset ]
       ↑56                                   ↑12                   ↑0

virt@ = 39 (usable) bits, phy@ = 56 bits

Paging hardware translates a virt@ by using the top 27 of its 39 bits to find a PTE,
then builds the phy@ from the PTE's 44-bit PPN and the virt@'s bottom 12 bits.

IN REAL LIFE the page table is split into 3 levels of small tables, and the
27-bit index is split into 3*9 bits used as indexes into these 3 levels.
These 3 levels can be called "directories":
  1. a "root" (a 4096-byte page table)
     contains 512 PTEs, each holding the phy@ of a next-level directory
  2. a "middle" (idem)
  3. a "final", whose PTEs hold the PPN of the mapped page
In Sv48, there is an extra page table level before "root" which takes bits 39
through 47 of a virt@.

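A minimal sketch of how a Sv39 virt@ can be cut into its three 9-bit indexes and its 12-bit offset. It runs in plain user space; `px`, `page_offset` and `make_pa` are illustrative names (xv6 has a similar `PX` macro, but this is not its code).

```c
#include <assert.h>
#include <stdint.h>

#define PGSHIFT 12      /* bits of offset within a page */
#define PXMASK  0x1FF   /* 9 bits of index per level */

/* Index into the directory at `level` (2 = root, 1 = middle, 0 = final). */
static uint64_t px(int level, uint64_t va)
{
    assert(level >= 0 && level <= 2);
    return (va >> (PGSHIFT + 9 * level)) & PXMASK;
}

/* Byte offset inside the 4 KiB page. */
static uint64_t page_offset(uint64_t va)
{
    return va & ((1ULL << PGSHIFT) - 1);
}

/* Build a phy@ from a 44-bit PPN and the virt@'s 12-bit offset. */
static uint64_t make_pa(uint64_t ppn, uint64_t va)
{
    return (ppn << PGSHIFT) | page_offset(va);
}
```

For example, a virt@ built as (3 << 30) | (5 << 21) | (7 << 12) | 0x123 yields the indexes 3, 5 and 7 for root, middle and final, and the offset 0x123.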
If any of the 3 PTEs needed to translate an address is not present, the paging
hardware raises a "page-fault exception" (exceptions are explained in chapter 4).

RISC-V CPUs cache page table entries in a Translation Look-aside Buffer
(TLB) to avoid costly loads of PTEs from physical memory.

PTE flags:
- PTE_V valid (is the PTE present or not)
- PTE_U user mode (if not set, the page can only be used in supervisor mode)
- PTE_R read: can instructions read the page?
- PTE_W write: can instructions write the page?
- PTE_X execute: can the CPU interpret the page's content as instructions and execute them?

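The flags above live in the low 10 bits of a PTE, with the PPN just above them. The sketch below uses the Sv39 bit positions (V=0, R=1, W=2, X=3, U=4); the function names `pa2pte`, `pte2pa` and `user_can_write` are illustrative and only loosely modelled on the macros in xv6's riscv.h.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pte_t;

#define PTE_V (1ULL << 0)  /* valid */
#define PTE_R (1ULL << 1)  /* readable */
#define PTE_W (1ULL << 2)  /* writable */
#define PTE_X (1ULL << 3)  /* executable */
#define PTE_U (1ULL << 4)  /* accessible from user mode */

/* Pack a page-aligned phy@ into the PPN field of a PTE (flags left at 0). */
static pte_t pa2pte(uint64_t pa)
{
    assert((pa & 0xFFF) == 0);
    return (pa >> 12) << 10;
}

/* Recover the page-aligned phy@ from a PTE, dropping the 10 flag bits. */
static uint64_t pte2pa(pte_t pte)
{
    return (pte >> 10) << 12;
}

/* A user-mode store needs V, U and W all set. */
static int user_can_write(pte_t pte)
{
    const pte_t need = PTE_V | PTE_U | PTE_W;
    return (pte & need) == need;
}
```

A PTE built with PTE_V | PTE_R | PTE_W but without PTE_U is therefore writable by the kernel only.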
satp register = where to put the phy@ of the root page table to be used by the CPU.
Once satp is set, subsequent instructions are translated with the provided page table.
Before satp is set, instructions use phy@ directly.
Each CPU has its own satp, so different CPUs can run different processes, each with its own page table.

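A sketch of what the value written to satp looks like in Sv39: the mode (8 for Sv39) sits in the top 4 bits and the PPN of the root page table in the low 44 bits. Only the value is computed here; actually writing the register needs a `csrw` instruction in supervisor mode. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_SV39 (8ULL << 60)   /* MODE field = 8 selects Sv39 */

/* Encode the satp value for a root page table located at phy@ `root_pa`. */
static uint64_t make_satp(uint64_t root_pa)
{
    assert((root_pa & 0xFFF) == 0);   /* the root table is one aligned page */
    return SATP_SV39 | (root_pa >> 12);
}

/* Decode the translation mode from a satp value. */
static unsigned satp_mode(uint64_t satp)
{
    return (unsigned)(satp >> 60);
}

/* Decode the phy@ of the root page table from a satp value. */
static uint64_t satp_root_pa(uint64_t satp)
{
    return (satp & ((1ULL << 44) - 1)) << 12;
}
```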
XV6 = 1 page table per process and 1 page table for the kernel
XV6 kernel page table:
- direct mapping for most pages
- no direct mapping for the trampoline page
- no direct mapping for the kernel stacks' pages
  => these pages are related to the processes (`kstack` in the `proc` structure)
  => each kstack has an invalid guard page (PTE_V not set) below it, so a
     stack overflow faults instead of silently corrupting memory

Virtual memory functions:
- walk: find the PTE for a virt@ (can allocate page-table pages on the way)
- mappages: install PTEs for new mappings
- kvm_* = kernel virtual memory functions (kernel page table)
- uvm_* = same but for a user process
- copyin = copy data from user space to kernel space
- copyout = copy data from kernel space to user space
- copyinstr = copy a null-terminated string from user to kernel
            = used for paths given in syscalls for example
- kvminit sets the root kernel page table created by kvmmake
- kvmmake creates a direct-map page table for the kernel
  1. create the root kernel page table with a call to `kalloc`
     (kalloc provides a pointer to a page, which is used as a `pagetable_t`)
  2. call kvmmap (a wrapper around `mappages` that panics on error) multiple times to set a few direct-map pages
     kvmmap adds mappings to the kernel page table (when booting only), doesn't flush the TLB or enable paging
     mapped stuff:
       uart registers, virtio mmio disk interface, PLIC, kernel text and data
       and the trampoline (for trap entry/exit), mapped at the highest virtual address in the kernel
  3. proc_mapstacks allocates a kernel stack for each process in the `proc` (static) array of processes in `proc.c`
     each kernel stack page is placed under the TRAMPOLINE kernel page, with an invalid guard page below it

     side note: the function is more complex than it needs to be, it uses pointer arithmetic just to get an index into `proc` (0 to NPROC-1)

     TRAMPOLINE is a macro giving the virt@ of the trampoline page (MAXVA - PGSIZE)
     KSTACK(p) is a macro to place a kernel stack page below the trampoline page with a guard page
     KSTACK(p) = (TRAMPOLINE - ((p)+1)* 2*PGSIZE)

     TRAMPOLINE & KSTACK are macros in memlayout.h and are used in proc_mapstacks
- main calls kvminithart which sets satp to the kernel root page table so the CPU can start using it
- each CPU caches PTEs in a Translation Look-aside Buffer (TLB)
  => when xv6 changes a page table it must tell the CPU to invalidate the cached entries in the TLB
  => RISC-V has an `sfence.vma` instruction to flush the current CPU's TLB
  => `kvminithart` uses it after setting satp
  => `sfence.vma` is also executed before setting satp
     "to ensure that preceding updates to the page table have completed,
     and ensures that preceding loads and stores use the old page table,
     not the new one"
  => the trampoline code uses it when switching to a user page table, before entering user space
  => RISC-V CPUs may support address space identifiers (ASIDs) so the TLB can hold entries for different address spaces
     => avoids flushing the entire TLB
     => xv6 doesn't use this feature

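The KSTACK formula from the list above can be checked numerically. The sketch assumes MAXVA = 2^38 (one bit less than the 39 usable bits, which is how xv6's riscv.h defines it, but it is not stated in these notes); `kstack_guard` is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE     4096ULL
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))   /* assumption: 2^38 */
#define TRAMPOLINE (MAXVA - PGSIZE)
#define KSTACK(p)  (TRAMPOLINE - ((uint64_t)(p) + 1) * 2 * PGSIZE)

/* Lowest address of the unmapped guard page sitting just below stack `p`. */
static uint64_t kstack_guard(int p)
{
    assert(p >= 0);
    return KSTACK(p) - PGSIZE;
}
```

Each process index consumes two pages of virt@ (one mapped stack page, one unmapped guard), so the guard page below stack 0 is exactly the page just above stack 1.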
Physical memory management consists of handling memory pages from the `kmem.freelist` linked list.
An allocation is materialized by removing a page from the head of this list,
freeing a page is about adding the page back to the list.

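A user-space sketch of such a free list, where each free page stores the pointer to the next free page in its own first bytes. It deliberately leaves out what a real kernel allocator also needs (a lock, alignment and range checks, filling pages with junk); the `_sketch` names are made up to avoid suggesting this is xv6's code.

```c
#include <assert.h>
#include <stddef.h>

#define PGSIZE 4096

/* Header stored at the start of every free page. */
struct run {
    struct run *next;
};

static struct {
    struct run *freelist;   /* head of the list of free pages */
} kmem;

/* Free a page: push it on the head of the list. */
static void kfree_sketch(void *pa)
{
    struct run *r = pa;
    assert(pa != NULL);
    r->next = kmem.freelist;
    kmem.freelist = r;
}

/* Allocate a page: pop the head of the list, or return NULL when it is empty. */
static void *kalloc_sketch(void)
{
    struct run *r = kmem.freelist;
    if (r)
        kmem.freelist = r->next;
    return r;
}
```

Pages come back in last-freed, first-allocated order, which is all a page allocator needs since every page is interchangeable.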
sbrk is implemented with the function `growproc` (in proc.c) which calls either uvmalloc or uvmdealloc,
depending on whether the process grows or shrinks.

Function signatures (for reference):
void kvminit(void); // set the kernel root page table
pagetable_t kvmmake(void); // create the kernel page table
void kvmmap(pagetable_t, uint64 virt@, uint64 phy@, uint64 sz, int perm); // add PTEs to the kernel page table
  => `kvmmap` is a simple wrapper around `mappages` (automatically calls `panic` if an error occurs)
int mappages(pagetable_t, uint64 virt@, uint64 size, uint64 phy@, int perm); // create PTEs
  => wrapped by `kvmmap` for kernel mappings (see above)
  => uses walk to find a PTE based on a virt@ (`walk` can also allocate page-table pages)
pte_t * walk(pagetable_t pagetable, uint64 va, int alloc); // virt@ -> PTE (allocates missing page-table pages if 'alloc' is set)
  => returns the address of the PTE in the lowest layer of the tree
uint64 walkaddr(pagetable_t pagetable, uint64 va); // virt@ -> phy@
void proc_mapstacks(pagetable_t kpgtbl); // allocate a kernel stack for each process


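To make the behaviour of `walk` concrete, here is a user-space simulation of a three-level walk. It is a sketch under assumptions: host pointers returned by `aligned_alloc` stand in for phy@, `alloc_page` replaces `kalloc`, and an out-of-range virt@ trips an `assert` instead of a kernel panic. The `_sketch` suffix marks it as illustrative rather than the real xv6 function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t pte_t;
typedef pte_t *pagetable_t;   /* one page holding 512 PTEs */

#define PGSIZE 4096
#define PTE_V  1ULL
#define PX(level, va) (((va) >> (12 + 9 * (level))) & 0x1FF)
#define PA2PTE(pa)    ((((uint64_t)(pa)) >> 12) << 10)
#define PTE2PA(pte)   (((pte) >> 10) << 12)

/* Stand-in for kalloc: one zeroed, page-aligned page (NULL on failure). */
static pagetable_t alloc_page(void)
{
    void *p = aligned_alloc(PGSIZE, PGSIZE);
    if (p)
        memset(p, 0, PGSIZE);
    return p;
}

/* Return the address of the final-level PTE for `va`, or NULL if a
 * directory is missing and `alloc` is 0 (or allocation fails). */
static pte_t *walk_sketch(pagetable_t pagetable, uint64_t va, int alloc)
{
    assert(va < (1ULL << 38));
    for (int level = 2; level > 0; level--) {
        pte_t *pte = &pagetable[PX(level, va)];
        if (*pte & PTE_V) {
            /* Directory exists: follow it down one level. */
            pagetable = (pagetable_t)(uintptr_t)PTE2PA(*pte);
        } else {
            if (!alloc || (pagetable = alloc_page()) == NULL)
                return NULL;
            /* Link the freshly allocated directory into its parent. */
            *pte = PA2PTE((uintptr_t)pagetable) | PTE_V;
        }
    }
    return &pagetable[PX(0, va)];
}
```

Note that the walk only allocates the intermediate directories; marking the returned PTE valid and filling in its PPN and flags is the caller's job (what `mappages` does in the notes above).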
3.8 Code: exec

exec = syscall replacing a process's user address space with data read from a file
     = related files: kernel/{elf.h,exec.c}

1. open a file with `namei`
2. read the ELF header (see kernel/elf.h for more info), matching the structure `elfhdr`
3. read the subsequent ELF program headers, each matching the `proghdr` structure
   each of them describes a part of the application that must be loaded into memory
   xv6 programs only have two program headers: instructions and data

How exec works

1. check if the file actually is an ELF binary
2. allocate a new page table with no user mappings with `proc_pagetable` (kernel/exec.c:49)
3. allocate memory for each ELF segment with `uvmalloc` (kernel/exec.c:65)
4. load each segment with `loadseg` (kernel/exec.c:10)
   `loadseg` uses `readi` to read from the file
   `loadseg` uses `walkaddr` to find the phy@ of the allocated memory at which to write ELF segments

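Step 1 boils down to comparing the first four bytes of the file with the ELF magic number (0x7F followed by "ELF"). A minimal user-space sketch, where `looks_like_elf` is a made-up name and the buffer stands for the first bytes read from the file:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ELF_MAGIC 0x464C457FU   /* "\x7F" "ELF" read as a little-endian 32-bit word */

/* Return 1 if the buffer starts with the ELF magic number, 0 otherwise. */
static int looks_like_elf(const unsigned char *buf, size_t len)
{
    uint32_t magic;

    assert(buf != NULL);
    if (len < 4)
        return 0;
    /* Assemble the word byte by byte so the check does not depend on host endianness. */
    magic = (uint32_t)buf[0]
          | ((uint32_t)buf[1] << 8)
          | ((uint32_t)buf[2] << 16)
          | ((uint32_t)buf[3] << 24);
    return magic == ELF_MAGIC;
}
```

A file starting with "#!/b" (a shell script) or a file shorter than 4 bytes is rejected.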
To read the program headers of a binary: objdump -p file

DEVICE TREE (devicetree, DT, or DTB for Device Tree Blob / Flattened Devicetree Blob)
=> file describing the hardware so the kernel can initialize devices and provide them to the users.

Before DTs:
hardware was either hardcoded (with each system-on-chip being listed and its devices enumerated) or
detected with A LOT OF platform-specific features such as BIOS, EFI, ACPI, etc.

DTs are an improvement over ACPI since ACPI isn't properly standardized and implementations are often buggy.

IN QEMU: this file can be generated by QEMU itself or provided to QEMU, and is loaded by a bootloader somewhere in RAM.
IN REAL LIFE: this file may be generated statically then loaded by the
bootloader (or EFI?) in RAM, then its address is provided to the kernel.
Real kernels will get their configuration from different sources anyway.