<html>
<head>
<title>L7</title>
</head>
<body>

<h1>Locking</h1>

<p>Required reading: spinlock.c

<h2>Why coordinate?</h2>

<p>Mutual-exclusion coordination is an important topic in operating
systems, because many operating systems run on multiprocessors.
Coordination techniques protect variables that are shared among
multiple threads and updated concurrently.  These techniques allow
programmers to implement atomic sections so that one thread can
safely update the shared variables without having to worry that
another thread will intervene.  For example, processes in xv6 may run
concurrently on different processors and, in kernel mode, share
kernel data structures.  We must ensure that these updates happen
correctly.

<p>List and insert example:
<pre>
#include &lt;stdlib.h&gt;

struct List {
  int data;
  struct List *next;
};

struct List *list = 0;

void
insert(int data)
{
  struct List *l = malloc(sizeof(struct List));
  l->data = data;
  l->next = list;  // A
  list = l;        // B
}
</pre>

<p>What needs to be atomic? The two statements labeled A and B should
always be executed together, as an indivisible fragment of code.  If
two processors execute A and B interleaved, then we end up with an
incorrect list.  To see that this is the case, draw out the list after
the sequence A1 (statement A executed by processor 1), A2 (statement A
executed by processor 2), B2, and B1.
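
<p>For instance, here is the A1, A2, B2, B1 interleaving written out
(l1 and l2 are illustrative names for each processor's new node):

<pre>
// initially: list = 0
A1: l1->next = list;   // processor 1: l1->next = 0
A2: l2->next = list;   // processor 2: l2->next = 0
B2: list = l2;         // list points at l2
B1: list = l1;         // list points at l1; l2 is lost!
</pre>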

<p>How could this erroneous sequence happen? The variable <i>list</i>
lives in physical memory shared among multiple processors, connected
by a bus.  The accesses to the shared memory will be ordered in some
total order by the bus/memory system.  If the programmer doesn't
coordinate the execution of the statements A and B, any order can
happen, including the erroneous one.

<p>The erroneous case is called a race condition.  The problem with
races is that they are difficult to reproduce.  For example, if you
put print statements in to debug the incorrect behavior, you might
change the timing and the race might not happen anymore.

<h2>Atomic instructions</h2>

<p>The programmer must be able to express that A and B should be
executed as a single atomic unit.  We generally use a concept like
locks to mark an atomic region, acquiring the lock at the beginning
of the section and releasing it at the end:

<pre>
void acquire(int *lock) {
  while (TSL(lock) != 0)
    ;
}

void release(int *lock) {
  *lock = 0;
}
</pre>

<p>Acquire and release, of course, need to be atomic too, which can,
for example, be done with a hardware atomic TSL (test-and-set lock)
instruction.

<p>The semantics of TSL are:
<pre>
   R <- [mem]   // load content of mem into register R
   [mem] <- 1   // store 1 in mem
</pre>

<p>In a hardware implementation, the bus arbiter guarantees that both
the load and the store are executed without any other loads or stores
coming in between.
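
<p>For comparison, here is a minimal sketch of the same spin lock in
portable C11 (an illustration only; xv6 itself uses inline assembly,
shown below):

<pre>
#include &lt;stdatomic.h&gt;

atomic_int lock = 0;

void acquire(atomic_int *l) {
  // atomic_exchange is the C11 analogue of TSL: it atomically
  // stores 1 and returns the previous value.
  while (atomic_exchange(l, 1) != 0)
    ;  // spin until the previous value was 0 (lock was free)
}

void release(atomic_int *l) {
  // an atomic store releases the lock; C11's default sequentially
  // consistent ordering also provides the barrier discussed below
  atomic_store(l, 0);
}
</pre>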

<p>We can use locks to implement an atomic insert, or we can use
TSL directly:
<pre>
int insert_lock = 0;

void
insert(int data)
{
  /* acquire the lock: */
  while (TSL(&insert_lock) != 0)
    ;

  /* critical section: */
  struct List *l = malloc(sizeof(struct List));
  l->data = data;
  l->next = list;
  list = l;

  /* release the lock: */
  insert_lock = 0;
}
</pre>

<p>It is the programmer's job to make sure that locks are respected.
If a programmer writes another function that manipulates the list,
the programmer must make sure that the new function acquires and
releases the appropriate locks.  If the programmer doesn't, race
conditions occur.

<p>This code assumes that stores commit to memory in program order and
that all stores by other processors started before insert got the lock
are observable by this processor.  That is, after the other processor
released a lock, all its previous stores are committed to memory.  If
a processor executes instructions out of order, this assumption won't
hold, and we must, for example, insert a barrier instruction that
makes the assumption true.
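
<p>For example, on GCC one could use the full-barrier builtin when
releasing (a sketch; xv6's actual approach appears in spinlock.c):

<pre>
void
release(int *lock)
{
  // make the critical section's stores visible to other processors
  // before the lock appears free
  __sync_synchronize();   // full memory barrier (GCC builtin)
  *lock = 0;
}
</pre>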

<h2>Example: Locking on x86</h2>

<p>Here is one way we can implement acquire and release using the x86
xchgl instruction:

<pre>
struct Lock {
  int locked;
};

int TSL(int *addr);

void
acquire(struct Lock *lck)
{
  while (TSL(&lck->locked) != 0)
    ;
}

void
release(struct Lock *lck)
{
  lck->locked = 0;
}

int
TSL(int *addr)
{
  register int content = 1;
  // xchgl content, *addr
  // xchgl exchanges the values of its two operands, while
  // locking the memory bus to exclude other operations.
  asm volatile ("xchgl %0,%1" :
                "=r" (content),
                "=m" (*addr) :
                "0" (content),
                "m" (*addr));
  return content;
}
</pre>

<p>The instruction "XCHG %eax, (content)" works as follows:
<ol>
<li> freeze other CPUs' memory activity
<li> temp := content
<li> content := %eax
<li> %eax := temp
<li> un-freeze other CPUs
</ol>

<p>Steps 1 and 5 are what make XCHG special: it is "locked", asserting
special signal lines on the inter-CPU bus so that bus arbitration
excludes other memory operations.

<p>This implementation doesn't scale to a large number of processors;
in a later lecture we will see how we could do better.

<h2>Lock granularity</h2>

<p>Acquire/release is ideal for short atomic sections: incrementing a
counter, searching the i-node cache, allocating a free buffer.

<p>What are spin locks not so great for?  Long atomic sections may
waste waiters' CPU time, and it is a bad idea to sleep while holding
locks.  In xv6 we try to avoid long atomic sections by careful coding
(can you find an example?).  xv6 doesn't release the processor when
holding a lock, but has an additional set of coordination primitives
(sleep and wakeup), which we will study later.

<p>A single list_lock protects all lists; inserts to different lists
block each other unnecessarily.  A lock per list would waste less time
spinning, so you might want "fine-grained" locks, one for every
object, BUT acquire/release are expensive (about 500 cycles on a
3 GHz machine) because they need to talk off-chip.
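
<p>For illustration, here is a hypothetical per-list variant, reusing
the Lock, acquire, and release defined above (the names are made up
for this sketch):

<pre>
struct Node {
  int data;
  struct Node *next;
};

struct List {
  struct Lock lock;   // one lock per list: finer granularity
  struct Node *head;
};

void
insert(struct List *lst, int data)
{
  struct Node *n = malloc(sizeof(struct Node));
  n->data = data;
  acquire(&lst->lock);      // blocks only inserts to this list
  n->next = lst->head;
  lst->head = n;
  release(&lst->lock);      // inserts to other lists run in parallel
}
</pre>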

<p>Also, "correctness" is not that simple with fine-grained locks if
you need to maintain global invariants; e.g., "every buffer must be on
exactly one of the free list and the device list".  Per-list locks are
irrelevant for this invariant.  So you might want "large-grained"
locks, which reduce overhead but also reduce concurrency.

<p>This tension is hard to get right.  One often starts out with
"large-grained" locks and measures the performance of the system on
some workloads.  When more concurrency is desired (to get better
performance), an implementor may switch to a more fine-grained
scheme.  Operating system designers fiddle with this all the time.

<h2>Recursive locks and modularity</h2>

<p>When designing a system we desire clean abstractions and good
modularity.  We would like a caller not to have to know how a callee
implements a particular function.  Locks make achieving modularity
more complicated.  For example, what should we do when the caller
holds a lock, then calls a function which also needs the lock to
perform its job?
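
<p>A tiny hypothetical example of the problem, building on the
per-list sketch above (with the spin lock as written, the inner
acquire spins forever; a recursive lock would let it proceed):

<pre>
void
insert_unique(struct List *lst, int data)
{
  acquire(&lst->lock);
  // ... walk the list and decide the element is not present ...
  insert(lst, data);        // insert() also calls acquire(&lst->lock):
                            // deadlock with a non-recursive spin lock
  release(&lst->lock);
}
</pre>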

<p>There are no transparent solutions that allow the caller and callee
to be unaware of which locks they use.  One transparent, but
unsatisfactory, option is recursive locks: if a callee asks for a
lock that its caller already holds, then we allow the callee to
proceed.  Unfortunately, this solution is not ideal either.

<p>Consider the following.  If lock x protects the internals of some
struct foo, then when the caller acquires lock x, it knows that the
internals of foo are in a sane state and it can fiddle with them.
The caller must then restore them to a sane state before releasing
lock x, but until then anything goes.

<p>This assumption doesn't hold with recursive locking.  After
acquiring lock x, the acquirer knows only that either it is the first
to get this lock, in which case the internals are in a sane state, or
one of its callers holds the lock, has messed up the internals, and
didn't realize when calling this function that it was going to try to
look at them too.  So the fact that a function acquired lock x
doesn't guarantee anything at all.  In short, locks must protect
against callers and callees just as much as they protect against
other threads.

<p>Since transparent solutions aren't ideal, it is better to consider
locks part of the function's specification.  The programmer must
arrange that a caller doesn't invoke another function while holding
a lock that the callee also needs.
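
<p>One common convention (illustrative, not from xv6) is to record
the lock requirement in the function's name and comment:

<pre>
// insert_locked: caller must hold lst->lock.
void
insert_locked(struct List *lst, int data)
{
  struct Node *n = malloc(sizeof(struct Node));
  n->data = data;
  n->next = lst->head;
  lst->head = n;
}

void
insert(struct List *lst, int data)
{
  acquire(&lst->lock);
  insert_locked(lst, data);   // safe: we hold the lock
  release(&lst->lock);
}
</pre>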

<h2>Locking in xv6</h2>

<p>xv6 runs on a multiprocessor and is programmed to allow multiple
threads of computation to run concurrently.  In xv6 an interrupt
might run on one processor while a process in kernel mode runs on
another processor, sharing a kernel data structure with the interrupt
routine.  xv6 uses locks, implemented using an atomic instruction, to
coordinate concurrent activities.

<p>Let's check out why xv6 needs locks by following what happens when
we start a second processor:
<ul>
<li>1516: mp_init (called from main0)
<li>1606: mp_startthem (called from main0)
<li>1302: mpmain
<li>2208: scheduler.
  <br>Now we have several processors invoking the scheduler
  function.  xv6 had better ensure that multiple processors don't run
  the same process!  Does it?
  <br>Yes: if multiple schedulers run concurrently, only one will
  acquire proc_table_lock and proceed looking for a runnable
  process.  If it finds a process, it will mark it running, longjmp
  to it, and the process will release proc_table_lock.  The next
  instance of scheduler will skip this entry, because it is marked
  running, and look for another runnable process.
</ul>

<p>Why hold proc_table_lock during a context switch?  It protects
p->state; the process has to hold some lock to avoid a race with
wakeup() and yield(), as we will see in the next lectures.

<p>Why not a lock per proc entry?  It might be expensive in
whole-table scans (in wait, wakeup, scheduler).  proc_table_lock also
protects some larger invariants; for example, it might be hard to get
proc_wait() right with just per-entry locks.  Right now the check to
see if there are any exited children and the sleep are atomic -- but
that would be hard with per-entry locks.  One could have both, but
that would probably be neither clean nor fast.

<p>Of course, there is only one processor at a time searching the
proc table if acquire is implemented correctly.  Let's check out
acquire in spinlock.c:
<ul>
<li>1807: no recursive locks!
<li>1811: why disable interrupts on the current processor?  (If
interrupt code itself tries to take a held lock, xv6 will deadlock;
the panic will fire on 1808.)
<ul>
<li>can a process on a processor hold multiple locks?
</ul>
<li>1814: the (hopefully) atomic instruction.
<ul>
<li>see sheet 4, line 0468.
</ul>
<li>1819: make sure that stores issued on other processors before we
got the lock are observed by this processor.  These may be stores to
the shared data structure that is protected by the lock.
</ul>
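
<p>Schematically, acquire does something like this (a paraphrase of
the steps above, not the exact spinlock.c code):

<pre>
void
acquire(struct Lock *lck)
{
  cli();                        // disable interrupts: an interrupt
                                // handler that tried to take lck on
                                // this processor would deadlock
  if (holding(lck))
    panic("acquire");           // no recursive locks!
  while (TSL(&lck->locked) != 0)
    ;                           // the (hopefully) atomic instruction
  // barrier: observe stores made by the previous lock holder
}
</pre>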

<h2>Locking in JOS</h2>

<p>JOS is meant to run on single-CPU machines, so the plan can be
simple.  The simple plan is disabling/enabling interrupts in the
kernel (the IF flag in the EFLAGS register).  Thus, in the kernel,
threads release the processor only when they want to, and can ensure
that they don't release the processor during a critical section.

<p>In user mode, JOS runs with interrupts enabled, but Unix user
applications don't share data structures.  The data structures that
must be protected, however, are the ones shared in the library
operating system (e.g., pipes).  In JOS we will use special-case
solutions, as you will find out in lab 6.  For example, to implement
a pipe we will assume there is one reader and one writer.  The reader
and writer never update each other's variables; they only read each
other's variables.  By programming carefully using this rule, we can
avoid races.
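
<p>A minimal sketch of that idea (hypothetical code; the real
interface appears in lab 6): a single-reader, single-writer ring
buffer in which only the writer updates <i>write</i> and only the
reader updates <i>read</i>:

<pre>
#define PIPEBUF 512

struct Pipe {
  char buf[PIPEBUF];
  unsigned read;    // updated only by the reader
  unsigned write;   // updated only by the writer
};

// writer side: returns 0 if the buffer is full
int
pipe_put(struct Pipe *p, char c)
{
  if (p->write - p->read == PIPEBUF)
    return 0;                      // full: reader hasn't caught up
  p->buf[p->write % PIPEBUF] = c;
  p->write++;                      // the reader only reads write
  return 1;
}

// reader side: returns 0 if the buffer is empty
int
pipe_get(struct Pipe *p, char *c)
{
  if (p->read == p->write)
    return 0;                      // empty
  *c = p->buf[p->read % PIPEBUF];
  p->read++;                       // the writer only reads read
  return 1;
}
</pre>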
