Friday 25 April 2008

Reading 6. Lock Elision

Why do we need it?

In multithreaded programs, synchronization mechanisms - usually locks - are often used to guarantee that threads have exclusive access to shared data for a critical section of code. A thread acquires the lock, executes its critical section, and releases the lock. The key insight is that locks do not always have to be acquired for a correct execution. The following code demonstrates this fact:

Thread 1

LOCK(hash_tbl.lock)
var = hash_tbl.lookup(X)
if (!var)
    hash_tbl.add(X);
UNLOCK(hash_tbl.lock)

Thread 2

LOCK(hash_tbl.lock)
var = hash_tbl.lookup(Y)
if (!var)
    hash_tbl.add(Y);
UNLOCK(hash_tbl.lock)


As we can see, these two critical sections can execute concurrently without any lock, because the threads update different parts of the shared object.
Talking about the trade-offs of multithreaded programming, one can point out:
  1. Conservative locking. To ensure correctness, programmers rely on conservative locking, often at the expense of performance.
  2. Granularity. Careful program design must choose appropriate levels of locking to optimize the tradeoff between performance and ease of reasoning about program correctness.
  3. Thread-unsafe libraries. If a thread calls a library not equipped to deal with threads, a global locking mechanism is used to prevent conflicts. Again, a performance problem.
The solution is obvious now:
  1. Programmers use frequent and conservative synchronization to write correct multithreaded programs.
  2. Synchronization instructions are predicted as being unnecessary and elided.
  3. No system-level modifications are needed, only hardware support.

Speculative Lock Elision

How is the lock constructed? Let's look at typical code for lock acquire and release using the Alpha ISA:

L1: 1. ldl   t0, 0(t1)   # t0 = lock
    2. bne   t0, L1      # if not free, goto L1
    3. ldl_l t0, 0(t1)   # load locked, t0 = lock
    4. bne   t0, L1      # if not free, goto L1
    5. lda   t0, 1(0)    # t0 = 1
    6. stl_c t0, 0(t1)   # conditional store, lock = 1
    7. beq   t0, L1      # if stl_c failed, goto L1

    8-15. <critical section>

    16. stl  0, 0(t1)    # lock = 0, release lock

ldl_l/stl_c is the so-called LOAD-LOCKED/STORE-CONDITIONAL (ll/sc) pair: ll reads the location, and sc writes to it only if no other processor has written to the location since the ll.
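
For readers more comfortable with C, here is a minimal sketch of the same acquire/release using C11 atomics; a compare-and-swap stands in for the ll/sc pair, and the name hash_tbl_lock is purely illustrative:

#include <stdatomic.h>

atomic_int hash_tbl_lock = 0;          /* 0 = free, 1 = held */

void acquire(atomic_int *l)
{
    for (;;) {
        while (atomic_load(l) != 0)    /* spin on plain loads (lines 1-2) */
            ;
        int expected = 0;
        /* try to swing 0 -> 1 atomically; plays the role of
           the ldl_l/stl_c sequence (lines 3-7) */
        if (atomic_compare_exchange_weak(l, &expected, 1))
            return;                    /* lock acquired */
    }
}

void release(atomic_int *l)
{
    atomic_store(l, 0);                /* line 16: an ordinary store releases */
}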

As was said before, the key observation is that a lock does not always have to be acquired for a correct execution if hardware can provide the appearance of atomicity for all memory operations within the critical section. If a data conflict occurs, i.e., two threads compete for the same data other than for reading, atomicity cannot be guaranteed and the lock needs to be acquired. Data conflicts among threads are detected using the existing cache coherence protocol.

To understand how to pinpoint the critical section, let's look at the listing above. Lock elision can be done by simply observing load and store sequences and the values read and to be written. A filter determines candidate load/store pairs: in this implementation, only the ldl_l and stl_c instructions (which normally occur as a pair) are considered.

So the algorithm is the following (a sketch in C follows the list):

  1. If a candidate load (ldl_l) to an address is followed by a store (the stl_c of the lock acquire) to the same address, predict that another store (the lock release) will shortly follow, restoring the memory location to the value it held before this store.
  2. Predict memory operations in critical sections will occur atomically, and elide lock acquire.
  3. Execute critical section speculatively and buffer results.
  4. If hardware cannot provide atomicity, trigger misspeculation, recover and explicitly acquire lock.
  5. If second store (lock release) of step 1 seen, atomicity was not violated (else a misspeculation would have been triggered earlier). Elide lock-release store, commit state, and exit speculative critical section.
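
Here is a hypothetical C sketch of that filter, not the paper's actual hardware; the structure and function names are made up for illustration:

/* Detect the "silent pair": an ldl_l that reads value V, an stl_c to
   the same address, and later an ordinary store writing V back. */
typedef struct {
    unsigned long addr;     /* address of the lock variable          */
    long          old_val;  /* value observed by the ldl_l           */
    int           armed;    /* inside a speculative critical section */
} sle_entry;

/* on ldl_l: remember the address and the value read */
void on_load_locked(sle_entry *e, unsigned long addr, long val)
{
    e->addr = addr;
    e->old_val = val;
    e->armed = 0;
}

/* on stl_c to the same address (step 1): predict the release will
   restore old_val, elide the store, and start speculating (steps 2-3) */
int on_store_conditional(sle_entry *e, unsigned long addr)
{
    if (addr != e->addr)
        return 0;           /* not a candidate: execute normally */
    e->armed = 1;
    return 1;               /* elide the lock acquire */
}

/* on an ordinary store: if it writes old_val back to the lock, treat it
   as the release (step 5): elide it too and commit the speculation */
int on_store(sle_entry *e, unsigned long addr, long val)
{
    return e->armed && addr == e->addr && val == e->old_val;
}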


Some words about implementing SLE

To recover from an SLE misspeculation, register and memory state must be buffered until SLE is validated.
Two techniques for handling register state:
  1. Reorder buffer (ROB): each instruction is stored in the buffer in execution order, along with a slot for its result. Note that the size of the ROB places a limit on the size of the critical section (for details of the ROB see J.E. Smith & A.R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors").
  2. Register checkpoint: a copy is made at the moment we elide the lock. This may be of the dependence maps or of the architected register state itself. Instructions can then safely update the register file, retire speculatively, and be removed from the ROB, because a correct architected register checkpoint exists for recovery (a toy sketch follows this list).
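
A toy illustration of the checkpoint idea (the register count and names are made up):

#define NREGS 32

typedef struct { long r[NREGS]; } regfile;

regfile arch, checkpoint;

void begin_elision(void) { checkpoint = arch; }  /* copy at lock elision   */
void misspeculate(void)  { arch = checkpoint; }  /* restore on a violation */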
To store the memory state:
Although most modern processors support speculative load execution, they do not retire stores speculatively (i.e., write to the memory system speculatively). To support SLE, the existing processor write-buffers (between the processor and the L1 cache) are augmented to buffer speculative memory updates. Speculative data is not committed from the write-buffer into the lower memory hierarchy until the lock elision is validated. On a misspeculation, speculative entries in the write-buffer are invalidated.
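
As a rough illustration (the structure is mine, not the paper's), the augmented write-buffer entry and its two outcomes might look like:

typedef struct {
    unsigned long addr;
    long          data;
    int           valid;
    int           speculative;  /* set while in SLE mode */
} wb_entry;

/* elision validated: entries become ordinary and may drain to L1 */
void wb_commit(wb_entry *wb, int n)
{
    for (int i = 0; i < n; i++)
        wb[i].speculative = 0;
}

/* misspeculation: throw the speculative updates away */
void wb_squash(wb_entry *wb, int n)
{
    for (int i = 0; i < n; i++)
        if (wb[i].speculative)
            wb[i].valid = 0;
}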

To detect conflicts, the cache coherence protocol is used. An invalidation-based coherence protocol guarantees an exclusive copy of the memory block in the local cache when a store is performed. But a mechanism to record the memory addresses read and written within the critical section is needed:
  1. If the ROB approach is used for SLE, no additional mechanisms are required for tracking external writes to memory locations speculatively read, since the LSQ is already snooped.
  2. If the register checkpoint approach is used, the LSQ alone cannot be used to detect load violations for SLE, because loads may speculatively retire and leave the ROB. In this case, the cache can be augmented with an access bit for each cache block. Every memory access executed during SLE marks the access bit of the corresponding block. On an external request, this bit is checked in parallel with the cache tags. Any external invalidation to a block with its access bit set, or an external request to a block in exclusive state with its access bit set, triggers a misspeculation. The bit can be stored with the tags, and there is no overhead for the bit comparison because, for maintaining coherency, a tag lookup is already performed to snoop the cache. (A sketch of this check follows the list.)
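
The check itself is simple; here is a sketch with illustrative field names:

enum { INVALID, SHARED, EXCLUSIVE, MODIFIED };

typedef struct {
    unsigned long tag;
    int           state;       /* coherence state of the block  */
    int           access_bit;  /* set by any access during SLE  */
} cache_block;

/* called when an external coherence request snoops this block;
   returns nonzero if it violates the atomicity of the elision */
int conflicts(const cache_block *b, int is_invalidation)
{
    if (!b->access_bit)
        return 0;           /* block untouched during the critical section */
    if (is_invalidation)
        return 1;           /* external write to data we read or wrote */
    /* external read request: a conflict only if we hold the block
       exclusively (we may have written it speculatively) */
    return b->state == EXCLUSIVE || b->state == MODIFIED;
}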

Resource constraints, or when we should use locks

Limited resources may force a misspeculation if either there is not enough buffer space to store speculative updates, or it is not possible to monitor accessed data to provide atomicity guarantees. Four such conditions for misspeculation are:
  1. Finite cache size. If the register checkpoint approach is used, the cache may not be able to track all memory accesses.
  2. Finite write-buffer size. The number of unique cache lines modified exceeds the write-buffer size.
  3. Finite ROB size. If the ROB approach is used, the critical section cannot exceed the ROB size (with the checkpoint approach, ROB size is not a problem).
  4. Uncached accesses or events (e.g., some system calls) where the processor cannot track requests.

Transactional Lock Removal

The main idea is to use SLE for lock-free execution but with timestamp-based fair conflict resolution instead of locks (though some situations, like resource constraints, still require locks).
The algorithm is based on Lamport clocks (L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System"). Timestamps are used to decide the conflict winner: an earlier timestamp implies higher priority.
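
In code, "earlier timestamp wins" might look like the sketch below; the tie-break by processor id is the standard Lamport device for making the order total:

typedef struct {
    unsigned long clock;  /* local logical clock        */
    unsigned int  cpu;    /* processor id, breaks ties  */
} timestamp;

/* nonzero if a has priority over b, i.e. a is "earlier" */
int wins(timestamp a, timestamp b)
{
    if (a.clock != b.clock)
        return a.clock < b.clock;
    return a.cpu < b.cpu;
}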

The algorithm
  1. Calculate local timestamp
  2. Identify transaction start:
    • Initiate TLR mode (use SLE to elide locks)
    • Execute transaction speculatively.
  3. During transactional speculative execution
    • Locally buffer speculative updates
    • Append timestamp to all outgoing requests
    • If incoming request conflicts with retainable block and has later timestamp, retain ownership and force requestor to wait
    • If incoming request conflicts with retainable block and has earlier timestamp, service the request and restart from step 2b if necessary; give up any retained ownerships (see the sketch after this list).
    • If insufficient resources, acquire lock.
      • No buffer space
      • Operation cannot be undone (e.g., I/O)
  4. Identify transaction end:
    • If all blocks available in local cache in appropriate coherence state, atomically commit memory updates from local buffer into cache (write to cache using SLE).
    • Commit transaction register (processor) state.
    • Service waiters if any
    • Update local timestamp
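
Putting steps 3c and 3d together, the decision for an incoming conflicting request might be sketched as follows, reusing the timestamp type and wins() from the sketch above (the action names are hypothetical):

typedef enum {
    DEFER,                /* retain ownership; requestor waits        */
    SERVICE_AND_RESTART   /* service it, give up retained ownerships, */
} action;                 /* and restart speculation from step 2b     */

action resolve_conflict(timestamp mine, timestamp theirs)
{
    return wins(mine, theirs) ? DEFER : SERVICE_AND_RESTART;
}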

Propagating priority information

Let us now consider the following situation:
  1. There are 3 processors: P0, P1, P2.
  2. They have priorities P0 > P1 > P2.
  3. P0 has exclusive ownership of block A.
  4. P1 has exclusive ownership of block B.
  5. At time t1, P1 sends a request for A, which goes to P0 (the current owner).
  6. P0 receives the request, resolves the conflict, wins it (P0 > P1), and defers the response.
  7. Now P1 exclusively owns block A (because P1's request has been ordered by the protocol), but the data (and write permission) is still with P0.
  8. P1 waits for P0 for cache block A.
  9. P2 sends a request for block B, which goes to P1.
  10. The situation is just the same as above: P2 owns B exclusively, but write permission is still with P1.
  11. P2 waits for P1.
  12. P0 sends a request for B, and it is forwarded to P2 (which owns B, though it does not have the data itself).
  13. P2 loses the conflict and should give block B to P0.
  14. But P2 is waiting for P1 to release block B, and P1 is waiting for P0 to release block A.
  15. Deadlock!!!
Now let's think over the situation. The deadlock happened because the information about priorities was not propagated along the cache coherence protocol! To solve the problem we need the following:
  1. Marker message: a directed message sent in response to a request for a block under conflict for which data is not provided immediately.
  2. Probes: used to propagate a conflicting request upstream in a cache coherence protocol chain. Thus, when P2 receives P0's request for B, P2 forwards the probe (with P0's timestamp) to P1, since P2 received a marker message from P1. P1 receives P0's forwarded probe (via P2) and loses the conflict because P0 has higher priority than P1. P1 releases ownership of block B and the cyclic wait is broken.
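
A sketch of the probe handling at P2, again reusing the timestamp type above; pending_block and forward_probe are hypothetical names, and the message send itself is elided:

void forward_probe(unsigned cpu, timestamp ts);  /* network send, not shown */

/* per-block record: our own request was deferred (we received a
   marker message), so we know who sits upstream of us in the chain */
typedef struct {
    int      waiting;        /* we are stalled on this block  */
    unsigned upstream_cpu;   /* sender of the marker message  */
} pending_block;

/* on receiving a probe for a block we own but have no data for:
   forward it upstream so the true holder can resolve the conflict */
void on_probe(const pending_block *p, timestamp probe_ts)
{
    if (p->waiting)
        forward_probe(p->upstream_cpu, probe_ts);
}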