Sunday 19 October 2008

Reading 7. HTM. Herlihy and Moss 1993. Transactional memory

Herlihy and Moss, ISCA 1993: Transaction memory

This paper coined the term transactional memory, and identified the use of cache mechanisms for performing optimistic synchronization to multiple memory locations in a shared-memory multiprocessor. This was in contrast to the Oklahoma Update proposal that used reservation registers instead of a transactional cache.

Programming Interface
The proposal introduces six new instructions: load-transactional, load-transactional-exclusive, store-transactional, commit, abort, and validate (all those commands are pretty self-describing). The programmer uses these instructions for lock-free data structures and was responsible for saving register state and for ensuring forward progress. Transactions is expected to be short lived and complete in one scheduling quantum.

The following code sequence demonstrates the use of the new instructions to insert an element into a doubly linked list:

// Usage of new instructions to construct data structure
typedef struct list_elem {
  struct list_elem *next;
  struct list_elem *prev;
  int value;
} entry;

entry *Head, *Tail;

void Enqueue(entry* new) {
  entry *old_tail;
  unsigned backoff = BACKOFF_MIN;
  unsigned wait;
  new->next = new->prev = NULL;
  while (TRUE) {
    old_tail = (entry*) LOAD_TRANSACTIONAL_EXCLUSIVE(&Tail);
    if (VALIDATE()) { // ensure transaction still valid
      STORE_TRANSACTIONAL(&new->prev, old_tail);
      if (old_tail == NULL) {
        STORE_TRANSACTIONAL(&Head, new);
      } else {
        STORE_TRANSACTIONAL(&old_tail->next, new);
      }

      STORE_TRANSACTIONAL(&Tail, new);
      if (COMMIT()) // try to commit
        return;
    }
    
    wait = random() % (01 << backoff);
    while (wait--);
    if (backoff < BACKOFF_MAX)
      backoff++;
  }
}

Implementation

A regular CPU is adjusted with fully-associative transactional cache just to store per-transaction data. Also processor equipped with two flags for saving transaction state. The next are not actually new: cache coherence protocol enhanced with handling new cache and flags, sniffing on other processors doings. The paper itself digs into complete description of that protocol including state machine.

As transactional cache is limited in size and can grew too large the authors proposed to use software emulation of larger cache. This allows using hardware for the common case and software for the exceptional case.

Classification

Strong or weak isolationStrong
Transaction granularityCache line
Direct or deferred updateDeferred (in cache)
Concurrency controlOptimistic
Conflict detectionEarly
Inconsistent readsYes
Conflict resolutionReceiver NACKs/Requestor software backoff
Nested transactionN/A

Materials
  • J.R. Larus, R.Rajwar Transactional Memory, 4.3.5
  • M. Herlihy and J. E. B. Moss, “Transactional memory: Architectural support for lock-free data structures,” In Proc. 20th Annu. Int. Symp. on Computer Architecture, pp. 289–300 May 1993

Friday 30 May 2008

Reading 8. Concurrent Haskell and HSTM

1. Gentle introduction to IO Monad in Haskell

Our following talk about model of Software Transactional Memory in Haskell will be meaningless without discussing concepts of concurrent Haskell.
Main ideas of concurrent Haskell were described in paper Concurrent Haskell by Simon Peyton Jones, Andrew Gordon and Sigbjorn Finne. In actual fact concurrent Haskell is simple extension of pure lazy-evaluated functional Haskell language. It adds two main new ingredients to Haskell:
  • processes, and a mechanism for process initiation
  • atomically-mutable state, to support inter-process communication and cooperation
Following the tradition it may seem strange to talk about some concurrency in pure functional languages like Haskell is because of concurrency concepts suppose existence of mutable entities and states. The most common way to emulate mutable state in Haskell is to wrap our computation result into Monad entity. From this point of view our program become a sequence of new monads creation. This approach helps also to emulate strict sequence of computations which is non-obvious in lazy-evaluated Haskell.
Let's recall what Monad is. Speaking strictly, Monad is tuple (M, return, >>=) where M is type designator, return is operator to wrap our computation into monad entity and >>= or bind operator represents monadic evaluations themselves. Their types are correspondingly

type Monad = M a
(>>=) :: M a → (a → M b) → M b
return :: a → M a

In some it is convenient to use reduced bind operator:

(>>) :: M a → M b → M b

In other words it ignores value wrapped into its first argument.
Most common examples of monads in Haskell are [], Maybe and IO. The last is most interesting for our purposes. In a non-strict language it is completely impractical to perform input/output using “side-effecting functions", because the order in which sub-expressions are evaluated | and indeed whether they are evaluated at all | is determined by the context in which the result of the expression is used, and hence is hard to predict. This difficulty can be addressed by treating an I/O-performing computation as a state transformer; that is, a function that transforms the current state of the world to a new state. In addition, we need the ability for an I/O-performing computation to return a result. This reasoning leads to the following type definition:

type IO a = World -> (a, World)

That is, a value of type IO t takes a world state as input and delivers a modified world state together with a value of type t. Of course, the implementation performs the I/O right away - thereby modifying the state of the world “in place".
We call a value of type IO t an action. Here are two useful complete actions:

hGetChar :: Handle -> IO Char
hPutChar :: Handle -> Char -> IO ()

The action hGetChar reads a character from the specified handle (which identifies some file or other byte stream) and returns it as the result of the action. hPutChar takes a handle and a character and return an action that writes the character to the specified stream.
Actions can be combined in sequences using infix combinators >> and >>= described above. For example here is an action that reads a character from the standard input, and then prints it twice to the standard output:

hGetChar stdin >>= \c ->
hPutChar stdout c >>
hPutChar stdout c

The notation \c->E, for some expression E, denotes a lambda abstraction. In Haskell, the scope of a lambda abstraction extends as far to the right as possible; in this example the body of the \c-abstraction includes everything after the \c. The sequencing combinators, >> and >>=, feed the result state of their left hand argument to the input of their right hand argument, thereby forcing the two actions (via the data dependency) to be performed in the correct order. The combinator >> throws away the result of its first argument, while >>= takes the result of its first argument and passes it on to its second argument. The similarity of monadic I/O-performing programs to imperative programs is no surprise: when performing I/O we specifically want to impose a total order on I/O operations.
It is often also useful to have an action that performs no I/O, and immediately returns a specified value using return operator. For example, an echo action that reads a character, prints it and returns the character read, might look like this:

echo :: IO Char
echo = hGetChar stdin >>= \c →
hPutChar stdout >>
return c

So, the resulting program which can be compiled to do something might look like

main :: IO ()
main = echo >>= \c ->
if c == '\n'
then return ()
else main

In principle, then, a program is just a state transformer that is applied to the real world to give a new world. In practice, however, it is crucial that the side-effects the program specifies are performed incrementally, and not all at once when the program finishes.

2. Processes in Haskell

Concurrent Haskell provides a new primitive forkIO, which starts a concurrent process:

forkIO :: IO () -> IO ()

forkIO is an action which takes an action, a, as its argument and spawns a concurrent process to perform that action. The I/O and other side effects performed by a are interleaved in an unspecified fashion with those that follow the forkIO. Here's an example:


let
-- loop ch prints an infinite sequence of ch's
loop ch = hPutChar stdout ch >> loop ch
in
forkIO (loop 'a') >>
loop 'z'

The forkIO spawns a process which perform the action loop 'a'. Meanwhile, the “parent“ process continues on to perform loop 'z'. The result is than an infinite sequence of interleaved 'a's and 'z's appears on the screen. The exact interleaving is unspecified.
As a more realistic example of forkIO in action a mail tool night incorporate the following loop:

mailLoop
= getButtonPress b >>= \ v ->
case v of
Compose -> forkIO doCompose >>
mailLoop
...other things
doCompose :: IO () -- Pop up and manage
doCompose = ... -- composition window

Here, getButtonPress is very like hGetChar; it awaits the next button press on button b, and then delivers a value indicating which button was pressed. This value is then treated by the case expression. If its value is Compose, then the action doCompose is forked to handle an independent composition window, while the main process continues with the next getButtonPress.
There are some interesting questions related to concurrency in Haskell.

1. Let's imagine that we have to evaluate value named 'c'. In common Haskell this value is represented internally by pointer to some closure which will be called and evaluate this value when someone will need in it. That's the famous Haskell's laziness.
Now each of concurrent processes may need in this value. Then the first who provoked 'c''s evaluation replaces pointer to 'c''s closure to some temporary object named thunk. Thunk indicates that value 'c' is currently evaluating. Other processes wait until evaluation ends.

2.
Since the parent and child processes may both mutate (parts of) the same shared state (namely, the world), forkIO immediately introduces non-determinism. For example, if one process decides to read a file, and the other deletes it, the effect of running the program will be unpredictable. While this non-determinism is not desirable, it is not avoidable; indeed, every concurrent language is non-deterministic. The only way to enforce determinism would be by somehow constraining the two processes to work on separate parts of the state (different files, in our example). The trouble is that essentially all the interesting applications of concurrency involve the deliberate and controlled mutation of shared state, such as screen real estate, the file system, or the internal data structures of the program. The right solution, therefore, is to provide mechanisms which allow (though alas they cannot enforce) the safe mutation of shared state section.

3. forkIO is asymmetrical: when a process executes a forkIO, it spawns a child process that executes concurrently with the continued execution of the parent. It would have been possible to design a symmetrical fork, an approach taken by Jones & Hudak:

symFork :: IO a -> IO b -> IO (a,b)

The idea here is symFork p1 p2 is an action that forks two processes, p1 and p2. When both complete, the symFork pairs their results together and returns this pair as its result. We rejected this approach because it forces us to synchronize on the termination of the forked process. If the desired behavior is that the forked process lives as long as it desires, then we have to provide the whole of the rest of the parent as the other argument to symFork, which is extremely inconvenient.

3. Synchronization and communication

To make our processes interact with each other and organize synchronization between them we introduce spectial type

type MVar a

This is simple memory cell which can contain any value of type a or be empty. We define following primitive operations on MVars:

newMVar :: IO (MVar a)
creates a new MVar.

takeMVar :: MVar a -> IO a
blocks until the location is non-empty, then reads and returns the value, leaving the location empty.

putMVar :: MVar a -> a -> IO ()
writes a value into the specified location. If there are one or more processes blocked in takeMVar on that location, one is thereby allowed to proceed. It is an error to perform putMVar on a location which already contains a value.

Notice that MVar can be considered in different ways:
as a channel for messages exchange between processes
MVar () is simple semaphor, putMVar denotes rising and takeMVar is sinking

With MVar we now can solve simple problem of “Producer and Customer” in case when Producer produces faster than customer can take. For this we'll make buffered slot CVar for Customer to take from.

type CVar a = (MVar a, -- Producer -> consumer
MVar ()) -- Consumer -> producer

newCVar :: IO (CVar a)
newCVar
= newMVar >>= \ data_var ->
newMVar >>= \ ack_var ->
putMVar ack_var ( ) >>
return (data_var, ack_var)

putCVar :: CVar a -> a -> IO ()
putCVar (data_var,ack_var) val
= takeMVar ack_var >>
putMVar data_var val

getCVar :: CVar a -> IO a
getCVar (data_var,ack_var)
= takeMVar data_var >>= \ val ->
putMVar ack_var () >>
return val

4. Haskell Software transactional Memory (HSTM)

Implementation of transactional memory in Haskell resembles IO abstraction. It introduces terms of special monadic type which represent atomic blocks. It also adds special mechanism to compose two transaction as alternatives. Main characteristics of HSTM look like this:

Strong or Weak Isolation Strong
Transaction Granularity Word
Direct or Deferred Update Deferred (cloned replacement)
Concurrency Control Optimistic
Synchronization Blocking
Conflict Detection Late
Inconsistent Reads Inconsistency toleration
Conflict Resolution None
Nested Transaction None (not allowed by type system)
Exceptions Abort

So, we add new type of terms into language – STM monad. It has semantics similar to IO monad but it “marks” terms which will be treated as atomic blocks. To wrap them into habitual atomic block we use function atomic.

atomic :: STM a → STM a

By analogy with MVar type in plain concurrent Haskell we introduce also TVar type (transactional) to represent values stable against transactional operations.

type TVar a
readTVar :: TVar a → STM a
writeTVar :: TVar a → a → STM()

For example let's consider such variable containing some integer values and operations on it.

type Resource = TVar Int
putR :: Resource -> Int -> STM ()
putR r i = do { v <- readTVar r; writeTVar r (v+i) }

The atomic function transactionally committed (or aborted) these updates:

main = do { ...; atomic (putR r 3); ... }

HSTM also introduced an explicit retry statement as a coordination mechanism between transactions. The retry statement aborts the current transaction and prevents it from reexecuting until at least one of the TVars accessed by the transaction changes value. For example,

getR :: Resource → Int → STM ()
getR r i = do { v <- readTVar r
; if (v < i) then retry
else writeTVar r (v-i) }

atomically extracts i units from a Resource. It uses a retry statement to abort an enclosing transaction if the Resource does not contain enough units. If this function executes retry, r is the only TVar read, so the transaction re-executes when r changes value.
HSTM also introduced the binary orElse operator for composing two transactions. This operator first starts its left-hand transaction. If this transaction commits, the orElse operator finishes. However, if this transaction retries, the operator tries the right-hand transaction instead. If this one commits, the orElse operator finishes. If it retries, the entire orElse statement waits for changes in the set of TVars read by both transactions before retrying. For example, this operator turns getR into an operation that returns a true/false success/failure result:

nonBlockGetR :: Resource -> Int ->STM Bool
nonBlockGetR r i = do { getR r i ; return True }‘orElse‘ return false

Notice, that retry operator “retries” largest enclosing term which has STM type.
Some words about implementation. Al transaction treads and writes to TVars deal with special transactional log which hides these variables references form other transactions. When the transaction commits, it first validates its log entries, to ensure that no other transaction modified the TVars values. If valid, the transaction installs the new values in these variables. If validation fails, the log is discarded and the transaction re-executed.
If a transaction invokes retry, the transaction is validated (to avoid retries caused by inconsistent execution) and the log discarded after recording all TVars read by the transaction. The system binds the transaction’s thread to each of these variables. When a transaction updates one of these variables, it also restarts the thread, which re-executes the transaction.
The orElse statement requires a closed nested transaction to surround each of the two alternatives, so that either one can abort without terminating the surrounding transaction. If either transaction completes successfully, its log is merged with the surrounding transaction’s log, which can commit. If either or both transactions invoke retry, the outer transaction waits on the union of the TVars read by the transactions that retried.

5. Garbage Collection in Concurrent Haskell


At the end it would be good to tell some words about garbage collection in Concurrent Haskell. It's interesting problem to collect processes which become “useless”. There is obvious strategy to do it if we ensure, that process we want to collect will not have “side effects” further. We can formulate two rules to do it:
  1. Running process cannot be collected
  2. We can collect process which holds some MVar if this variable is now inaccessible for any other non-garbage process.
At last classic “mark-and-sweep” tracing procedure can be implemented on processes:
  1. When tracing accessible heap objects, treat all runnable processes as roots.
  2. When some MVar is identified as reachable, identify all processes blocked by it as reachable ones..

List of used papers:

  1. Simon Peyton Jones, Andrew Gordon, Sigbjorn Finne “Concurrent Haskell”
  2. Paul Hudak, Simon Peyton Jones, Philip Wadler, Brian Boutel, Jon Fairbairn, Joseph Fasel, María M. Guzmán, Kevin Hammond, John Hughes, Thomas Johnsson, Dick Kieburtz, Rishiyur Nikhil, Will Partain, John Peterson “Report on the programming language Haskell: a non-strict, purely functional language version 1.2”




Wednesday 28 May 2008

Reading 7. HTM. Multiple atomic read-write operations

Stone et al., IEEE Concurrency 1993: Multiple atomic read-write operations ("Oklahoma Update")

Let's extend compare&swap command set to support multiple memory locations.
  • read-and-reserve - reads a memory location into a specified general-purpose register, places a reservation on the location’s address in the reservation register, and clears the reservation register’s data field.
  • store-contingent - locally updates the reservation register’s data field without obtaining write permissions.
  • write-if-reserved - specifies a set of reservation registers and updates the memory locations reserved by those registers. It is used to initiate the commit process. It attempts to obtain exclusive ownership for each of the addresses in the reservation registers. If the reservations remain valid during this process, the instruction updates memory with the modified data from the reservation registers. The instruction returns an indication whether the update succeeded or not.
Example

Having such commands we can write i.e. synchronized queue:
void Enqueue(newpointer) {
Memory[newpointer].next = NULL;
status = 0;

while (!status) {
last_pointer = Read_and_Reserve(Memory[tail].next, reservation1);
if (last_pointer == NULL) {
// this is an empty queue
first_pointer = Read_and_Reserve(Memory[head].next, reservation2);
Store_Contingent(newpointer, reservation1);
Store_Contingent(newpointer, reservation2);
status = Write_If_Reserved(reservation1, reservation2);
} else {
// non-empty queue
temp_pointer = Read_and_Reserve(Memory[last_pointer].next, reservation2);
Store_Contingent(newpointer, reservation1);
Store_Contingent(newpointer, reservation2);
status = Write_If_Reserved(reservation1, reservation2);
}
} // repeat until successful

return;
}
Implementation

Compared to compare&swap decomposition, hardware implementation for Oklahoma Update is much more complex.
Instead of one xa-address register this implementation requires number of reservation registers per CPU to store memory locations and various flags. Those are places
read-and-reserve and store-contingent writes to. And write-if-reserved can be called commit operation since it commits all local writes to shared memory.

Basically, commit operation consists of two phases:
  • Requesting write permissions. CPU ensures having exclusive write lock to every location from reservation registers. To avoid deadlocking to commit operations from other processors, CPU acquires locks in address-ascending manner.

    If it can't obtain every one lock entire conflict resolution takes place. CPU may try to restart the process after some time or abort the operation.
  • Commiting data values. As soon as processor has locks it starts uninterruptable operation to write data.
Interactions between the new instructions and regular store operations introduce forward-progress concerns. Regular stores do not participate in the new instructions’ conflict resolution mechanism. If a regular store from one processor conflicted with an address specified in a reservation register of another processor, this processor would abort its update. See more on this in the original paper.

Classification

Strong or weak isolationStrong
Transaction granularityCache line
Direct or deferred updateDeferred (in reservation registers)
Concurrency controlOptimistic. Commit initiates acquiring ownership
Conflict detectionLate write-write conflict (if not a regular store)
Late write-read conflict (if not a regular store)
Inconsistent readsNone
Conflict resolutionAddress-based rwo-phase commit
Nested transactionN/A


Sources
  • J.R. Larus, R.Rajwar Transactional Memory, 4.3.4
  • J. M. Stone, H. S. Stone, P. Heidelberger and J. Turek, “Multiple reservations and the Oklahoma update,” IEEE Concurrency, Vol. 1(4), pp. 58–71 Nov. 1993.

Reading 7. HTM. Compare&Swap decomposition

Jensen,Hagensen, and Broughton, UCRL 1987: support for optimistic synchronization using single memory location.

Complex compare&swap instruction can be splitted into simplier parts:
  • sync load. One memory location are loaded into special register (called xa-address), that indicates the processor requires exclusive access to this address, and loads the accessed data into a general register.
  • sync store. Stores data to previously saved address in xa-address if no other processor stored anything there (i.e. no conflict occurred) or conflict resolution decided to allow write by this processor.
  • sync clear. Clears xa-address.
A simple locking can be implemented in those instructions as follows:

// if (lock == 0) { lock = ProcessID; } % atomically
// else goto LockHeld... % lock was held
Retry:
sync_load R10, lock ; declare exclusive intent
jump_q .neq (R10,0), LockHeld ; test for zero
sync_clear ; lock non-zero, hence abort
load R10, ProcessID ; prepare to update lock
sync_store R10, lock ; update lock if not aborted
goto Retry ; try the update again
MyLock: ; got the lock
Implementation

sync_store instruction broadcasts write operation on all processors thus detecting conflicts with other processors having same xa-address. If such conflict detected, one processor clears own xa-address or aborts store operation (depends on conflict resolution scheme).

Real-world usage

MIPS, Alpha, PowerPC implemented variations to this scheme

Classification

Strong or weak isolation
Strong
Transaction granularity
Cache line
Direct or deferred update
Conditional direct store (single word)
Concurrency control
Optimistic (single word)
Conflict detection
N/A
Inconsistent reads
None
Conflict resolution
Processor id or first to request ownership
Nested transaction
N/A


Sources
  • J.R. Larus, R.Rajwar Transactional Memory, 4.3.3
  • E. H. Jensen, G. W. Hagensen and J. M. Broughton, “A new approach to exclusive data access in shared memory multiprocessors,” Lawrence Livermore National Laboratory, Technical Report UCRL-97663, Nov. 1987.

Reading 7. HTM. A side story

Knight 1986, paralleling a single-threaded programs

A compiler divides a program into a series of code blocks called transactions. For doing the division, the compiler assumes that these transactions do not have memory dependencies. These blocks then execute optimistically on the processors. The hardware enforces correct execution and uses caches to detect when a memory dependence violation between threads occurs.

This is the first paper, that proposed to use caches and cache coherence to maintain ordering among speculatively parallelized regions of a sequential code in the presence of unknown memory dependencies. While the paper did not directly address explicitly parallel programming, it set the groundwork for using caches and coherence protocols for future transactional memory proposals.

Implementation

Firstly, the compiler divides a program into sequence of "mostly independent" series of blocks (transactions). Than those blocks runs on shared-memory multiprocessors with own register state. All write memory operations cached in confirm cache. Every processor also stores all read operations in special dependency cache. Also write operations from other processors detected and used to update dependency cache (so called, bus sneaking).

All transactions are committed one-by-one as prescribed by original single-threaded program. Stored dependencies used to detect write conflicts. If one occurs, failed transaction simply runs again.

Classification
Strong or weak isolation Strong
Transaction granularity Cache line
Direct or deferred update Deferred (in cache)
Concurrency control Optimistic. Commit serialized globally
Conflict detection Late write-write conflict
Late write-read conflict
Inconsistent reads No
Conflict resolution Program order (sequential program)
Nested transaction N/A


Sources

Friday 25 April 2008

Reading 6. Lock Elision

Why do we need it?

In multithreaded programs, synchronization mechanism - usually locks - are often used to guarantee threads have exclusive access to shared data for a critical section of code. A thread acquires the lock, executes its critical section, and release the lock. The key insight is that locks do not always have to be acquired for a correct execution. The following code demonstrates this fact:

Thread 1

LOCK(hash_tbl.lock)
var = hash_tbl.lookup(X)
if (!var)
hash_tbl.add(X);
UNLOCK(hash_tbl.lock)


Thread 2

LOCK(hash_tbl.lock)
var = hash_tbl.lookup(Y)

if (!var)

hash_tbl.add(Y);

UNLOCK(hash_tbl.lock)


As we can see here this can be executed without any lock, because threads are updating different parts of the shared object.
While talking about the trade-offs of the multithreaded programming one can point:
  1. Conservative locking. To ensure correctness, programmers rely on conservative locking, often at the expense of performance.
  2. Granularity. Careful program design must choose appropriate levels of locking to optimize the tradeoff between performance and ease of reasoning about program correctness.
  3. Thread-unsafe libraries. If a thread calls a library not equipped to deal with threads, a global locking mechanism is used to prevent conflicts. And again the performance problem.
The solution is obvious now:
  1. Programmers use frequent and conservative synchronization to write correct multithreaded programs.
  2. Synchronization instructions are predicted as being unnecessary and elided.
  3. There is no system-level modification, only hardware support needed

Speculative Lock Elision

How the lock is constructed? Let's look at typical code for the lock-acquire and release using the Alpha ISA:

Ll: l. ldl tO, O(tl) # tO = lock
2. bne tO, LI: #if not free, goto L1
3. ldl_l tO, O(tl) #load locked, tO = lock
4. bne tO, LI: #if not free, goto L1
5. lda tO, 1(O) #tO = 1
6. stl_c tO, O(tl) #conditional store, lock = 1
7. beq tO, Li: #if stlc failed, goto Li,

8-15.

16. stl 0,0(tl) #lock = O, release lock

ldl_l/stl_c is the pair of so called LOAD-LOCKED/STORE-CONDITIONAL (ll/sc) instructions. ll reads the data and sc writes, if from the moment of reading the location nobody writes to it.

As it was said before the key observation is that a lock does not always have to be acquired for a correct execution if hardware can provide the appearance of atomicity for all memory operations within the critical section. If a data conflict occurs, i.e., two threads compete for the same data other than for reading, atomicity cannot be guaranteed and the lock need to be acquired. Data conflicts among threads are detected using existing cache coherence protocol.

To understand how point the critical section lets just look at the listing of the instructions above. The lock elision can be done by simply observing load and store sequences and the values read and to be written. We should use a filter to determine candidate load/store pairs. For example, in this implementation, only instructions tdl_t and stl c (they normally occur as a pair) are consider.

So the algorithm is the following:

  1. If candidate load (ldl_l) to an address is followed by store (stl_c of the lock acquire) to same address, predict another store (lock release) will shortly follow, restoring the memory location value to the one prior to this store (stl_c of the lock acquire).
  2. Predict memory operations in critical sections will occur atomically, and elide lock acquire.
  3. Execute critical section speculatively and buffer results.
  4. If hardware cannot provide atomicity, trigger misspeculation, recover and explicitly acquire lock.
  5. If second store (lock release) of step 1 seen, atomicity was not violated (else a misspeculation would have been triggered earlier). Elide lock-release store, commit state, and exit speculative critical section.


Some words about implementing SLE

To recover from an SLE misspeculation, register and memory state must be buffered until SLE is validated.
Two techniques for handling register state:
  1. Reorder buffer (ROB): each instruction is stored in buffer in the execution order, also there is a place in buffer for the instruction to store the result. Pay attention that the size of the ROB places a limit on the size of the critial section (for more details of the ROB look - J.E.Smith & A.R.Pleszkun. Implementation of Precise Interrupts in Pipeline processors)
  2. Register checkpoint: the idea is that a copy is done at the moment we elide the lock. This may be of dependence maps or of the architected register state itself. Instructions can be safely update the register file, speculatively retired, and be removed from the ROB because a correct architected register checkpoint exists for the recovery.
To store the memory state:
Although most modern processors support speculative load execution, they do not retire stores speculatively (i.e., write to the memory system speculatively). For supporting SLE, we augment existing processor write-buffers (between the processor and L1 cache) to buffer speculative memory updates. Speculative data is not committed from the write-buffer into the lower memory hierarchy until the lock elision is validated. On a
misspeculation, speculative entries in the write-buffer are invalidated.

To detects the conflicts the cache coherence protocol is used. Invalidation-based coherence protocol guarantee an exclusive copy of the memory block in the local cache when a store is performed. But the mechanism to record memory addresses read and write within the critical section is needed:
  1. If the ROB approach is used for SLE, no additional mechanisms are required for tracking external writes to memory locations speculatively read--the LSQ is already snooped.
  2. If the register checkpoint approach is used, the LSQ alone cannot be used to detect load violations for SLE because loads may speculatively retire and leave the ROB. In this case, the cache can be augmented with an access bit for each cache block. Every memory access executed during SLE marks the access bit of the corresponding block. On an external request, this bit is checked in parallel with the cache tags. Any external invalidation to a block with its access bit set, or an external request to a block in exclusive state with its access bit set, triggers a misspeculation. The bit can be stored with the tags and there is no overhead for the bit comparison because, for maintaining coherency, a tag lookup is already performed to snoop the cache.
Resource constraints or when we should use locks

Limited resources may force a misspeculation if either there is not enough buffer space to store speculative updates, or it is not possible to monitor accessed data to provide atomicity guarantees. Four such conditions for misspeculation are:
  1. Finite cache size. If register checkpoint is used, the cache may not be able to track all memory accesses.
  2. Finite write-buffer size. The number of unique cache lines modified exceeds the write-buffer size.
  3. Finite ROB size. If the checkpoint approach is used, the ROB size is not a problem.
  4. Uncached accesses or events (e.g., some system calls) where the processor cannot track requests.

Transactional Lock Removal

The main idea is to use SLE for lock-free execution but to use timestamp-based fair conflict resolution instead of locks (though some stituations - like recources constraints - still need locks).
The algorithm is based on Lamport clocks (L.Lamport. Time, clocks, and the ordering of events in a distributed system). Timestamps are used for resolving conflicts to decide a conflict winner - earlier timestamp implies higher priority.

The algorithm
  1. Calculate local timestamp
  2. Identify transaction start:
    • Initiate TLR mode (use SLE to elide locks)
    • Execute transaction speculatively.
  3. During transactional speculative execution
    • Locally buffer speculative updates
    • Append timestamp to all outgoing requests
    • If incoming request conflicts with retainable block and has later timestamp, retain ownership and force requestor to wait
    • If incoming request conflicts with retainable block and has earlier timestamp, service request and restart from step 2b if necessary. Give up any retained ownerships.
    • If insufficient resources, acquire lock.
      • No buffer space
      • Operation cannot be undone (e.g., I/O)
  4. Identify transaction end:
    • If all blocks available in local cache in appropriate coherence state, atomically commit memory updates from local buffer into cache (write to cache using SLE).
    • Commit transaction register (processor) state.
    • Service waiters if any
    • Update local timestamp

Propagating priority information

Let us consider now the following situation:
  1. There are 3 processors - P0, P1, P2
  2. They have priorities: P0>P1>P2
  3. P0 has an exclusive ownership for a block A
  4. P1 has an exclusive ownership for a block B
  5. At time t1 P1 send a request for A, that goes to Po (because now he is the owner)
  6. P0 receives the response, resolve conflict and so (P0>P1) wins it and defers the response
  7. Now P1 exclusively owns block A (because P1's request has been ordered by the protocol) but data (and write permission) is with P0 still
  8. P1 is waiting for P0 for cache block A
  9. P2 sends a request for P1 for the block B
  10. Situation is just the same as above - P2 owns B exclusively, but write permission is still with P1
  11. P2 is awaiting for P1
  12. P0 sends a request for B and it's forwarded to P2 (that owns B, though does not have data itself)
  13. P2 loses the conflict and should give block B to P0
  14. But P2 is awaiting for P1 to release the block B and P1 is awaiting fot P0 to release the block A
  15. deadlock!!!
Now let's think over the situation. The deadlock happened because the information about priorities was not propogated along the cache coherence protocol! To solve the problem we need the following:
  1. Marker message - directed message sent in responce to a request for a block under conflict for which data is not provided immediately.
  2. Probes - are used to propagate a conflict request upstream in a cache coherence protocol chain. Thus, when P2 receives P0's request for B, P2 forwards the probe (with P0's timestamp) to P1 since P2 received a marker message from P1. PI receives P0's forwarded probe (via P2) and loses the conflict because P0 has higher priority than P1. P1 releases ownership of block B and the cyclic wait is broken.

Thursday 27 March 2008

Reading 5. Improvements on DSTM and WSTM

Contention policies
  1. Aggressive. The acquiring transaction terminates any conflicting transaction
  2. Polite. The acquiring transaction uses exponential backoff to delay for a fixed number of exponentially growing intervals before aborting the other transaction. After each interval, the transaction checks if the other transaction has finished with the object.
  3. Timestamp. This manager aborts any transaction that started execution after the acquiring transaction.
  4. Published Timestamp. This manager follows the timestamp policy, but also aborts older transactions that appear inactive.
  5. Greedy. This manager aborts the transaction that has executed the least amount of time
  6. Kindergarten. This manager maintains a “hit list” of transactions to which a given transaction previously deferred. If the transaction holding the object is on the list, the acquiring transaction immediately terminates it. If it is not on the list, the acquiring transaction adds it to the list and then backs off for a fixed interval before aborting itself. This policy ensures that two transactions sharing an object take turns aborting (hence the name).
  7. Karma. This manager uses a count of the number of objects that a transaction has opened (cumulative, across all of its aborts and reexecutions) as a priority. An acquiring transaction immediately aborts another transaction with lower priority. If the acquiring transaction’s priority is lower, it backs off and tries to reacquire the object N times, where N is the difference in priorities, before aborting the other transaction.
  8. Eruption. This manager is similar to Karma, except that it adds the blocked transaction’s priority to the active transaction’s priority; to help reduce the possibility that a third transaction will subsequently abort the active transaction.
  9. Polka. This is a combination of the Polite and Karma policies. The key change to Karma is to use exponential backoff for the N intervals.
SXM
SXM is a deferred, object-based STM system implemented as a library for C# code. It is
similar in operation to DSTM system, although implemented in .NET.

Strong or Weak IsolationWeak
Transaction GranularityObject
Direct or Deferred UpdateDeferred (cloned replacement)
Concurrency ControlOptimistic
SynchronizationNonblocking (obstruction free)
Conflict DetectionEarly
Inconsistent ReadsInconsistency toleration
Conflict ResolutionExplicit contention manager
Nested TransactionClosed
Exceptions

Key features:

  • polymorphic contention management, which is a framework for managing conflicts between transactions
  • run-time code generation to produce some of the boilerplate code
Each transaction selects a contention manager from a collection of managers that implement a
diverse set of policies
. All policies are classified based on the cost of the state that they maintain

RankPolicy ClassPolicyState
1
Aggressive, Polite
2
Ad hoc
Greedy, KillblockedTransaction start time
3Local
TimestampTransaction start time, variable
4
KindergartenList of transactions
5Historical
Karma, Polka, EruptionList of objects

A higher–numbered policy is more expensive to compute than a lower–ranked one; but it
is not necessarily more predictive of future behavior. Ranking identifies policies that are comparable to each other.
SXM assumes that two transactions with policies from different policy classes were not intended to conflict, so neither transaction’s policy is preferable and uses the Greedy policy.
If the two transactions’ policies belong in the same class, SXM applies the conflict policy from the transaction that wants to acquire the object.

ASTM
Adaptive STM(ASTM) is a deferred, object-based STM system by Marathe, Scherer, and Scott that explored several performance improvements to the basic design of DSTM system

Strong or Weak IsolationWeak
Transaction GranularityObject
Direct or Deferred UpdateDeferred (cloned replacement)
Concurrency ControlOptimistic
SynchronizationNonblocking (obstruction free)
Conflict DetectionEarly or late (selectable)
Inconsistent ReadsValidation
Conflict ResolutionExplicit contention manager
Nested Transaction
Exceptions

Key features:
  • ASTM eliminated a memory indirection in accessing fields in an object open for reading
  • adaptive system to change its conflict resolution policy from early to late detection
An object opened for update has the same representation as in DSTM, with two levels of indirection between the TMObject and the object’s fields
An object opened for reading by a transaction has a single level of indirection, so the TMObject points directly to the object’s fields

Late and early conflict detection
With late conflict detection, a transaction does not acquire ownership of an object that it is modifying. Instead, it modifies a cloned copy of the object and defers conflict detection until the transaction commits, which reduces the interval in which an object is locked and avoids some unnecessary transaction aborts. When the transaction commits, it may find that the object modified or may find another transaction in the process of committing changes to the object. The first case causes the transaction to abort, while the second one invokes a contention manager to resolve the conflict

Late and early conflict detection have comparable overhead for most benchmarks, but early detection is easier to implement. Late acquire performs better for a long running transaction that reads a large number of objects and updates only a few, because this policy allows concurrent readers and writers.

ASTM implements an adaptive policy that recognizes a transaction that modifies few objects but reads many objects . ASTM then switches the thread executing the transaction from early to late detection for subsequent transactions. If a transaction falls below the thresholds, it reverts to early detection

Ananian and Rinard

Strong or Weak IsolationStrong
Transaction GranularityObject
Direct or Deferred UpdateDeferred (in-place)
Concurrency ControlOptimistic
SynchronizationNonblocking
Conflict DetectionEarly
Inconsistent ReadsInvalidation
Conflict ResolutionAbort conflicting transaction
Nested TransactionFlatten
ExceptionsTerminate or abort


Key features:
  • strong isolation
  • written in PROMELA, so it can be directly verified with the SPIN model checker
Each Java object is extended by two fields. The first, named versions, is a linked list containing a version record for each transaction that modified a field in the object. A version record identifies the transaction and records the updated value of each modified field. The second field, named readers, is a list of transactions that read a field in the object.

A novel aspect of this system is the use of a sentinel (signalling a conflicting access) value in a memory location to redirect read and write operations to the transactional data structures.

NT-read:
if it encounters sentinel, it aborts the transaction modifying the object containing the value, restores the field’s value from the most recently committed transaction’s record, clears versions list and re-reads location

NT-write aborts all transactions reading and writing the object and directly updates the field in the object. Readers and version lists are cleared. If the program is actually writing the sentinel value, the write instruction is treated as a short transaction to make difference with real sentinel

T-read first ensures that its transaction descriptor is on the object’s reader list. It aborts all uncommitted transactions that modified the object. After this, the transaction can read the field in the object and directly use any values other than the sentinel. The sentinel value, on the other hand, requires a search of the version records to find one with the same version and the updated value of the field. After this we can clear versions list.

A T-write aborts all other uncommitted transactions that read or wrote the object. It also creates, if none previously existed, a version object for the transaction. The next step is to copy the unmodified value in the field to all version records, including those of the committed transactions, so that the field can be restored if the transaction is rolled back. If the versions list does not contain a committed transaction, one must be created to hold this value. Finally, the new value can be written to the running transaction’s version record and the object field set to the sentinel value

This STM system aggressively resolves conflicting references to an object. A read aborts transactions in the process of modifying the object and a write aborts all transactions that accessed the object. In essence, the system implements a multireader, single-writer lock on an object.

RSTM

Strong or Weak IsolationWeak
Transaction GranularityObject
Direct or Deferred UpdateDeferred (cloned replacement)
Concurrency ControlOptimistic
SynchronizationNonblocking (obstruction-free)
Conflict DetectionEarly or late (selectable)
Inconsistent ReadsBounded invalidation
Conflict ResolutionConflict manager
Nested TransactionFlatten
Exceptions


Key features:
  • RSTM only uses a single level of indirection to an object, instead of the two levels used
    by previous systems such as DSTM
  • RSTM avoids dynamically allocating many of its data structures and contains its own memory collector, so it can work with nongarbage collected languages such as C++
  • RSTM uses invalidation to avoid inconsistent reads. It employs a new heuristic for
    tracking an object’s readers
In RSTM, every transactional object is accessed through an ObjectHeader, which points directly to the current version of the object. RSTM uses the low-order bit of the NewData field in this object as a flag

The TransactionDescriptor referenced through an object’s header determines the transaction’s state. If the transaction commits, then NewDataObject is the current version of the object. If the transaction aborts, then OldDataObject is the current version. If the transaction is active, no other transaction can read or write the object without aborting the transaction.

A transaction opens an object before accessing it:
  1. If opening the object for update, the transaction must first acquire the object with the following actions:
    (a) Read the object’s NewData pointer and make sure no other transaction owns it. If it is owned, invoke the contention manager.
    (b) Allocate the NewDataObject and copy values from the object’s current version.
    (c) Initialize the Owner and OldData pointers in the new object.
    (d) Use a CAS to atomically swap the pointer read in step (a) with a pointer to the newly allocated copy.
    (e) Add the object to transaction’s private write list.
    (f ) Iterate through the object’s visible reader list, aborting all transactions it contains.
  2. If opening the object for reading and space is available in the object’s visible readerlist, add the transaction to this list. If the list is full, add the object to the transaction’s private read list.
  3. Check the status word in the transaction’s descriptor, to make sure another transaction has not aborted it.
  4. Incrementally validate all objects on the transaction’s private read list.
Introducing visible reader list should have resulted in performance boost, because it reduces the cost of validation. But benchmarks show that visible readers are actually more costly, because of the extra cache traffic caused by updating the visible read table in each object

TL
Dice and Shavit described an STM system called transactional locking (TL), which combined deferred update with blocking synchronization


Strong or Weak IsolationWeak
Transaction GranularityObject, word or region
Direct or Deferred UpdateDeferred (in-place)
Concurrency ControlOptimistic
SynchronizationBlocking
Conflict DetectionEarly or late (selectable)
Inconsistent ReadsInconsistency toleration
Conflict ResolutionDelay and abort
Nested TransactionFlatten
Exceptions

Key features:
  • uses blocking to acquire an object
Each object is associated with a lock. A lock is either locked, and pointing to the transaction holding exclusive access, or unlocked, and recording the object’s version, which is incremented when a transaction updates the object.

A transaction maintains a read set and a write set. An entry in the read set contains the address of the object and the version number of the lock associated with the object. An entry in the write set contains the address of the object, the address of its lock, and the updated value of the object.

When the transaction executes a write, it first looks for the object’s entry in its write set. If
it is not present, the transaction creates an entry for the object. The write modifies the entry in the write set, not the actual object. The transaction does not acquire the lock.

A memory load first checks the write set, to determine if the transaction previously updated the object. If so, it uses the updated value from the write set. If the object is not in the set, the transaction adds the referenced object to the read set and attempts to read the actual object. If another transaction locked the object, the reading transaction can either delay and retry or abort itself.

When a transaction commits, it first acquires locks for all objects in its write set. A transaction will only wait a bounded amount of time for a lock to be released to avoid deadlock. After acquiring locks for the objects it modified, the transaction validates its read set. If successful, the transaction can complete by copying updated values into the objects, releasing the locks, and freeing its data structures.

Performance measurements on simple benchmarks showed that TL performed better than other STM systems, such as Harris and Fraser’s nonblocking WSTM system