Transactional Memory 101

Reading 7. HTM. Herlihy and Moss 1993. Transactional memory

2008-10-19T19:13:00.009+04:00

Herlihy and Moss, ISCA 1993: Transactional memory

This paper coined the term transactional memory, and identified the use of cache mechanisms for performing optimistic synchronization to multiple memory locations in a shared-memory multiprocessor. This was in contrast to the Oklahoma Update proposal that used reservation registers instead of a transactional cache.

Programming Interface
The proposal introduces six new instructions: load-transactional, load-transactional-exclusive, store-transactional, commit, abort, and validate (the names are largely self-describing). The programmer uses these instructions to build lock-free data structures and is responsible for saving register state and for ensuring forward progress. Transactions are expected to be short-lived and to complete within one scheduling quantum.

The following code sequence demonstrates the use of the new instructions to insert an element into a doubly linked list:


// Usage of new instructions to construct data structure
typedef struct list_elem {
  struct list_elem *next;
  struct list_elem *prev;
  int value;
} entry;

entry *Head, *Tail;

void Enqueue(entry* new) {
  entry *old_tail;
  unsigned backoff = BACKOFF_MIN;
  unsigned wait;
  new->next = new->prev = NULL;
  while (TRUE) {
    old_tail = (entry*) LOAD_TRANSACTIONAL_EXCLUSIVE(&Tail);
    if (VALIDATE()) { // ensure transaction still valid
      STORE_TRANSACTIONAL(&new->prev, old_tail);
      if (old_tail == NULL) {
        STORE_TRANSACTIONAL(&Head, new);
      } else {
        STORE_TRANSACTIONAL(&old_tail->next, new);
      }

      STORE_TRANSACTIONAL(&Tail, new);
      if (COMMIT()) // try to commit
        return;
    }
    
    wait = random() % (01 << backoff);
    while (wait--);
    if (backoff < BACKOFF_MAX)
      backoff++;
  }
}

Implementation

A regular CPU is extended with a small, fully associative transactional cache that holds only per-transaction data. The processor is also equipped with two flags that record the transaction state. The rest builds on existing machinery: the cache coherence protocol is extended to handle the new cache and flags, and each processor snoops the memory operations of the others. The paper gives a complete description of the protocol, including its state machine.

Because the transactional cache is limited in size, a transaction's data set can grow too large to fit in it. For that case the authors propose emulating a larger cache in software. This allows using hardware for the common case and software for the exceptional case.
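To make the deferred-update idea concrete, here is a minimal single-threaded model in C of a transactional cache with two status flags. It is only a sketch under stated assumptions: the names (tx_cache, tx_load, tx_store, remote_store, TX_CAPACITY) are invented for illustration and are not taken from the paper, remote_store stands in for coherence traffic from another processor, and overflow simply aborts instead of falling back to software emulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_CAPACITY 8   /* assumed size of the modeled transactional cache */

typedef struct {
    int *addr;      /* memory location touched by the transaction */
    int  value;     /* tentative value, invisible to others until commit */
    bool written;   /* true if the entry holds a tentative store */
} tx_entry;

typedef struct {
    tx_entry entries[TX_CAPACITY];
    size_t   count;
    bool     active;   /* flag 1: a transaction is in progress */
    bool     aborted;  /* flag 2: the transaction has been aborted */
} tx_cache;

void tx_begin(tx_cache *tx) {
    tx->count = 0;
    tx->active = true;
    tx->aborted = false;
}

tx_entry *tx_find(tx_cache *tx, int *addr) {
    for (size_t i = 0; i < tx->count; i++)
        if (tx->entries[i].addr == addr) return &tx->entries[i];
    return NULL;
}

/* Returns the entry for addr, allocating one if needed; overflow aborts. */
tx_entry *tx_touch(tx_cache *tx, int *addr) {
    tx_entry *e = tx_find(tx, addr);
    if (e) return e;
    if (tx->count == TX_CAPACITY) { tx->aborted = true; return NULL; }
    e = &tx->entries[tx->count++];
    e->addr = addr;
    e->value = *addr;
    e->written = false;
    return e;
}

int tx_load(tx_cache *tx, int *addr) {
    assert(tx->active);
    tx_entry *e = tx_touch(tx, addr);
    return e ? e->value : *addr;
}

void tx_store(tx_cache *tx, int *addr, int value) {
    assert(tx->active);
    tx_entry *e = tx_touch(tx, addr);
    if (e) { e->value = value; e->written = true; }
}

/* Models a store by another processor: a conflict is detected early. */
void remote_store(tx_cache *tx, int *addr, int value) {
    *addr = value;
    if (tx->active && tx_find(tx, addr)) tx->aborted = true;
}

bool tx_validate(const tx_cache *tx) { return !tx->aborted; }

/* Installs tentative stores only if the transaction was never aborted. */
bool tx_commit(tx_cache *tx) {
    bool ok = !tx->aborted;
    if (ok)
        for (size_t i = 0; i < tx->count; i++)
            if (tx->entries[i].written)
                *tx->entries[i].addr = tx->entries[i].value;
    tx->active = false;
    return ok;
}
```

A transaction's stores stay in the buffer until tx_commit succeeds, so an aborted transaction leaves memory untouched; this mirrors the "Deferred (in cache)" entry in the classification.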

Classification

Strong or weak isolation	Strong
Transaction granularity	Cache line
Direct or deferred update	Deferred (in cache)
Concurrency control	Optimistic
Conflict detection	Early
Inconsistent reads	Yes
Conflict resolution	Receiver NACKs/Requestor software backoff
Nested transaction	N/A

Materials

J.R. Larus, R.Rajwar Transactional Memory, 4.3.5
M. Herlihy and J. E. B. Moss, “Transactional memory: Architectural support for lock-free data structures,” In Proc. 20th Annu. Int. Symp. on Computer Architecture, pp. 289–300 May 1993

Reading 8. Concurrent Haskell and HSTM

2008-05-30T02:34:00.076+04:00

1. Gentle introduction to IO Monad in Haskell

The discussion of the Software Transactional Memory model in Haskell that follows makes little sense without first covering the concepts of Concurrent Haskell.
The main ideas of Concurrent Haskell are described in the paper Concurrent Haskell by Simon Peyton Jones, Andrew Gordon and Sigbjorn Finne. Concurrent Haskell is a simple extension of Haskell, a pure, lazily evaluated functional language. It adds two main new ingredients to Haskell:

processes, and a mechanism for process initiation

atomically-mutable state, to support inter-process communication and cooperation

At first it may seem strange to talk about concurrency in a pure functional language like Haskell, because concurrency presupposes mutable entities and state. The most common way to emulate mutable state in Haskell is to wrap the result of a computation in a monad. From this point of view a program becomes a sequence of monadic values, each built from the previous one. This approach also makes it possible to impose a strict order on computations, which is otherwise not obvious in lazily evaluated Haskell.
Let's recall what a monad is. Strictly speaking, a monad is a triple (M, return, >>=), where M is a type constructor, return is the operator that wraps a value into the monad, and >>= (the bind operator) sequences monadic computations. Their types are, respectively:


-- M is a type constructor: M a is a computation that yields a value of type a
(>>=)  :: M a -> (a -> M b) -> M b
return :: a -> M a

In some cases it is convenient to use a reduced bind operator:


(>>) :: M a -> M b -> M b

In other words, it ignores the value wrapped in its first argument.
The most common examples of monads in Haskell are [], Maybe and IO. The last is the most interesting for our purposes. In a non-strict language it is completely impractical to perform input/output using "side-effecting functions", because the order in which sub-expressions are evaluated (and indeed whether they are evaluated at all) is determined by the context in which the result of the expression is used, and hence is hard to predict. This difficulty can be addressed by treating an I/O-performing computation as a state transformer; that is, a function that transforms the current state of the world to a new state. In addition, we need the ability for an I/O-performing computation to return a result. This reasoning leads to the following type definition:


type IO a = World -> (a, World)

That is, a value of type IO t takes a world state as input and delivers a modified world state together with a value of type t. Of course, the implementation performs the I/O right away, thereby modifying the state of the world "in place".
We call a value of type IO t an action. Here are two useful complete actions:


hGetChar :: Handle -> IO Char
hPutChar :: Handle -> Char -> IO ()

The action hGetChar reads a character from the specified handle (which identifies some file or other byte stream) and returns it as the result of the action. hPutChar takes a handle and a character and returns an action that writes the character to the specified stream.
Actions can be combined in sequences using infix combinators >> and >>= described above. For example here is an action that reads a character from the standard input, and then prints it twice to the standard output:


hGetChar stdin        >>= \c ->
hPutChar stdout c    >>
hPutChar stdout c

The notation \c->E, for some expression E, denotes a lambda abstraction. In Haskell, the scope of a lambda abstraction extends as far to the right as possible; in this example the body of the \c-abstraction includes everything after the \c. The sequencing combinators, >> and >>=, feed the result state of their left hand argument to the input of their right hand argument, thereby forcing the two actions (via the data dependency) to be performed in the correct order. The combinator >> throws away the result of its first argument, while >>= takes the result of its first argument and passes it on to its second argument. The similarity of monadic I/O-performing programs to imperative programs is no surprise: when performing I/O we specifically want to impose a total order on I/O operations.
It is often also useful to have an action that performs no I/O, and immediately returns a specified value using return operator. For example, an echo action that reads a character, prints it and returns the character read, might look like this:


echo :: IO Char
echo = hGetChar stdin    >>= \c ->
       hPutChar stdout c >>
       return c

A complete program that can be compiled and run might then look like this:


main :: IO ()
main = echo >>= \c ->
       if c == '\n'
         then return ()
         else main

In principle, then, a program is just a state transformer that is applied to the real world to give a new world. In practice, however, it is crucial that the side-effects the program specifies are performed incrementally, and not all at once when the program finishes.

2. Processes in Haskell

Concurrent Haskell provides a new primitive forkIO, which starts a concurrent process:


forkIO :: IO () -> IO ()

forkIO is an action which takes an action, a, as its argument and spawns a concurrent process to perform that action. The I/O and other side effects performed by a are interleaved in an unspecified fashion with those that follow the forkIO. Here's an example:


let
  -- loop ch prints an infinite sequence of ch's
  loop ch = hPutChar stdout ch >> loop ch
in
  forkIO (loop 'a') >>
  loop 'z'

The forkIO spawns a process which performs the action loop 'a'. Meanwhile, the "parent" process continues on to perform loop 'z'. The result is that an infinite sequence of interleaved 'a's and 'z's appears on the screen. The exact interleaving is unspecified.
As a more realistic example of forkIO in action, a mail tool might incorporate the following loop:


mailLoop
  = getButtonPress b >>= \ v ->
    case v of
      Compose -> forkIO doCompose >>
                 mailLoop
      ...other things

doCompose :: IO ()        -- Pop up and manage
doCompose = ...           -- composition window

Here, getButtonPress is much like hGetChar; it awaits the next button press on button b, and then delivers a value indicating which button was pressed. This value is then examined by the case expression. If its value is Compose, then the action doCompose is forked to handle an independent composition window, while the main process continues with the next getButtonPress.
There are some interesting questions related to concurrency in Haskell.

1. Suppose we have to evaluate a value named 'c'. In ordinary Haskell this value is represented internally by a pointer to a closure (a thunk), which is entered and computes the value only when someone needs it. That is Haskell's famous laziness.
Now each of the concurrent processes may need this value. The first process to demand 'c' replaces the closure with a temporary marker object indicating that 'c' is currently being evaluated. Other processes that need 'c' wait until the evaluation ends.

2. Since the parent and child processes may both mutate (parts of) the same shared state (namely, the world), forkIO immediately introduces non-determinism. For example, if one process decides to read a file, and the other deletes it, the effect of running the program will be unpredictable. While this non-determinism is not desirable, it is not avoidable; indeed, every concurrent language is non-deterministic. The only way to enforce determinism would be by somehow constraining the two processes to work on separate parts of the state (different files, in our example). The trouble is that essentially all the interesting applications of concurrency involve the deliberate and controlled mutation of shared state, such as screen real estate, the file system, or the internal data structures of the program. The right solution, therefore, is to provide mechanisms which allow (though alas they cannot enforce) the safe mutation of shared state.

3. forkIO is asymmetrical: when a process executes a forkIO, it spawns a child process that executes concurrently with the continued execution of the parent. It would have been possible to design a symmetrical fork, an approach taken by Jones & Hudak:


symFork :: IO a -> IO b -> IO (a,b)

The idea here is that symFork p1 p2 is an action that forks two processes, p1 and p2. When both complete, symFork pairs their results together and returns this pair as its result. The Concurrent Haskell authors rejected this approach because it forces the parent to synchronize on the termination of the forked process. If the desired behavior is that the forked process lives as long as it desires, then the whole of the rest of the parent has to be provided as the other argument to symFork, which is extremely inconvenient.

3. Synchronization and communication

To let processes interact and synchronize with each other, we introduce a special type


type MVar a

This is a simple memory cell that either contains a value of type a or is empty. The following primitive operations are defined on MVars:


newMVar :: IO (MVar a)

creates a new, empty MVar.


takeMVar :: MVar a -> IO a

blocks until the location is non-empty, then reads and returns the value, leaving the location empty.


putMVar :: MVar a -> a -> IO ()

writes a value into the specified location. If there are one or more processes blocked in takeMVar on that location, one is thereby allowed to proceed. It is an error to perform putMVar on a location which already contains a value.

Notice that an MVar can be viewed in different ways:
as a channel for exchanging messages between processes
as a simple semaphore, in the case of MVar (): putMVar raises (signals) it and takeMVar lowers (waits on) it

With MVars we can now solve a simple "Producer and Consumer" problem for the case where the producer produces faster than the consumer can take. To do this we build a buffered slot, CVar, from which the consumer takes values.


type CVar a = (MVar a,    -- Producer -> consumer
               MVar ())   -- Consumer -> producer

newCVar :: IO (CVar a)
newCVar
  = newMVar            >>= \ data_var ->
    newMVar            >>= \ ack_var ->
    putMVar ack_var () >>
    return (data_var, ack_var)

putCVar :: CVar a -> a -> IO ()
putCVar (data_var,ack_var) val
  = takeMVar ack_var >>
    putMVar data_var val

getCVar :: CVar a -> IO a
getCVar (data_var,ack_var)
  = takeMVar data_var  >>= \ val ->
    putMVar ack_var () >>
    return val

4. Haskell Software Transactional Memory (HSTM)

The implementation of transactional memory in Haskell resembles the IO abstraction. It introduces terms of a special monadic type that represent atomic blocks. It also adds a mechanism for composing two transactions as alternatives. The main characteristics of HSTM are:

Strong or Weak Isolation	Strong
Transaction Granularity	Word
Direct or Deferred Update	Deferred (cloned replacement)
Concurrency Control	Optimistic
Synchronization	Blocking
Conflict Detection	Late
Inconsistent Reads	Inconsistency toleration
Conflict Resolution	None
Nested Transaction	None (not allowed by type system)
Exceptions	Abort

So, we add a new type of term to the language: the STM monad. Its semantics are similar to those of the IO monad, but it "marks" the terms that will be treated as atomic blocks. To run such a term as an ordinary atomic block inside IO, we use the function atomic.


atomic :: STM a -> IO a

By analogy with the MVar type in plain Concurrent Haskell, we also introduce the TVar type: a transactional variable, which can be read and written only inside a transaction.


type TVar a
readTVar :: TVar a -> STM a
writeTVar :: TVar a -> a -> STM ()

For example, consider such a variable holding an integer, together with an operation on it.


type Resource = TVar Int
putR :: Resource -> Int -> STM ()
putR r i = do { v <- readTVar r; writeTVar r (v+i) }

The atomic function transactionally commits (or aborts) these updates:


main = do { ...; atomic (putR r 3); ... }

HSTM also introduced an explicit retry statement as a coordination mechanism between transactions. The retry statement aborts the current transaction and prevents it from reexecuting until at least one of the TVars accessed by the transaction changes value. For example,


getR :: Resource -> Int -> STM ()
getR r i = do { v <- readTVar r
                  ; if (v < i) then retry
                     else writeTVar r (v-i) }

atomically extracts i units from a Resource. It uses a retry statement to abort an enclosing transaction if the Resource does not contain enough units. If this function executes retry, r is the only TVar read, so the transaction re-executes when r changes value.
HSTM also introduced the binary orElse operator for composing two transactions. This operator first starts its left-hand transaction. If this transaction commits, the orElse operator finishes. However, if this transaction retries, the operator tries the right-hand transaction instead. If this one commits, the orElse operator finishes. If it retries, the entire orElse statement waits for changes in the set of TVars read by both transactions before retrying. For example, this operator turns getR into an operation that returns a true/false success/failure result:


nonBlockGetR :: Resource -> Int -> STM Bool
nonBlockGetR r i = do { getR r i ; return True } `orElse` return False

Notice that retry "retries" the largest enclosing term of STM type.
A few words about the implementation. All of a transaction's reads and writes to TVars go through a private transaction log, which hides these accesses from other transactions. When the transaction commits, it first validates its log entries, to ensure that no other transaction modified the TVars' values. If valid, the transaction installs the new values in these variables. If validation fails, the log is discarded and the transaction re-executed.
If a transaction invokes retry, the transaction is validated (to avoid retries caused by inconsistent execution) and the log discarded after recording all TVars read by the transaction. The system binds the transaction’s thread to each of these variables. When a transaction updates one of these variables, it also restarts the thread, which re-executes the transaction.
The orElse statement requires a closed nested transaction to surround each of the two alternatives, so that either one can abort without terminating the surrounding transaction. If either transaction completes successfully, its log is merged with the surrounding transaction’s log, which can commit. If either or both transactions invoke retry, the outer transaction waits on the union of the TVars read by the transactions that retried.
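The log-based commit described above can be sketched in C as follows. This is a single-threaded illustration only, with several assumptions: the names (tvar, tx_log, read_tvar, write_tvar, commit) and the fixed LOG_CAPACITY are made up for the sketch, validation compares the logged value against the current one, and a real runtime would have to make the validate-and-install step atomic with respect to other threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOG_CAPACITY 16   /* assumed bound, for this sketch only */

typedef struct { int value; } tvar;   /* stand-in for a TVar holding an Int */

typedef struct {
    tvar *var;
    int   seen;      /* value observed at first access */
    int   pending;   /* value to install at commit */
    bool  written;
} log_entry;

typedef struct { log_entry e[LOG_CAPACITY]; size_t n; } tx_log;

/* Finds the log entry for v, creating it on first access. */
log_entry *log_lookup(tx_log *tl, tvar *v) {
    for (size_t i = 0; i < tl->n; i++)
        if (tl->e[i].var == v) return &tl->e[i];
    assert(tl->n < LOG_CAPACITY);
    log_entry *e = &tl->e[tl->n++];
    e->var = v;
    e->seen = v->value;
    e->pending = v->value;
    e->written = false;
    return e;
}

/* Reads go through the log, so a transaction sees its own writes. */
int read_tvar(tx_log *tl, tvar *v) { return log_lookup(tl, v)->pending; }

/* Writes are recorded in the log only; the tvar itself is untouched. */
void write_tvar(tx_log *tl, tvar *v, int x) {
    log_entry *e = log_lookup(tl, v);
    e->pending = x;
    e->written = true;
}

/* Valid if every variable still holds the value seen at first access. */
bool log_validate(const tx_log *tl) {
    for (size_t i = 0; i < tl->n; i++)
        if (tl->e[i].var->value != tl->e[i].seen) return false;
    return true;
}

/* Validates, installs pending writes on success, then discards the log. */
bool commit(tx_log *tl) {
    bool ok = log_validate(tl);
    if (ok)
        for (size_t i = 0; i < tl->n; i++)
            if (tl->e[i].written) tl->e[i].var->value = tl->e[i].pending;
    tl->n = 0;
    return ok;
}
```

When commit returns false the caller would re-execute the transaction body with a fresh log, which corresponds to the "log is discarded and the transaction re-executed" step.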

5. Garbage Collection in Concurrent Haskell

Finally, a few words about garbage collection in Concurrent Haskell. Collecting processes that have become "useless" is an interesting problem. There is an obvious strategy if we can ensure that the process we want to collect will have no further side effects. Two rules capture this:

A running process cannot be collected.
A process blocked on an MVar can be collected if that MVar is no longer accessible from any other non-garbage process.

The classic "mark-and-sweep" tracing procedure can then be extended to processes:

When tracing accessible heap objects, treat all runnable processes as roots.
When some MVar is identified as reachable, identify all processes blocked on it as reachable too.

List of used papers:

Simon Peyton Jones, Andrew Gordon, Sigbjorn Finne “Concurrent Haskell”
Paul Hudak, Simon Peyton Jones, Philip Wadler, Brian Boutel, Jon Fairbairn, Joseph Fasel, María M. Guzmán, Kevin Hammond, John Hughes, Thomas Johnsson, Dick Kieburtz, Rishiyur Nikhil, Will Partain, John Peterson “Report on the programming language Haskell: a non-strict, purely functional language version 1.2”

Reading 7. HTM. Multiple atomic read-write operations

2008-05-28T19:50:00.005+04:00

Stone et al., IEEE Concurrency 1993: Multiple atomic read-write operations ("Oklahoma Update")

Let's extend the compare&swap instruction set to support multiple memory locations.

read-and-reserve - reads a memory location into a specified general-purpose register, places a reservation on the location's address in the reservation register, and clears the reservation register's data field.
store-contingent - locally updates the reservation register's data field without obtaining write permissions.
write-if-reserved - specifies a set of reservation registers and updates the memory locations reserved by those registers. It is used to initiate the commit process. It attempts to obtain exclusive ownership for each of the addresses in the reservation registers. If the reservations remain valid during this process, the instruction updates memory with the modified data from the reservation registers. The instruction returns an indication of whether the update succeeded.

Example

With these instructions we can write, for example, a synchronized queue:

void Enqueue(newpointer) {
 Memory[newpointer].next = NULL;
 status = 0;

 while (!status) {
   last_pointer = Read_and_Reserve(Memory[tail].next, reservation1);
   if (last_pointer == NULL) {
     // this is an empty queue
     first_pointer = Read_and_Reserve(Memory[head].next, reservation2);
     Store_Contingent(newpointer, reservation1);
     Store_Contingent(newpointer, reservation2);
     status = Write_If_Reserved(reservation1, reservation2);
   } else {
     // non-empty queue
     temp_pointer = Read_and_Reserve(Memory[last_pointer].next, reservation2);
     Store_Contingent(newpointer, reservation1);
     Store_Contingent(newpointer, reservation2);
     status = Write_If_Reserved(reservation1, reservation2);
   }
 } // repeat until successful

 return;
}

Implementation

Compared to the compare&swap decomposition, the hardware implementation of the Oklahoma Update is much more complex.
Instead of a single xa-address register, this implementation requires a number of reservation registers per CPU to hold memory locations and various flags. These registers are what read-and-reserve and store-contingent write to. The write-if-reserved instruction can be called the commit operation, since it commits all local writes to shared memory.

Basically, the commit operation consists of two phases:

Requesting write permissions. The CPU obtains exclusive write ownership of every location held in its reservation registers. To avoid deadlock with commit operations on other processors, the CPU acquires ownership in ascending address order. If it cannot obtain ownership of every location, conflict resolution takes place: the CPU may restart the process after some time or abort the operation.

Committing data values. As soon as the processor holds ownership of all locations, it starts an uninterruptible operation that writes the data.

Interactions between the new instructions and regular store operations introduce forward-progress concerns. Regular stores do not participate in the new instructions’ conflict resolution mechanism. If a regular store from one processor conflicted with an address specified in a reservation register of another processor, this processor would abort its update. See more on this in the original paper.
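The three instructions and the two-phase commit can be modeled in a few lines of C. This is a hedged, single-threaded sketch: the reservation struct and the function names are illustrative, "acquiring ownership" is reduced to checking that a reservation is still valid, and conflicting_store stands in for a store by another processor that breaks a reservation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int *addr;      /* reserved memory location */
    int  data;      /* data field, filled by store_contingent */
    bool has_data;  /* cleared by read_and_reserve */
    bool valid;     /* reservation is still intact */
} reservation;

int read_and_reserve(reservation *r, int *addr) {
    r->addr = addr;
    r->has_data = false;
    r->valid = true;
    return *addr;
}

/* Purely local: memory is not touched and no ownership is requested. */
void store_contingent(reservation *r, int data) {
    r->data = data;
    r->has_data = true;
}

/* Models a store by another processor that hits a reserved address. */
void conflicting_store(reservation *regs, size_t n, int *addr, int value) {
    *addr = value;
    for (size_t i = 0; i < n; i++)
        if (regs[i].valid && regs[i].addr == addr) regs[i].valid = false;
}

bool write_if_reserved(reservation *regs, size_t n) {
    assert(regs != NULL || n == 0);
    /* Phase 1: visit reservations in ascending address order. */
    for (size_t i = 1; i < n; i++) {
        reservation key = regs[i];
        size_t j = i;
        while (j > 0 && (uintptr_t)regs[j - 1].addr > (uintptr_t)key.addr) {
            regs[j] = regs[j - 1];
            j--;
        }
        regs[j] = key;
    }
    bool ok = true;
    for (size_t i = 0; i < n; i++)
        if (!regs[i].valid) ok = false;
    /* Phase 2: write all data values, but only if every reservation held. */
    if (ok)
        for (size_t i = 0; i < n; i++)
            if (regs[i].has_data) *regs[i].addr = regs[i].data;
    for (size_t i = 0; i < n; i++) regs[i].valid = false;
    return ok;
}
```

Sorting by address before acquiring is the deadlock-avoidance rule from phase 1: two processors committing overlapping sets always contend for the lowest shared address first.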

Classification

Strong or weak isolation	Strong
Transaction granularity	Cache line
Direct or deferred update	Deferred (in reservation registers)
Concurrency control	Optimistic. Commit initiates acquiring ownership
Conflict detection	Late write-write conflict (if not a regular store); late write-read conflict (if not a regular store)
Inconsistent reads	None
Conflict resolution	Address-based two-phase commit
Nested transaction	N/A

Sources

J.R. Larus, R.Rajwar Transactional Memory, 4.3.4
J. M. Stone, H. S. Stone, P. Heidelberger and J. Turek, “Multiple reservations and the Oklahoma update,” IEEE Concurrency, Vol. 1(4), pp. 58–71 Nov. 1993.

Reading 7. HTM. Compare&Swap decomposition

2008-05-28T19:20:00.005+04:00

Jensen, Hagensen, and Broughton, UCRL 1987: support for optimistic synchronization on a single memory location.

A complex compare&swap instruction can be split into simpler parts:

sync load. Records the address of one memory location in a special register (called xa-address), which indicates that the processor requires exclusive access to this address, and loads the accessed data into a general register.
sync store. Stores data to the address previously saved in xa-address, provided that no other processor has stored anything there (i.e. no conflict occurred) or that conflict resolution decided in favor of this processor.
sync clear. Clears xa-address.

Simple locking can be implemented with these instructions as follows:

// if (lock == 0) { lock = ProcessID; } % atomically
// else goto LockHeld...                % lock was held
Retry:
sync_load R10, lock           ; declare exclusive intent
jump_q .neq (R10,0), LockHeld ; test for zero
sync_clear                    ; lock non-zero, hence abort
load R10, ProcessID           ; prepare to update lock
sync_store R10, lock          ; update lock if not aborted
goto Retry                    ; try the update again
MyLock:                         ; got the lock
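For comparison, the intent stated in the comment at the top of the listing (set lock to ProcessID only if it is currently 0, atomically) can be written with C11 atomics. This is a sketch, not the paper's mechanism: a compare-exchange stands in for the sync_load/sync_store pair, and the function names try_lock and release_lock are chosen here for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true if the lock was free (0) and is now owned by process_id. */
bool try_lock(atomic_int *lock, int process_id) {
    assert(process_id != 0);   /* 0 is reserved to mean "free" */
    int expected = 0;
    /* Atomically: if (*lock == 0) { *lock = process_id; } */
    return atomic_compare_exchange_strong(lock, &expected, process_id);
}

/* Releases the lock by storing 0 again. */
void release_lock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```

A caller that needs to block would call try_lock in a loop, ideally with backoff between attempts, which plays the role of the Retry label.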

Implementation

The sync_store instruction broadcasts the write operation to all processors, thereby detecting conflicts with other processors that hold the same address in xa-address. If such a conflict is detected, one processor clears its own xa-address or aborts its store operation (depending on the conflict resolution scheme).

Real-world usage

MIPS, Alpha, and PowerPC implemented variations of this scheme (their load-linked/store-conditional instruction pairs).

Classification

Strong or weak isolation		Strong
Transaction granularity		Cache line
Direct or deferred update		Conditional direct store (single word)
Concurrency control		Optimistic (single word)
Conflict detection		N/A
Inconsistent reads		None
Conflict resolution		Processor id or first to request ownership
Nested transaction		N/A

Sources

J.R. Larus, R.Rajwar Transactional Memory, 4.3.3
E. H. Jensen, G. W. Hagensen and J. M. Broughton, “A new approach to exclusive data access in shared memory multiprocessors,” Lawrence Livermore National Laboratory, Technical Report UCRL-97663, Nov. 1987.

Reading 7. HTM. A side story

2008-05-28T19:06:00.007+04:00

Knight 1986: parallelizing single-threaded programs

A compiler divides a program into a series of code blocks called transactions. For doing the division, the compiler assumes that these transactions do not have memory dependencies. These blocks then execute optimistically on the processors. The hardware enforces correct execution and uses caches to detect when a memory dependence violation between threads occurs.

This is the first paper that proposed using caches and cache coherence to maintain ordering among speculatively parallelized regions of sequential code in the presence of unknown memory dependencies. While the paper did not directly address explicitly parallel programming, it laid the groundwork for using caches and coherence protocols in future transactional memory proposals.

Implementation

First, the compiler divides a program into a sequence of "mostly independent" blocks (transactions). Then these blocks run on a shared-memory multiprocessor, each processor with its own register state. All memory writes are held in a confirm cache. Every processor also records all of its reads in a special dependency cache. Writes by other processors are detected as well and used to update the dependency cache (so-called bus snooping).

All transactions are committed one by one, in the order prescribed by the original single-threaded program. The stored dependencies are used to detect conflicts with committed writes. If one occurs, the failed transaction simply runs again.
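The commit-time check can be illustrated with a small C sketch. Everything here is an assumption made for illustration: addresses are modeled as plain cache-line numbers, each block carries its recorded reads (dependency cache) and buffered writes (confirm cache) as arrays, and the names block, overlaps and commit_block are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const unsigned *reads;   /* dependency cache: lines this block has read */
    size_t          n_reads;
    const unsigned *writes;  /* confirm cache: lines this block has written */
    size_t          n_writes;
    bool            must_rerun;
} block;

/* True if any line appears in both sets. */
bool overlaps(const unsigned *a, size_t na, const unsigned *b, size_t nb) {
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) return true;
    return false;
}

/* Commits blocks[committing]; blocks are indexed in program order.
 * Any later block that already read a line written here consumed a
 * stale value and is marked for re-execution. */
void commit_block(block *blocks, size_t n, size_t committing) {
    assert(committing < n);
    const block *c = &blocks[committing];
    for (size_t later = committing + 1; later < n; later++)
        if (overlaps(c->writes, c->n_writes,
                     blocks[later].reads, blocks[later].n_reads))
            blocks[later].must_rerun = true;
}
```

Because commits happen strictly in program order, only blocks later in the sequence can be invalidated, which is why the conflict resolution entry in the classification is simply "program order".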

Classification

Strong or weak isolation		Strong
Transaction granularity		Cache line
Direct or deferred update		Deferred (in cache)
Concurrency control		Optimistic. Commit serialized globally
Conflict detection		Late write-write conflict Late write-read conflict
Inconsistent reads		No
Conflict resolution		Program order (sequential program)
Nested transaction		N/A

Sources

J.R. Larus, R.Rajwar Transactional Memory, 4.3.2
T. F. Knight, “An architecture for mostly functional languages,” In Proc. ACM Lisp and Functional Programming Conference, pp. 105–112 Aug. 1986

Reading 6. Lock Elision

2008-04-25T00:29:00.007+04:00

Why do we need it?

In multithreaded programs, synchronization mechanisms (usually locks) are often used to guarantee that threads have exclusive access to shared data for a critical section of code. A thread acquires the lock, executes its critical section, and releases the lock. The key insight is that locks do not always have to be acquired for a correct execution. The following code demonstrates this:

Thread 1

LOCK(hash_tbl.lock)
var = hash_tbl.lookup(X)
if (!var)
  hash_tbl.add(X);
UNLOCK(hash_tbl.lock)

Thread 2

LOCK(hash_tbl.lock)
var = hash_tbl.lookup(Y)
if (!var)
  hash_tbl.add(Y);
UNLOCK(hash_tbl.lock)

As we can see, this code can be executed without any lock, because the threads update different parts of the shared object.
Among the trade-offs of multithreaded programming, one can point out:

Conservative locking. To ensure correctness, programmers rely on conservative locking, often at the expense of performance.
Granularity. Careful program design must choose appropriate levels of locking to optimize the tradeoff between performance and ease of reasoning about program correctness.
Thread-unsafe libraries. If a thread calls a library not equipped to deal with threads, a global locking mechanism is used to prevent conflicts, which again costs performance.

The solution is obvious now:

Programmers use frequent and conservative synchronization to write correct multithreaded programs.
Synchronization instructions are predicted as being unnecessary and elided.
No system-level modification is needed, only hardware support.

Speculative Lock Elision

How is a lock constructed? Let's look at typical code for lock acquire and release using the Alpha ISA:

L1: ldl   t0, 0(t1)    # t0 = lock
    bne   t0, L1       # if not free, goto L1
    ldl_l t0, 0(t1)    # load locked, t0 = lock
    bne   t0, L1       # if not free, goto L1
    lda   t0, 1(0)     # t0 = 1
    stl_c t0, 0(t1)    # conditional store, lock = 1
    beq   t0, L1       # if stl_c failed, goto L1

    ...                # critical section

    stl   0, 0(t1)     # lock = 0, release lock

ldl_l/stl_c is a pair of so-called LOAD-LOCKED/STORE-CONDITIONAL (ll/sc) instructions. ll reads the data, and sc writes only if nobody has written to the location since it was read.

As was said before, the key observation is that a lock does not always have to be acquired for a correct execution if hardware can provide the appearance of atomicity for all memory operations within the critical section. If a data conflict occurs, i.e., two threads compete for the same data other than for reading, atomicity cannot be guaranteed and the lock needs to be acquired. Data conflicts among threads are detected using the existing cache coherence protocol.

To see how the critical section is identified, look at the instruction listing above. Lock elision can be done by simply observing load and store sequences and the values read and to be written. A filter is used to determine candidate load/store pairs. For example, in this implementation only the instructions ldl_l and stl_c (which normally occur as a pair) are considered.

So the algorithm is the following:

1. If a candidate load (ldl_l) to an address is followed by a store (the stl_c of the lock acquire) to the same address, predict that another store (the lock release) will shortly follow, restoring the memory location to the value it had prior to this store (the stl_c of the lock acquire).
2. Predict that memory operations in the critical section will occur atomically, and elide the lock acquire.
3. Execute the critical section speculatively and buffer the results.
4. If hardware cannot provide atomicity, trigger a misspeculation, recover, and explicitly acquire the lock.
5. If the second store (the lock release) of step 1 is seen, atomicity was not violated (else a misspeculation would have been triggered earlier). Elide the lock-release store, commit state, and exit the speculative critical section.
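The prediction in the steps above boils down to recognizing a pair of stores that cancel each other out. The following C sketch models only that recognition logic; the sle_state struct and the function names are hypothetical, nesting is not handled, and buffering and recovery are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SLE_IDLE, SLE_SPECULATING } sle_mode;

typedef struct {
    sle_mode    mode;
    const void *lock_addr;     /* address of the elided lock */
    int         value_before;  /* lock value observed by the load-locked */
} sle_state;

/* Called when a store-conditional follows a load-locked to the same
 * address: elide the acquire and start speculating. */
void sle_on_acquire(sle_state *s, const void *addr, int loaded_value) {
    assert(s->mode == SLE_IDLE);   /* this sketch does not model nesting */
    s->mode = SLE_SPECULATING;
    s->lock_addr = addr;
    s->value_before = loaded_value;
}

/* Called for each store seen while speculating. Returns true when the
 * store is the predicted release, i.e. it would restore the lock to its
 * pre-acquire value; the caller then elides it and commits. */
bool sle_on_store(sle_state *s, const void *addr, int value) {
    if (s->mode == SLE_SPECULATING &&
        addr == s->lock_addr && value == s->value_before) {
        s->mode = SLE_IDLE;
        return true;
    }
    return false;
}

/* Called when atomicity cannot be provided; the caller discards the
 * buffered state and re-executes with the lock explicitly acquired. */
void sle_on_misspeculation(sle_state *s) {
    s->mode = SLE_IDLE;
}
```

Since neither the acquire nor the release store is ever performed, the lock word stays in its free state for the whole critical section and other threads can elide the same lock concurrently.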

Some words about implementing SLE

To recover from an SLE misspeculation, register and memory state must be buffered until SLE is validated.
Two techniques for handling register state:

Reorder buffer (ROB): each instruction is stored in the buffer in program order, together with a slot for its result. Note that the size of the ROB places a limit on the size of the critical section (for more details on the ROB see J. E. Smith and A. R. Pleszkun, Implementation of Precise Interrupts in Pipelined Processors).
Register checkpoint: a copy is made at the moment the lock is elided. This may be a copy of the dependence maps or of the architected register state itself. Instructions can then safely update the register file, speculatively retire, and be removed from the ROB, because a correct architected register checkpoint exists for recovery.

To store the memory state:
Although most modern processors support speculative load execution, they do not retire stores speculatively (i.e., write to the memory system speculatively). For supporting SLE, we augment existing processor write-buffers (between the processor and L1 cache) to buffer speculative memory updates. Speculative data is not committed from the write-buffer into the lower memory hierarchy until the lock elision is validated. On a misspeculation, speculative entries in the write-buffer are invalidated.

Conflicts are detected using the cache coherence protocol. An invalidation-based coherence protocol guarantees an exclusive copy of the memory block in the local cache when a store is performed. What is still needed is a mechanism to record the memory addresses read and written within the critical section:

If the ROB approach is used for SLE, no additional mechanisms are required for tracking external writes to memory locations speculatively read--the LSQ is already snooped.
If the register checkpoint approach is used, the LSQ alone cannot be used to detect load violations for SLE because loads may speculatively retire and leave the ROB. In this case, the cache can be augmented with an access bit for each cache block. Every memory access executed during SLE marks the access bit of the corresponding block. On an external request, this bit is checked in parallel with the cache tags. Any external invalidation to a block with its access bit set, or an external request to a block in exclusive state with its access bit set, triggers a misspeculation. The bit can be stored with the tags and there is no overhead for the bit comparison because, for maintaining coherency, a tag lookup is already performed to snoop the cache.
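The access-bit rule for the register checkpoint case can be written down in a few lines. This is a sketch only: TrackedCache and its method names are hypothetical, and a real cache checks the bit in hardware alongside the tag lookup.

```python
class TrackedCache:
    """Toy L1 cache that records which blocks were touched during SLE."""

    def __init__(self):
        self.access_bit = set()   # blocks whose access bit is set
        self.exclusive = set()    # blocks held in exclusive state

    def speculative_access(self, block, is_store=False):
        self.access_bit.add(block)
        if is_store:
            # A store needs an exclusive copy under an invalidation protocol.
            self.exclusive.add(block)

    def external_request(self, block, is_invalidation):
        """Return True if the external request forces a misspeculation."""
        if block not in self.access_bit:
            return False
        return is_invalidation or block in self.exclusive

    def end_speculation(self):
        # On commit or recovery all access bits are cleared.
        self.access_bit.clear()


cache = TrackedCache()
cache.speculative_access("A")                  # speculative load
cache.speculative_access("B", is_store=True)   # speculative store
```

A remote read of block A is harmless here, while a remote write to A or any remote request for B triggers a misspeculation.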

Resource constraints or when we should use locks

Limited resources may force a misspeculation if either there is not enough buffer space to store speculative updates, or it is not possible to monitor accessed data to provide atomicity guarantees. Four such conditions for misspeculation are:

Finite cache size. If register checkpoint is used, the cache may not be able to track all memory accesses.
Finite write-buffer size. The number of unique cache lines modified exceeds the write-buffer size.
Finite ROB size. If the checkpoint approach is used, the ROB size is not a problem.
Uncached accesses or events (e.g., some system calls) where the processor cannot track requests.

Transactional Lock Removal

The main idea is to use SLE for lock-free execution, but to use timestamp-based fair conflict resolution instead of locks (though some situations, such as resource constraints, still need locks).
The algorithm is based on Lamport clocks (L.Lamport. Time, clocks, and the ordering of events in a distributed system). Timestamps are used for resolving conflicts to decide a conflict winner - earlier timestamp implies higher priority.

The algorithm

1. Calculate local timestamp
2. Identify transaction start:
   (a) Initiate TLR mode (use SLE to elide locks)
   (b) Execute transaction speculatively.
3. During transactional speculative execution:
   (a) Locally buffer speculative updates
   (b) Append timestamp to all outgoing requests
   (c) If incoming request conflicts with retainable block and has later timestamp, retain ownership and force requestor to wait
   (d) If incoming request conflicts with retainable block and has earlier timestamp, service request and restart from step 2b if necessary. Give up any retained ownerships.
   (e) If insufficient resources, acquire lock:
       (i) No buffer space
       (ii) Operation cannot be undone (e.g., I/O)
4. Identify transaction end:
   (a) If all blocks available in local cache in appropriate coherence state, atomically commit memory updates from local buffer into cache (write to cache using SLE).
   (b) Commit transaction register (processor) state.
   (c) Service waiters if any
   (d) Update local timestamp
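The conflict rule of the algorithm (earlier timestamp wins) and the timestamp update at commit can be sketched as follows. The names Timestamp, owner_action and after_commit are hypothetical, and breaking ties by processor id is an assumption of this sketch, not something stated above.

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    clock: int   # local logical (Lamport-style) clock
    cpu: int     # processor id, used here only to break ties (assumption)


def owner_action(owner, requester):
    """Decide what the current holder of a block does with a conflicting request."""
    if owner < requester:          # tuple comparison: earlier timestamp wins
        return "defer"             # retain ownership, make the requester wait
    return "service-and-restart"   # give up the block and re-execute


def after_commit(local, seen):
    """Advance the local clock past every timestamp observed in conflicts."""
    newest = max([local.clock] + [ts.clock for ts in seen])
    return Timestamp(newest + 1, local.cpu)


p0 = Timestamp(clock=3, cpu=0)
p1 = Timestamp(clock=5, cpu=1)
```

Because a restarted transaction keeps its timestamp while committed ones move their clocks forward, a transaction that keeps losing eventually becomes the oldest and wins.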

Propagating priority information

Let us consider now the following situation:

There are 3 processors - P0, P1, P2
They have priorities: P0>P1>P2
P0 has exclusive ownership of block A
P1 has exclusive ownership of block B
At time t1 P1 sends a request for A, which goes to P0 (the current owner)
P0 receives the request, resolves the conflict, wins it (P0>P1), and defers the response
Now P1 exclusively owns block A (because P1's request has been ordered by the protocol), but the data (and write permission) is still with P0
P1 is waiting for P0 for cache block A
P2 sends a request to P1 for block B
The situation is the same as above: P1 wins the conflict (P1>P2) and defers the response, so P2 owns B exclusively, but the data and write permission are still with P1
P2 is waiting for P1
P0 sends a request for B and it is forwarded to P2 (which owns B, though it does not have the data itself)
P2 loses the conflict and should give block B to P0
But P2 is waiting for P1 to release block B, and P1 is waiting for P0 to release block A, while P0 is waiting for P2
Deadlock!

Now let's think the situation over. The deadlock happened because the information about priorities was not propagated along the cache coherence protocol chain. To solve the problem we need the following:

Marker messages - directed messages sent in response to a request for a block under conflict for which data is not provided immediately.
Probes - used to propagate a conflicting request upstream along a cache coherence protocol chain. Thus, when P2 receives P0's request for B, P2 forwards a probe (with P0's timestamp) to P1, since P2 received a marker message from P1. P1 receives P0's forwarded probe (via P2) and loses the conflict because P0 has higher priority than P1. P1 releases ownership of block B and the cyclic wait is broken.
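Probe forwarding for block B in the scenario above can be illustrated with a tiny function. Everything here is a simplified sketch: forward_probe, the dictionaries, and the use of small integers as timestamps are hypothetical, and a real protocol exchanges these messages over the interconnect.

```python
def forward_probe(receiver, probe_ts, marker_from, timestamp, has_data):
    """Pass a probe upstream along marker links until it reaches the data.

    Returns the processor that actually holds the data and whether that
    processor loses the conflict (and therefore must release the block).
    """
    node = receiver
    while not has_data[node]:
        node = marker_from[node]       # the processor that told this node to wait
    holder_loses = probe_ts < timestamp[node]
    return node, holder_loses


# Block B: an earlier (smaller) timestamp means higher priority.
timestamp = {"P0": 0, "P1": 1, "P2": 2}
marker_from = {"P2": "P1"}             # P1 deferred P2's request for B
has_data = {"P1": True, "P2": False}   # P2 owns B, but the data is at P1

holder, must_release = forward_probe("P2", timestamp["P0"],
                                     marker_from, timestamp, has_data)
```

Here the probe carrying P0's timestamp reaches P1, which loses and releases B, so the cycle is broken.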

Reading 5. Improvements on DSTM and WSTM

2008-03-27T23:04:00.004+03:00

Contention policies

Aggressive. The acquiring transaction terminates any conflicting transaction
Polite. The acquiring transaction uses exponential backoff to delay for a fixed number of exponentially growing intervals before aborting the other transaction. After each interval, the transaction checks if the other transaction has finished with the object.
Timestamp. This manager aborts any transaction that started execution after the acquiring transaction.
Published Timestamp. This manager follows the timestamp policy, but also aborts older transactions that appear inactive.
Greedy. This manager aborts the transaction that has executed the least amount of time
Kindergarten. This manager maintains a “hit list” of transactions to which a given transaction previously deferred. If the transaction holding the object is on the list, the acquiring transaction immediately terminates it. If it is not on the list, the acquiring transaction adds it to the list and then backs off for a fixed interval before aborting itself. This policy ensures that two transactions sharing an object take turns aborting (hence the name).
Karma. This manager uses a count of the number of objects that a transaction has opened (cumulative, across all of its aborts and reexecutions) as a priority. An acquiring transaction immediately aborts another transaction with lower priority. If the acquiring transaction’s priority is lower, it backs off and tries to reacquire the object N times, where N is the difference in priorities, before aborting the other transaction.
Eruption. This manager is similar to Karma, except that it adds the blocked transaction’s priority to the active transaction’s priority, to help reduce the possibility that a third transaction will subsequently abort the active transaction.
Polka. This is a combination of the Polite and Karma policies. The key change to Karma is to use exponential backoff for the N intervals.
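As an illustration of how such a policy turns into a decision, here is a minimal sketch of the Karma rule and the Polka backoff schedule as described above. The function names and the base_delay parameter are hypothetical; priorities are the counts of opened objects.

```python
def karma_aborts_enemy(my_priority, enemy_priority, attempts_so_far):
    """True when the acquiring transaction may abort the current holder."""
    # N is the difference in priorities; a lower-priority holder gives N = 0,
    # so it is aborted immediately.
    n = max(0, enemy_priority - my_priority)
    return attempts_so_far >= n


def polka_backoff(my_priority, enemy_priority, base_delay=1):
    """Delays for Polka's N retries: exponentially growing instead of fixed."""
    n = max(0, enemy_priority - my_priority)
    return [base_delay * 2 ** i for i in range(n)]
```

For example, a transaction with priority 3 facing a holder with priority 6 retries three times under Karma, and under Polka waits 1, 2 and 4 time units between those retries.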

SXM
SXM is a deferred, object-based STM system implemented as a library for C# code. It is similar in operation to the DSTM system, although implemented in .NET.

Strong or Weak Isolation	Weak
Transaction Granularity	Object
Direct or Deferred Update	Deferred (cloned replacement)
Concurrency Control	Optimistic
Synchronization	Nonblocking (obstruction free)
Conflict Detection	Early
Inconsistent Reads	Inconsistency toleration
Conflict Resolution	Explicit contention manager
Nested Transaction	Closed
Exceptions

Key features:

polymorphic contention management, which is a framework for managing conflicts between transactions
run-time code generation to produce some of the boilerplate code

Each transaction selects a contention manager from a collection of managers that implement a diverse set of policies. All policies are classified based on the cost of the state that they maintain.

Rank	Policy Class	Policy	State
1		Aggressive, Polite	–
2	Ad hoc	Greedy, Killblocked	Transaction start time
3	Local	Timestamp	Transaction start time, variable
4		Kindergarten	List of transactions
5	Historical	Karma, Polka, Eruption	List of objects

A higher-ranked policy is more expensive to compute than a lower-ranked one, but it is not necessarily more predictive of future behavior. Ranking identifies policies that are comparable to each other.
If two conflicting transactions use policies from different policy classes, SXM assumes that they were not intended to conflict, so neither transaction’s policy is preferable, and it falls back to the Greedy policy.
If the two transactions’ policies belong to the same class, SXM applies the policy of the transaction that wants to acquire the object.
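The selection rule can be sketched in a few lines. The function name policy_for_conflict is hypothetical, and the mapping below lists only the policies whose class is given explicitly in the table.

```python
def policy_for_conflict(acquirer_policy, holder_policy, policy_class):
    """Pick the contention policy that resolves a conflict between two transactions."""
    if policy_class[acquirer_policy] == policy_class[holder_policy]:
        # Comparable policies: the acquiring transaction's policy decides.
        return acquirer_policy
    # Policies from different classes: fall back to Greedy.
    return "Greedy"


policy_class = {
    "Greedy": "Ad hoc",
    "Killblocked": "Ad hoc",
    "Timestamp": "Local",
    "Karma": "Historical",
    "Polka": "Historical",
    "Eruption": "Historical",
}
```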

ASTM
Adaptive STM (ASTM) is a deferred, object-based STM system by Marathe, Scherer, and Scott that explored several performance improvements to the basic design of the DSTM system.

Strong or Weak Isolation	Weak
Transaction Granularity	Object
Direct or Deferred Update	Deferred (cloned replacement)
Concurrency Control	Optimistic
Synchronization	Nonblocking (obstruction free)
Conflict Detection	Early or late (selectable)
Inconsistent Reads	Validation
Conflict Resolution	Explicit contention manager
Nested Transaction
Exceptions

Key features:

ASTM eliminated a memory indirection in accessing fields in an object open for reading
adaptive system to change its conflict resolution policy from early to late detection

An object opened for update has the same representation as in DSTM, with two levels of indirection between the TMObject and the object’s fields
An object opened for reading by a transaction has a single level of indirection, so the TMObject points directly to the object’s fields

Late and early conflict detection
With late conflict detection, a transaction does not acquire ownership of an object that it is modifying. Instead, it modifies a cloned copy of the object and defers conflict detection until the transaction commits, which reduces the interval in which an object is locked and avoids some unnecessary transaction aborts. When the transaction commits, it may find that the object was modified, or it may find another transaction in the process of committing changes to the object. The first case causes the transaction to abort, while the second one invokes a contention manager to resolve the conflict.

Late and early conflict detection have comparable overhead for most benchmarks, but early detection is easier to implement. Late acquire performs better for a long running transaction that reads a large number of objects and updates only a few, because this policy allows concurrent readers and writers.

ASTM implements an adaptive policy that recognizes a transaction that modifies few objects but reads many objects. ASTM then switches the thread executing the transaction from early to late detection for subsequent transactions. If a transaction falls below the thresholds, it reverts to early detection.
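A per-thread switch of this kind might look like the sketch below. The class AcquireMode and both threshold values are purely hypothetical; the notes above do not give the actual thresholds ASTM uses.

```python
class AcquireMode:
    """Per-thread choice between early and late conflict detection."""

    def __init__(self, min_reads=16, max_writes=2):
        # Hypothetical thresholds, chosen only for illustration.
        self.min_reads = min_reads
        self.max_writes = max_writes
        self.mode = "early"

    def transaction_finished(self, objects_read, objects_written):
        """Update the mode used by this thread's subsequent transactions."""
        read_mostly = (objects_read >= self.min_reads
                       and objects_written <= self.max_writes)
        self.mode = "late" if read_mostly else "early"
        return self.mode


thread_mode = AcquireMode()
```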

Ananian and Rinard

Strong or Weak Isolation	Strong
Transaction Granularity	Object
Direct or Deferred Update	Deferred (in-place)
Concurrency Control	Optimistic
Synchronization	Nonblocking
Conflict Detection	Early
Inconsistent Reads	Invalidation
Conflict Resolution	Abort conflicting transaction
Nested Transaction	Flatten
Exceptions	Terminate or abort

Key features:

strong isolation
written in PROMELA, so it can be directly verified with the SPIN model checker

Each Java object is extended by two fields. The first, named versions, is a linked list containing a version record for each transaction that modified a field in the object. A version record identifies the transaction and records the updated value of each modified field. The second field, named readers, is a list of transactions that read a field in the object.

A novel aspect of this system is the use of a sentinel value in a memory location. The sentinel signals a conflicting access and redirects read and write operations to the transactional data structures.

NT-read: if it encounters the sentinel, it aborts the transaction modifying the object containing the value, restores the field’s value from the most recently committed transaction’s record, clears the versions list, and re-reads the location.

NT-write aborts all transactions reading and writing the object and directly updates the field in the object. The readers and versions lists are cleared. If the program is actually writing the sentinel value, the write is treated as a short transaction, so that the stored value can be told apart from a real sentinel.

T-read first ensures that its transaction descriptor is on the object’s reader list. It aborts all uncommitted transactions that modified the object. After this, the transaction can read the field in the object and directly use any values other than the sentinel. The sentinel value, on the other hand, requires a search of the version records to find one with the same version and the updated value of the field. After this we can clear versions list.

A T-write aborts all other uncommitted transactions that read or wrote the object. It also creates, if none previously existed, a version object for the transaction. The next step is to copy the unmodified value in the field to all version records, including those of the committed transactions, so that the field can be restored if the transaction is rolled back. If the versions list does not contain a committed transaction, one must be created to hold this value. Finally, the new value can be written to the running transaction’s version record and the object field set to the sentinel value

This STM system aggressively resolves conflicting references to an object. A read aborts transactions in the process of modifying the object and a write aborts all transactions that accessed the object. In essence, the system implements a multireader, single-writer lock on an object.

RSTM

Strong or Weak Isolation	Weak
Transaction Granularity	Object
Direct or Deferred Update	Deferred (cloned replacement)
Concurrency Control	Optimistic
Synchronization	Nonblocking (obstruction-free)
Conflict Detection	Early or late (selectable)
Inconsistent Reads	Bounded invalidation
Conflict Resolution	Conflict manager
Nested Transaction	Flatten
Exceptions

Key features:

RSTM only uses a single level of indirection to an object, instead of the two levels used by previous systems such as DSTM
RSTM avoids dynamically allocating many of its data structures and contains its own memory collector, so it can work with nongarbage collected languages such as C++
RSTM uses invalidation to avoid inconsistent reads. It employs a new heuristic for tracking an object’s readers

In RSTM, every transactional object is accessed through an ObjectHeader, which points directly to the current version of the object. RSTM uses the low-order bit of the NewData field in this object as a flag

The TransactionDescriptor referenced through an object’s header determines the transaction’s state. If the transaction commits, then NewDataObject is the current version of the object. If the transaction aborts, then OldDataObject is the current version. If the transaction is active, no other transaction can read or write the object without aborting the transaction.
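The rule for finding the current version can be sketched as follows. The classes Descriptor and DataObject and the string status values are hypothetical stand-ins for RSTM's own structures, and the low-order flag bit of NewData is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    status: str = "ACTIVE"    # "ACTIVE", "COMMITTED" or "ABORTED"


@dataclass
class DataObject:
    payload: object
    owner: Descriptor                        # transaction that installed this version
    old_data: Optional["DataObject"] = None  # version it replaced


def current_version(new_data):
    """Return the valid version behind an object header, or None on conflict."""
    status = new_data.owner.status
    if status == "COMMITTED":
        return new_data
    if status == "ABORTED":
        return new_data.old_data
    # Owner is still active: the caller must abort it (via the contention
    # manager) before it can read or write the object.
    return None


old = DataObject("v1", Descriptor("COMMITTED"))
pending = DataObject("v2", Descriptor("ACTIVE"), old)
```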

A transaction opens an object before accessing it:

1. If opening the object for update, the transaction must first acquire the object with the following actions:
   (a) Read the object’s NewData pointer and make sure no other transaction owns it. If it is owned, invoke the contention manager.
   (b) Allocate the NewDataObject and copy values from the object’s current version.
   (c) Initialize the Owner and OldData pointers in the new object.
   (d) Use a CAS to atomically swap the pointer read in step (a) with a pointer to the newly allocated copy.
   (e) Add the object to the transaction’s private write list.
   (f) Iterate through the object’s visible reader list, aborting all transactions it contains.
2. If opening the object for reading and space is available in the object’s visible reader list, add the transaction to this list. If the list is full, add the object to the transaction’s private read list.
3. Check the status word in the transaction’s descriptor, to make sure another transaction has not aborted it.
4. Incrementally validate all objects on the transaction’s private read list.

Introducing the visible reader list was expected to improve performance, because it reduces the cost of validation. Benchmarks, however, show that visible readers are actually more costly, because of the extra cache traffic caused by updating the visible reader list in each object.

TL
Dice and Shavit described an STM system called transactional locking (TL), which combined deferred update with blocking synchronization

Strong or Weak Isolation	Weak
Transaction Granularity	Object, word or region
Direct or Deferred Update	Deferred (in-place)
Concurrency Control	Optimistic
Synchronization	Blocking
Conflict Detection	Early or late (selectable)
Inconsistent Reads	Inconsistency toleration
Conflict Resolution	Delay and abort
Nested Transaction	Flatten
Exceptions

Key features:

uses blocking to acquire an object

Each object is associated with a lock. A lock is either locked, and pointing to the transaction holding exclusive access, or unlocked, and recording the object’s version, which is incremented when a transaction updates the object.

A transaction maintains a read set and a write set. An entry in the read set contains the address of the object and the version number of the lock associated with the object. An entry in the write set contains the address of the object, the address of its lock, and the updated value of the object.

When the transaction executes a write, it first looks for the object’s entry in its write set. If it is not present, the transaction creates an entry for the object. The write modifies the entry in the write set, not the actual object. The transaction does not acquire the lock.

A memory load first checks the write set, to determine if the transaction previously updated the object. If so, it uses the updated value from the write set. If the object is not in the set, the transaction adds the referenced object to the read set and attempts to read the actual object. If another transaction locked the object, the reading transaction can either delay and retry or abort itself.

When a transaction commits, it first acquires locks for all objects in its write set. A transaction will only wait a bounded amount of time for a lock to be released to avoid deadlock. After acquiring locks for the objects it modified, the transaction validates its read set. If successful, the transaction can complete by copying updated values into the objects, releasing the locks, and freeing its data structures.
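The read, write and commit paths just described can be put together in a single-threaded sketch. All names (TObject, Transaction, Conflict) are hypothetical, a plain attribute stands in for the atomically updated lock word, and returning False immediately stands in for the bounded wait.

```python
class TObject:
    """A shared object together with its versioned lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0      # meaningful while the lock is free
        self.owner = None     # transaction holding the lock, if any


class Conflict(Exception):
    pass


class Transaction:
    def __init__(self):
        self.read_set = {}    # object -> lock version observed
        self.write_set = {}   # object -> pending new value

    def read(self, obj):
        if obj in self.write_set:
            return self.write_set[obj]
        if obj.owner is not None:
            raise Conflict("object is locked")   # or delay and retry
        self.read_set[obj] = obj.version
        return obj.value

    def write(self, obj, value):
        self.write_set[obj] = value              # no lock is taken yet

    def commit(self):
        acquired = []
        try:
            for obj in self.write_set:           # 1. lock the write set
                if obj.owner is not None:
                    return False                 # stands in for a bounded wait
                obj.owner = self
                acquired.append(obj)
            for obj, seen in self.read_set.items():   # 2. validate the read set
                if obj.version != seen or obj.owner not in (None, self):
                    return False
            for obj, value in self.write_set.items():  # 3. write back
                obj.value = value
                obj.version += 1
            return True
        finally:
            for obj in acquired:                 # 4. release the locks
                obj.owner = None


x = TObject(0)
t1, t2 = Transaction(), Transaction()
t1.write(x, t1.read(x) + 1)
t2.write(x, t2.read(x) + 1)
first = t1.commit()
second = t2.commit()   # fails validation: x changed since t2 read it
```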

Performance measurements on simple benchmarks showed that TL performed better than other STM systems, such as Harris and Fraser’s nonblocking WSTM system

Programme

2008-03-10T13:02:00.010+03:00

Here is a list of readings:

Deferred Update systems:
1. DSTM, WSTM and contention management (3.4.1-3.4.4) [Oleg]
2. Improvements on DSTM and WSTM (3.4.4-3.4.9) [Denis]

Direct Update systems:
3. McRT-STM and compiler optimizations (3.5.2-3.5.3) [Roma]
4. Bartok STM and compiler optimizations (3.5.4) [Lena]
Skipping: Autolocker

Language-oriented STMs:
5. HSTM for Concurrent Haskell and AtomCaml (3.6.3-3.6.4) [Ilya]
Skipping: discussion of exceptions thrown from atomics, real-time Java stuff

HTM
Precursors
6. Optimistic synchronisation in hardware (4.3.2-4.3.5) [Leonid]
7. Lock elision (4.3.6-4.3.7) [Nastasia]
Skipping: IBM 801
HTM designs
8. Bounded/Large HTMs (4.4) [Kostya]
9. Unbounded HTMs (4.5)
10. Hybrid HTM-STMs (4.6)

Reading 4. Word granularity STM

2008-03-10T12:49:00.017+03:00

March 14, 2008

Harris and Fraser’s 2003 OOPSLA paper was the first to describe a practical STM system integrated into a programming language. They implemented WSTM (word-granularity STM) in the ResearchVM from Sun Labs.
WSTM benefits

WSTM did not require a programmer to declare the memory locations accessed within a transaction.
WSTM was integrated into a modern, object-oriented language (Java) by extending the language with the atomic operation. Strangely enough, WSTM did not exploit the object-oriented nature of Java and could support procedural languages as well.

Strong or Weak Isolation	Weak
Transaction Granularity	Word
Direct or Deferred Update	Deferred (update in place)
Concurrency Control	Optimistic
Synchronization	Obstruction free
Conflict Detection	Late
Inconsistent Reads	Inconsistency toleration
Conflict Resolution	Helping or aborting
Nested Transaction	Flattened
Exceptions	Terminate

WSTM extended Java with a new statement:


atomic (condition) { 
   statements; 
}

A modified JIT (Just-In-Time, i.e., run-time) compiler translated this statement into


bool done = false;
while (!done) {
    STMStart();
    try {
        if (condition) {
            statements;
            done = STMCommit();
        } else {
            STMWait();
        }
    } catch (Exception t) {
        done = STMCommit();
        if (done) {
            throw t;
        }
    }
}

Note: an exception within the atomic region’s predicate or body causes the transaction to commit. Subsequent systems more typically treated an exception as an error that aborts a transaction.
The code produced by the compiler relies on five primitive operations provided by the WSTM library:


void   STMStart() 
void   STMAbort() 
bool   STMCommit() 
bool   STMValidate() 
void   STMWait()

In addition, all references to object fields from statements within an atomic region are replaced by calls to an appropriate library operation:


STMWord STMRead(Addr a) 
void STMWrite(Addr a, STMWord w)

Restrictions: Because the JVM’s JIT compiler performs this translation, only references from Java bytecodes, not those in native methods, are translated to access these auxiliary structures. A few native methods were hand translated and included in a WSTM library, but a call on most native methods (including those that perform IO operations) from a transaction would cause a run-time error.


enum TransactionStatus { ACTIVE, COMMITTED, ABORTED, ASLEEP }; 
class TransactionEntry { 
     public Addr loc; 
     public STMWord oldValue; 
     public STMWord oldVersion; 
     public STMWord newValue; 
     public STMWord newVersion; 
} 
class TransactionDescriptor { 
     public TransactionStatus status = ACTIVE; 
     int nestingDepth = 0; 
     public Set entries; 
}

The status field records the transaction status. A transaction starts in state ACTIVE and makes a transition into one of the three other states. The nestingDepth field records the number of (flattened) transactions sharing this descriptor.
The entries field holds a TransactionEntry record for each location the transaction reads or writes. The compiler redirects memory reads and writes to the appropriate descriptor. A TransactionEntry records a location’s original value (before the transaction’s first access) and its current value. For a location only read, the two values are identical. A transaction increments the version number when it modifies the location. The system uses the version number to detect conflicts.

An OwnershipRec serves two purposes. First, it records the version number of the memory location, produced by the most recent committed transaction that updated the location. Second, when a transaction is in the process of committing, an OwnershipRec records the transaction that has acquired exclusive ownership of the location. Each ownership record therefore holds either a version number or a pointer to the transaction that owns the location:


class OwnershipRec { 
     union { 
         public STMWord version; 
         public TransactionDescriptor* trans; 
     } val; 
     public bool HoldsVersion() { return (val.version & 0x1) != 0; } 
}

The function FindOwnershipRec(a) maps memory address a to its associated OwnershipRec. The function CASOwnershipRec(a, old, new) performs a compare-and-swap operation on the OwnershipRec for memory address a, replacing it with value new, if the existing entry is equal to old.


void STMStart() { 
   if (ActiveTrans == null || ActiveTrans.status != TransactionStatus.ACTIVE) { 
        ActiveTrans = new TransactionDescriptor(); 
        ActiveTrans.status = TransactionStatus.ACTIVE; 
   } 
   AtomicAdd(ActiveTrans.nestingDepth, 1); 
} 

void STMAbort() { 
   ActiveTrans.status = TransactionStatus.ABORTED; 
   ActiveTrans.entries = null; 
   AtomicAdd(ActiveTrans.nestingDepth, -1); 
} 

struct ValVersion { 
   public STMWord val; 
   public STMWord version; 
} 

STMWord STMRead(Addr a) { 
   TransactionEntry* te = ActiveTrans.entries.Find(a); 
   if (null == te) { 
   // No entry in transaction descriptor. Create new entry (get value from memory) 
   // and add it to descriptor. 
      ValVersion vv = MemRead(a); 
      te = new TransactionEntry(a, vv.val, vv.version, vv.val, vv.version); 
      ActiveTrans.entries.Add(a, te); 
      return vv.val; 
   } else { 
   // Entry already exists in descriptor, so return its (possibly updated) value. 
      return te.newValue; 
   } 
}

void STMWrite(Addr a, STMWord w) { 
    STMRead(a); // Create entry if necessary. 
    TransactionEntry te = ActiveTrans.entries.Find(a); 
    te.newValue = w; 
    te.newVersion += 2; // Version numbers are odd numbers. 
}

The function MemRead returns the value of a memory location, along with its version number:

If no other transaction accessed the location and started committing, then the current value resides in the memory location and its ownership record contains the version number.
If another transaction accessed the location and committed, the value is in the transaction’s newValue field and the version in the newVersion field.
If another transaction accessed the location and has started, but not finished committing, the value is stored in the transaction’s oldValue field and the version in its oldVersion field.
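The three cases can be sketched as a lookup on the ownership record. This is an illustration only: the names below are hypothetical, an ownership record is modelled as either an int (a version number) or a Descriptor (the owning transaction), and one record per address is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    old_value: int
    old_version: int
    new_value: int
    new_version: int


@dataclass
class Descriptor:
    status: str = "ACTIVE"                       # "ACTIVE", "COMMITTED", ...
    entries: dict = field(default_factory=dict)  # address -> Entry


def mem_read(addr, memory, ownership):
    """Return (value, version) for addr, looking through any owning transaction."""
    orec = ownership[addr]
    if isinstance(orec, int):
        # Case 1: nobody is committing; the value is in memory.
        return memory[addr], orec
    entry = orec.entries[addr]
    if orec.status == "COMMITTED":
        # Case 2: owner committed but may not have copied the value back yet.
        return entry.new_value, entry.new_version
    # Case 3: owner has not finished committing; use its original value.
    return entry.old_value, entry.old_version


memory = {"a": 10, "b": 20}
writer = Descriptor(entries={"b": Entry(20, 3, 99, 5)})
ownership = {"a": 1, "b": writer}
```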

STMCommit first acquires the ownership records for all locations accessed by the transaction. If successful, STMCommit changes the transaction’s state to COMMITTED, copies the modified values to memory, and releases the ownership records.
These three steps appear logically atomic to concurrent transactions because the committing transaction’s status changes atomically (and irrevocably) from ACTIVE to COMMITTED using an atomic read-modify-write operation. Once this change occurs, MemRead will return the updated value, even before the transaction copies value back to memory.


bool STMCommit() {
    // Only the outermost nested transaction can commit;
    // inner commits are deferred to it.
    if (AtomicAdd(ActiveTrans.nestingDepth, -1) != 0) { return true; }
    // A nested transaction already aborted this transaction.
    if (ActiveTrans.status == TransactionStatus.ABORTED) { return false; }
    // Acquire ownership of all locations accessed by the transaction.
    int i;
    for (i = 0; i < ActiveTrans.entries.Size(); i++) {
        TransactionEntry* te = ActiveTrans.entries[i];
        switch (acquire(te)) {
            case TRUE: { continue; }
            case FALSE: {
                ActiveTrans.status = TransactionStatus.ABORTED;
                goto releaseAndReturn;
            }
            case BUSY: { /* conflict resolution */ }
        }
    }
    // Transaction commits.
    ActiveTrans.status = TransactionStatus.COMMITTED;
    // Copy modified values to memory.
    for (i = 0; i < ActiveTrans.entries.Size(); i++) {
        TransactionEntry* te = ActiveTrans.entries[i];
        *((STMWord*)te.loc) = te.newValue;
    }
releaseAndReturn:
    // Release the ownership records of entries 0..i-1.
    for (int j = 0; j < i; j++) { release(ActiveTrans.entries[j]); }
    return ActiveTrans.status == TransactionStatus.COMMITTED;
}

// Returns TRUE (acquired), FALSE (version mismatch),
// or BUSY (owned by another transaction).
int acquire(TransactionEntry* te) {
    OwnershipRec orec = CASOwnershipRec(te.loc, te.oldVersion, ActiveTrans);
    if (orec.HoldsVersion()) {
        return (orec.val.version == te.oldVersion) ? TRUE : FALSE;
    } else {
        return (orec.val.trans == ActiveTrans) ? TRUE : BUSY;
    }
}

void release(TransactionEntry* te) { 
   if (ActiveTrans.status == TransactionStatus.COMMITTED) { 
       CASOwnershipRec(te.loc, ActiveTrans, te.newVersion); 
   } else { 
       CASOwnershipRec(te.loc, ActiveTrans, te.oldVersion); 
   } 
}

STMValidate is a read-only operation that checks the ownership records for each location accessed by the current transaction, to ensure that they are still consistent with the version the transaction initially read:


bool STMValidate() { 
    for (int i = 0; i < ActiveTrans.entries.Size(); i++) { 
        TransactionEntry* te = ActiveTrans.entries[i]; 
        OwnershipRec orec = FindOwnershipRec(te.loc); 
        if (orec.val.version != te.oldVersion) { return false; } 
    } 
    return true; 
}

STMWait can be used to implement a conditional critical region by suspending the transaction until its predicate should be reevaluated. It aborts the current transaction and waits until another transaction modifies a location accessed by the first transaction. To do so, it acquires ownership of every location recorded in the transaction’s entries, changes the transaction’s status to ASLEEP, and suspends the thread running the transaction. When another transaction updates one of these locations, it will conflict with the suspended transaction.
The conflict manager should allow the active transaction to complete execution and then resume the suspended transaction, which releases its ownership records and then retries the transaction:


void STMWait() {
    int i;
    for (i = 0; i < ActiveTrans.entries.Size(); i++) {
        TransactionEntry* te = ActiveTrans.entries[i];
        switch (acquire(te)) {
            case TRUE: { continue; }
            case FALSE: {
                ActiveTrans.status = TransactionStatus.ABORTED;
                goto releaseAndReturn;
            }
            case BUSY: { /* conflict resolution */ }
        }
    }
    // Transaction waits, unless in conflict with another transaction and
    // needs to immediately re-execute.
    ActiveTrans.status = TransactionStatus.ASLEEP;
    SuspendThread();
releaseAndReturn:
    // Release the ownership records of entries 0..i-1.
    for (int j = 0; j < i; j++) { release(ActiveTrans.entries[j]); }
}

If two transactions share a location that neither one modifies, one transaction will be aborted, since the system does not distinguish read-only locations from modified locations.
This performance issue is easily corrected. STMWrite can set a flag isModified in a transaction entry to record a modification of the location. STMCommit should then acquire ownership of modified locations only, and validate unmodified locations. This introduces a new transaction status READ_PHASE. The transaction remains in this state until it commits.


void STMCommit() {
   int i;
   // Acquire ownership of the modified locations only.
   for (i = 0; i < ActiveTrans.entries.Size(); i++) {
      TransactionEntry* te = ActiveTrans.entries[i];
      if (te->isModified) {
         switch (acquire(te)) {
            case TRUE: { continue; }
            case FALSE: {
               ActiveTrans.status = TransactionStatus.ABORTED;
               goto releaseAndReturn;
            }
            case BUSY: { /* conflict resolution */ }
         }
      }
   }
   // Validate the locations that were only read.
   ActiveTrans.status = TransactionStatus.READ_PHASE;
   for (int k = 0; k < ActiveTrans.entries.Size(); k++) {
      TransactionEntry* te = ActiveTrans.entries[k];
      if (!te->isModified) {
         ValVersion vv = MemRead(te->loc);
         if (te->oldVersion != vv.version) {
            // Another transaction updated this location.
            ActiveTrans.status = TransactionStatus.ABORTED;
            goto releaseAndReturn;
         }
      }
   }
   // Transaction commits. Write modified values to memory.
   ActiveTrans.status = TransactionStatus.COMMITTED;
   for (int k = 0; k < ActiveTrans.entries.Size(); k++) {
      TransactionEntry* te = ActiveTrans.entries[k];
      if (te->isModified) { *((STMWord*)te->loc) = te->newValue; }
   }
   // Release the ownership records: only modified entries among 0..i-1 were acquired.
releaseAndReturn:
   for (int j = 0; j < i; j++) {
      if (ActiveTrans.entries[j]->isModified) { release(ActiveTrans.entries[j]); }
   }
}
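
To make the two phases concrete, here is a minimal Java sketch of the same idea. It is not the code from the book: Cell, Entry and TinyCommit are hypothetical names, a per-cell ReentrantLock stands in for the ownership record, and conflict resolution is reduced to "abort if the location is busy".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical shared location: a value, a version counter, and a lock
// that stands in for the ownership record.
class Cell {
    volatile int value;
    volatile int version;
    final ReentrantLock owner = new ReentrantLock();
    Cell(int value) { this.value = value; }
}

// Hypothetical transaction entry: the version seen at first access and,
// for modified locations, the value to publish.
class Entry {
    final Cell loc;
    final int oldVersion;
    final boolean isModified;
    final int newValue;
    Entry(Cell loc, boolean isModified, int newValue) {
        this.loc = loc;
        this.oldVersion = loc.version;
        this.isModified = isModified;
        this.newValue = newValue;
    }
}

class TinyCommit {
    // Returns true if the transaction committed, false if it aborted.
    static boolean commit(List<Entry> entries) {
        List<Entry> acquired = new ArrayList<>();
        try {
            // Acquire phase: take ownership of modified locations only.
            for (Entry e : entries) {
                if (!e.isModified) continue;
                if (!e.loc.owner.tryLock()) return false; // simplification: abort on BUSY
                acquired.add(e);
                if (e.loc.version != e.oldVersion) return false;
            }
            // Read phase: validate the locations that were only read.
            for (Entry e : entries) {
                if (!e.isModified && e.loc.version != e.oldVersion) return false;
            }
            // Commit: publish new values and bump versions.
            for (Entry e : acquired) {
                e.loc.value = e.newValue;
                e.loc.version = e.oldVersion + 1;
            }
            return true;
        } finally {
            // Release ownership of everything acquired, on commit and on abort.
            for (Entry e : acquired) e.loc.owner.unlock();
        }
    }
}
```

Two transactions that only read the same Cell never touch its lock here, so neither aborts the other; that is the point of the isModified flag.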

Reading 3. Deferred STM

2008-03-07T14:58:00.011+03:00

March 7, 2008

Deferred-update STM: DSTM by Herlihy, Luchangco, Moir, and Scherer, PODC 2003

DSTM Characteristics

Obstruction freedom
Explicit contention manager, which encapsulates the policy of resolving conflicts
Can release an object early, which shrinks the transaction's read set

DSTM
Strong or Weak Isolation	Weak
Transaction granularity	Object
Update	Deferred (cloned replacement)
Concurrency control	Optimistic
Synchronization	Obstruction free
Conflict detection	Early
Inconsistent reads	Validation
Conflict resolution	Explicit contention manager
Nested transactions	Flattened

A programmer must explicitly invoke library functions to create a transaction and to access shared objects. Transactions run on threads of a new class. The programmer must introduce and properly manipulate a container for each object involved in a transaction.
Example of using DSTM:

public boolean insert(int v) {
  List newList = new List(v);
  TMObject newNode = new TMObject(newList);
  TMThread thread = (TMThread)Thread.currentThread();
  while (true) {
    thread.beginTransaction();
    boolean result = true;
    try {
      List prevList = (List)this.first.open(WRITE);
      List currList = (List)prevList.next.open(WRITE);
      // Walk the sorted list until the insertion point.
      while (currList.value < v) {
        prevList = currList;
        currList = (List)currList.next.open(WRITE);
      }
      if (currList.value == v) { result = false; }
      else {
        result = true;
        newList.next = prevList.next;
        prevList.next = newNode;
      }
    } catch (Denied d) {}
    // Retry the whole transaction if the commit fails.
    if (thread.commitTransaction()) {
      return result;
    }
  }
}

The TMThread class extends the Java Thread class:

class TMThread extends Thread {
  void beginTransaction();
  boolean commitTransaction();
  void abortTransaction();
}

A transaction references an object through a TMObject. The open operation prepares a TMObject to be manipulated by a transaction and exposes the underlying object to the code in the transaction. The actions that open performs depend on whether the object is opened for reading or for writing.

class TMObject {
  private class Locator {
    public Transaction trans;
    public Object oldVersion;
    public Object newVersion;
  }
  TMObject(Object obj);
  enum Mode { READ, WRITE };
  Object open(Mode mode);
}

The current version of an object is found through the object’s Locator.

If the Locator does not contain a transaction, the current version is the original object (oldVersion).

If the Locator points to some transaction, we check its status:
1. COMMITTED. The current version is the one modified by the
  transaction (newVersion).
2. ABORTED. The current version is the original object
  (oldVersion).
3. ACTIVE. Conflict! The contention manager must resolve the
  conflict by aborting or delaying one of the transactions.
  Note: this is the only place where we ask the contention manager. All other conflicts mean inconsistency, and we have no choice but to abort the current transaction.

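DSTM adds two levels of indirection to an object: the TMObject points to a Locator, and the Locator points to the object versions. What open does depends on the mode: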

READ - adds the pair (TMObject, current version) to the current transaction's read set, then validates the current transaction

// Record the TMObject and its current value (version) in transaction’s read table.
Object version = currentVersion(locInfo);
curTrans.recordRead(this, version);
if (!curTrans.validate()) { throw new Denied(); }
return version;

WRITE - creates a new Locator, gets the current version of the TMObject, clones it, installs the new Locator, and validates

// Create a new Locator pointing to a local copy of the object and install it.
Locator newLocInfo = new Locator();
newLocInfo.trans = curTrans;
Locator oldLocInfo;
// This is a CAS retry loop (not a lock): start over if another transaction
// has replaced the object's Locator in the meantime.
do {
  oldLocInfo = locInfo;
  // Note: currentVersion may run into a conflict with an ACTIVE transaction.
  newLocInfo.oldVersion = currentVersion(oldLocInfo);
  newLocInfo.newVersion = newLocInfo.oldVersion.clone();
} while (CAS(locInfo, oldLocInfo, newLocInfo) != oldLocInfo);
if (!curTrans.validate()) { throw new Denied(); }
return newLocInfo.newVersion;

Validating a transaction’s consistency relies on its read set. DSTM compares each entry in the read set against the current version of the object (obtained by following the TMObject reference). If they differ, the transaction must abort, since it is in an inconsistent state.

A transaction commits by validating its read set and, if that succeeds, by changing its status from ACTIVE to COMMITTED.

Note: there is no need to modify the objects in memory. The status change (ACTIVE -> COMMITTED) alone turns every object version modified by the transaction into the current version of the respective object.
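
The Locator rules and the commit step fit in a few lines of Java. This is only a sketch under my own assumptions, not the DSTM library code: Txn and Status are hypothetical names, the contention manager is replaced by an exception, and currentVersion is written from the point of view of a transaction other than the one that installed the Locator.

```java
import java.util.concurrent.atomic.AtomicReference;

enum Status { ACTIVE, COMMITTED, ABORTED }

// Hypothetical transaction descriptor; only the status matters here.
class Txn {
    final AtomicReference<Status> status = new AtomicReference<>(Status.ACTIVE);

    // Commit and abort are each a single CAS on the status word.
    boolean commit() { return status.compareAndSet(Status.ACTIVE, Status.COMMITTED); }
    boolean abort()  { return status.compareAndSet(Status.ACTIVE, Status.ABORTED); }
}

class Locator<T> {
    final Txn trans;      // transaction that installed this Locator, or null
    final T oldVersion;
    final T newVersion;

    Locator(Txn trans, T oldVersion, T newVersion) {
        this.trans = trans;
        this.oldVersion = oldVersion;
        this.newVersion = newVersion;
    }

    // Current version as seen by a transaction other than trans.
    T currentVersion() {
        if (trans == null) return oldVersion;
        switch (trans.status.get()) {
            case COMMITTED: return newVersion;
            case ABORTED:   return oldVersion;
            default:
                // ACTIVE: a real system would consult the contention manager here.
                throw new IllegalStateException("conflict with an ACTIVE transaction");
        }
    }
}
```

After t.commit() succeeds, every Locator whose trans is t starts returning newVersion without any object being copied, and a later t.abort() fails because the status is no longer ACTIVE.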

Clarification: Invalidation Policies

2008-03-02T13:59:00.005+03:00

I skimmed through Michael Scott's original paper "Sequential Specification of Transactional Memory Semantics", which introduced the classification of invalidation policies (lazy, eager W-R, mixed, and eager) that was a stumbling block at our last reading.

Apparently our understanding of what was meant turns out to be correct. Scott introduces the notion of a transactional memory history, which is essentially a sequence of events: reads and writes of memory, and commits and aborts of transactions (each event is annotated with a transaction). Then he says that a predicate C(H,s,t) (where H is a history and s and t are transactions) is a conflict function if it satisfies certain rules (mainly asserting that non-overlapping transactions do not conflict). Finally, he classifies conflict functions into lazy, eager W-R, mixed, or eager, depending on what kind of histories a particular conflict function classifies as a conflict.

Garbage Collection vs. Transactional Memory

2008-03-02T10:53:00.005+03:00

Dan Grossman's paper "The Transactional Memory / Garbage Collection Analogy" argues that

Transactional Memory is to shared-memory concurrency

as
Garbage Collection is to memory management

Here is a summary list of the similarities:

GC Term	TM Term
memory management	concurrency
dangling pointers	races
space exhaustion	deadlock
regions	locks
garbage collection	transactional memory
reachability	memory conflicts
nursery data	thread-local data
weak pointers	open nesting
I/O of pointers	I/O in transactions
tracing	deferred update
automatic reference counting	direct update
conservative collection	false memory conflicts
real-time collection	obstruction freedom
liveness analysis	escape analysis

Be careful though: the analogy is a bad guide for studying TM. I believe we should first understand all the nuances and problems of implementing TM in its own right, and only then can we think of connections to Garbage Collection.

Mitya has the full text of the article (I believe ACM copyright allows making copies for classroom use)

Reading 2. Taxonomy and Implementation

2008-02-25T02:08:00.012+03:00

February 29th, 2008

Granularity

Object granularity/Word granularity

Direct/Deferred Update

Deferred update: transactions modify private copies of objects, and copy to public space on commit
Direct update: transactions modify objects in place, revert modifications on rollback.
In STM, direct update appears to be faster.
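
A toy single-threaded Java sketch of the two update strategies may help. The names DeferredTxn and DirectTxn are mine, "memory" is just a map, and every key is assumed to exist in the map before a transaction touches it; conflict detection is left out entirely.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Deferred update: writes go to a private buffer and reach memory only on commit.
class DeferredTxn {
    private final Map<String, Integer> memory;
    private final Map<String, Integer> writeBuffer = new HashMap<>();
    DeferredTxn(Map<String, Integer> memory) { this.memory = memory; }

    int read(String key) {
        // A transaction must see its own pending writes.
        return writeBuffer.containsKey(key) ? writeBuffer.get(key) : memory.get(key);
    }
    void write(String key, int value) { writeBuffer.put(key, value); }
    void commit() { memory.putAll(writeBuffer); writeBuffer.clear(); }
    void abort() { writeBuffer.clear(); } // nothing to undo in memory
}

// Direct update: writes go straight to memory; an undo log restores old values on abort.
class DirectTxn {
    private final Map<String, Integer> memory;
    private final Deque<Map.Entry<String, Integer>> undoLog = new ArrayDeque<>();
    DirectTxn(Map<String, Integer> memory) { this.memory = memory; }

    int read(String key) { return memory.get(key); }
    void write(String key, int value) {
        undoLog.push(new AbstractMap.SimpleEntry<>(key, memory.get(key)));
        memory.put(key, value);
    }
    void commit() { undoLog.clear(); } // memory is already up to date
    void abort() {
        // Undo in reverse order so the oldest value of each key wins.
        while (!undoLog.isEmpty()) {
            Map.Entry<String, Integer> e = undoLog.pop();
            memory.put(e.getKey(), e.getValue());
        }
    }
}
```

The sketch shows where the cost goes: deferred update pays on every read (buffer lookup) and on commit (copying), while direct update makes commit cheap and pays on abort.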

Concurrency control

Conflict:

Occurs when two transactions perform conflicting operations on the same memory location
Is detected when the TM system becomes aware of the conflict
Is resolved when the TM system takes action to ensure correctness (delays or aborts a transaction)

These three events happen in that order, but possibly at different times.
Pessimistic concurrency control: all three events happen at the same time.
Optimistic concurrency control: TM system postpones detection and resolution.

Progress Guarantees:

Wait freedom: every thread contending over a set of objects makes forward progress in a finite number of steps
Lock freedom: at least one thread contending over a set of objects makes forward progress
Obstruction freedom: a thread makes progress in the absence of contention over shared objects
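
As a small illustration of lock freedom, here is a counter built on a CAS retry loop in Java. LockFreeCounter is a name made up for this sketch; AtomicInteger and compareAndSet are the standard java.util.concurrent.atomic API.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int seen = value.get();
            // The CAS fails only if another thread changed the value in between,
            // which means that thread made progress: the system as a whole never stalls.
            if (value.compareAndSet(seen, seen + 1)) return seen + 1;
        }
    }

    int get() { return value.get(); }
}
```

This is lock free but not wait free: one unlucky thread may lose the CAS race over and over while the others keep succeeding.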

Conflict detection

A conflict can be:

Detected on open: when transaction declares its intent of accessing an object
Detected on validation: at some point during transaction execution
Detected on commit: extreme case of validation, just before transaction commits (essentially a must, unless all conflicts are detected on open)

Validation can be done either by value or by version number (the latter avoids the ABA problem)
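
The difference between the two kinds of validation shows up when a location goes A -> B -> A. The Java sketch below uses hypothetical names (VersionedCell, ReadEntry) and is single-threaded for clarity.

```java
// Hypothetical shared location whose version grows on every write.
class VersionedCell {
    private int value;
    private long version;
    VersionedCell(int value) { this.value = value; }
    synchronized int value() { return value; }
    synchronized long version() { return version; }
    synchronized void write(int newValue) { value = newValue; version++; }
}

// Hypothetical read-set entry: remembers what the transaction saw.
class ReadEntry {
    final VersionedCell cell;
    final int seenValue;
    final long seenVersion;
    ReadEntry(VersionedCell cell) {
        this.cell = cell;
        this.seenValue = cell.value();
        this.seenVersion = cell.version();
    }
    // Passes whenever the value looks the same, even after A -> B -> A.
    boolean validByValue() { return cell.value() == seenValue; }
    // Fails after any intervening write, including A -> B -> A.
    boolean validByVersion() { return cell.version() == seenVersion; }
}
```

If another transaction writes 2 and then 1 into a cell that held 1, validByValue() still returns true while validByVersion() returns false.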

Early conflict detection may abort a transaction that could otherwise have committed.
(TB and TC conflict with TA over two different objects...)

Late conflict detection discards more computation.

How to detect conflicts: either read/write sets (objects accessed by transaction; can be private or public) or reader/writer sets (transactions accessing objects).

Invalidation Policies

See Michael Scott's paper for a formal discussion

Lazy: TA and TB conflict if TA writes (an object), TB reads (the same object), and TA commits before TB
Eager W-R: Lazy, or TA writes, TB reads, and neither has committed
Mixed: Lazy, or TA reads, TB reads and writes, and neither has committed
Eager: Eager W-R, or TB reads, TA writes, and neither has committed

Lazy < Eager W-R < Eager
Lazy < Mixed < Eager

Doing something about conflicts

Validation: check the read set
Invalidation: check the reader set
Inconsistency toleration: allow the transaction to continue in an inconsistent state and recover from the consequences (validate when an exception is thrown, time out non-terminating loops, etc.)

A particularly bad example:
Thread 1:

ListNode res;
atomic {
  res = lHead;
  if (lHead != null)
    lHead = lHead.next;
}
// use res several times

Thread 2:

atomic {
  ListNode n = lHead;
  while (n != null) {
    n.val++;
    n = n.next;
  }
}

If the read set is private (visible only to the thread that keeps it) and inconsistency is tolerated, it is possible that thread 1 will see a modified value of res.val and later the unmodified value.

Reading 1. Basic Concepts and Design Space

2008-02-25T01:19:00.007+03:00

February 22nd, 2008

Main Syntactic Construct


atomic {
  x.Bar();
  y = x.Baz(); // (1)
  y.Foo();
}

atomic {
  y = null; // (2)
}

The TM system guarantees that code inside atomic blocks runs as if there were no other concurrent threads.
In the example, there is no data race between (1) and (2).
The TM system should detect transactional memory conflicts and abort (and restart) one of the transactions.

Operational Semantics

Replace atomic with synchronized(MasterLock). A correct TM system implementation should be equivalent.
A real-life system should do better!
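
Spelled out in Java, this reference semantics is tiny. GlobalLockTM and its atomic helper are names invented for this sketch; the only real API used is the synchronized statement.

```java
class GlobalLockTM {
    private static final Object MASTER_LOCK = new Object();

    // Reference semantics of atomic { body }: run the body under one global lock.
    static void atomic(Runnable body) {
        synchronized (MASTER_LOCK) {
            body.run();
        }
    }
}
```

For example, GlobalLockTM.atomic(() -> counter[0]++) called from several threads never loses an increment. It also never runs two atomic blocks in parallel, which is exactly what a real TM system is supposed to improve on.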

TM is not a concurrent programming panacea:

bool flagA = false;
bool flagB = false;

// Th1:
atomic {
  while(!flagA);
  flagB = true;
}

// Th2
atomic {
  flagA = true;
  while(!flagB);
}

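Th1 and Th2 deadlock! Each atomic block runs in isolation, so neither thread can see the other's write until the writing block commits, and neither block can commit. With smaller atomic blocks (one per flag access) they work.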
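
Here is the "smaller atomics" variant as a Java sketch in the global-lock model, where each atomic block is a synchronized section on one master lock. SmallAtomics and its helpers are hypothetical names; the timeout exists only so that the demo cannot hang.

```java
class SmallAtomics {
    private static final Object MASTER_LOCK = new Object();
    private static boolean flagA = false;
    private static boolean flagB = false;

    // Each helper is one small atomic block in the global-lock model.
    static boolean readA() { synchronized (MASTER_LOCK) { return flagA; } }
    static boolean readB() { synchronized (MASTER_LOCK) { return flagB; } }
    static void setA() { synchronized (MASTER_LOCK) { flagA = true; } }
    static void setB() { synchronized (MASTER_LOCK) { flagB = true; } }

    // Runs Th1 and Th2; returns true if both finish within timeoutMillis.
    static boolean runBoth(long timeoutMillis) {
        Thread th1 = new Thread(() -> {
            while (!readA()) Thread.yield(); // atomic { read flagA } in a loop
            setB();                          // atomic { flagB = true; }
        });
        Thread th2 = new Thread(() -> {
            setA();                          // atomic { flagA = true; }
            while (!readB()) Thread.yield(); // atomic { read flagB } in a loop
        });
        th1.setDaemon(true);
        th2.setDaemon(true);
        th1.start();
        th2.start();
        try {
            th1.join(timeoutMillis);
            th2.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !th1.isAlive() && !th2.isAlive();
    }
}
```

Because the lock is released between iterations of each waiting loop, Th2's write to flagA becomes visible to Th1, Th1 then sets flagB, and both threads terminate.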

Transaction Properties In TM System

TM-guaranteed properties:
Isolation - while transaction executes, no other transaction sees its changes, and vice versa.
Atomicity - transaction either does all changes it does to memory, or appears not to execute at all.
Not guaranteed:
Consistency - cannot be specified independently of a particular program

Exceptions

Exceptions thrown from inside an atomic block should commit the transaction. Otherwise, complicated handling of the exception object would have to be implemented.

Additional Features

retry - implementing condition variables

atomic {
  if (buffer.IsEmpty()) retry;
  var value = buffer.GetFirst();
  ...
}

Will only work if sufficient progress guarantees are provided by TM system (wait freedom or lock freedom - see next lecture)
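
In the global-lock model, retry can be approximated with wait/notifyAll on the master lock. This Java sketch rests on an assumption: retry is reached before the block has written anything, so there is nothing to roll back. RetryBuffer is a made-up name, and unlike GetFirst in the snippet above, take() here also removes the element.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RetryBuffer {
    private final Object masterLock = new Object();
    private final Deque<Integer> buffer = new ArrayDeque<>();

    // atomic { if (buffer.IsEmpty()) retry; take the first element }
    int take() throws InterruptedException {
        synchronized (masterLock) {
            while (buffer.isEmpty()) {
                // "retry": give up the lock and re-run the check after some other commit.
                masterLock.wait();
            }
            return buffer.removeFirst();
        }
    }

    // atomic { append v }
    void put(int v) {
        synchronized (masterLock) {
            buffer.addLast(v);
            // Every commit wakes the transactions blocked in retry.
            masterLock.notifyAll();
        }
    }
}
```

A real TM system would instead track the read set of the retrying transaction and wake it only when one of those locations changes.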

orElse

Weak Or Strong Isolation

Weak Isolation: transactions are only isolated from other transactions.
Strong Isolation: transactions are isolated from non-transactional code too (as if all memory accesses outside programmer-written atomics were wrapped in small atomics as well)

Nested Transactions

Generally, when nested transaction commits, its changes are only seen by the parent transaction.

flattened nested transactions: aborting a nested transaction aborts its parent
closed nested transactions: aborting a nested transaction only aborts itself.

However, open nested transactions are useful.
open nested transactions: commit of nested open transaction is immediately seen by all.
Reduces conflicts (e.g. gensym())

Exceptions

Should commit the transaction; otherwise non-trivial handling of the exception object would have to be implemented.

Intro

2008-02-25T01:09:00.002+03:00

This is the blog of an experimental Transactional Memory seminar at Math-Mech SPbSU.
We meet every Friday at 9:30 at 3502.
The idea is to read Transactional Memory by J. R. Larus and R. Rajwar, and the underlying papers along the way.
(Mitya has an electronic version, Ilya Sergey has a paper copy)

We'll post schedule and announcements here, as well as some sort of lecture notes.