sds_bptree_cow_instance - Man Page


#include <sds.h>

Data Fields

uint32_t checksum
sds_bptree_instance * bi
sds_bptree_transaction * txn
sds_bptree_transaction * tail_txn
uint64_t txn_count
pthread_rwlock_t * read_lock
pthread_mutex_t * write_lock

Detailed Description

The copy on write tree instance. This stores the current active reading transaction, number of active transactions, and pointers to the root of the b+tree.

This instance is the most important part of the tree and provides us with memory safety. Once we have memory safety, individual transactions are free to garbage collect on their own. Garbage collection is coordinated from the instance.

Memory Saftey

Within the transaction, we have a small number of important fields. These are:

  • read_lock
  • write_lock
  • tail_txn

Read Transaction Begin

At the beginning of a read transaction, we take a rolock on read_lock. This ensures that on our working thread, that the correct sequence of barriers for memory consistency of the tree is issued.

Within the read_lock, we use cpu atomics to increment the transaction reference count by 1. We then release the read_lock.

In this way, we allow multiple read transactions to be initiated at the same time, with the only serialisation point being the atomic increment.

While the read transaction is 'active' (IE the transaction that is listed in binst, where new transactions will reference), the reference count will never go below 1.

Write Transaction

At the beginning of a write transaction, we take the write_lock mutex. This guarantees we are the only writer in the tree at a point in time. Due to the way mutexes work, this guarantees our memory barries of the tree are issued, so we see all other writes that have occured.

During a write, we store a list of all nodes that were cloned: Both so that we know who we cloned from but aso what nodes we created.

Write Abort

During a write abort, we release the write lock immediately. We then use the created node list, and free all contents as they were never made active.

Write commit

During a commit, we prepare the transaction by changing it's state flags, and setting it's reference count atomically. We relinquish the created list, because we have no need to free the nodes that are about to become part of the tree. However, we update the previous transaction with the 'owned' list, as these are nodes we cloned from: IE these are nodes that serve no future role in the tree, so the previous transaction is the last point where they are valid and needed - The last transaction now must clean up after itself when it's ready to GC!

We set our reference count to 2, to indicate that a following node (the previous active transaction) relies on us, and that we are also active.

We now take the write variant of the rw read_lock. This waits for all in progress read transactions to begin, and we take exclusive access of this lock. We then pivot the older transaction out with our transaction. The read_lock is released, and read transactions may now resume.

At this point, we release the write_lock, and a new write may begin.

We decrement the reference count of the previous transaction which may cause it to drop to 0. This is the equivalent of a read transaction close. After this point, the previous transactions reference count will only decrement, as previous holders close their transactions. Increment only occurs from the active transaction!

Read transaction close.

When we close a read transaction, we require no locks. We use cpu atomic to set and fetch the reference count.

If the reference count remains positive, we complete, and continue.

If the reference count drops to 0, we take a reference to the next transaction in the series, and we free our node. We then decrement the reference counter of the next transaction and check the result.

Either, the reference count is > 0 because it is the active transaction or there is a current read transaction.

OR, the reference count reaches 0 because there are no active transactions, so we can free this node. in the same operation, we then decrement the reference count to the next node in the series, and we repeat this til we reach a node with a positive reference count.

Due to the design of this system, we are guarantee no thread races, as one and only one thread can ever set the reference count to 0: either the closing read, or the closing previous transaction. This gives us complete thread safety, but without the need for a mutex to ensure this.

Usage of the memory safe properties.

Due to the way the barriers are issued, when you have a read transaction or a write transaction, you can guarantee that the tree memory is consistent and correct, and that all values will remain alive until you close the transaction. This means you can use this to allow parallel tasks in complex cases. Directory Server has such a case with the plugin subsystem. Consider every operation needs a consistent list of plugins.

You build a tree of the plugin references (dlopen pointers, reference count, pblock) and store them.

When you begin the operation you take a read transaction of this. This means every plugin in the operation is guaranteed to be in the tree and held open for the duration of this operation. During the read, if another thread commits the delete on a plugin, this drops the ref count, but not below 0 - the operation can continue safely.

Once the operation is complete, the transaction is closed. At this point, the vacuum runs during close, which would trigger the value free function. This would now actually close the plugin pointers, values. Additionally, the correct use of the transaction guarantees the memory content of the plugin is correct due to the barriers inside of the transaction.

Field Documentation

sds_bptree_instance* sds_bptree_cow_instance::bi

Pointer to the current b+tree instance. This contains our function pointers for free, dup of key and values.

uint32_t sds_bptree_cow_instance::checksum

checksum of the instance data.

pthread_rwlock_t* sds_bptree_cow_instance::read_lock

The transaction read lock. This is used to prevent corruption of the txn pointer during a read or write transaction acquisition.

sds_bptree_transaction* sds_bptree_cow_instance::tail_txn

The pointer to the current 'oldest' alive transaction. This is needed for correct cleanup without corrupting intermediate trees.

Consider we have a set of transactions, such as A -> B -> C. Between each a set of changes has occured. We assume that if we have branches in these, then during a cow operation they can be updated. So we end up in this situation.

   [ A ]         [ B ]         [ C ]
     |             |             |
 [ root a ]    [ root b ]    [ root c ]
  /     \\       /     \\       /     \\
[ a ]  [ x ]  [ a ]  [ b ]  [ c ]  [ b ]

Now, when we clone txn B, because C performed a cow on the leaf A, txn B 'owns' leaf A. At this point, leaf A will be freed, which frees it from underneath transaction A. We have corrupted the tree of a previous node!

To prevent this, we only truly free a transaction when the ref count is 0 AND it's the last transaction in the list of live transactions. This way, B even though ref count reached 0, would not be freed til A is also freed.

sds_bptree_transaction* sds_bptree_cow_instance::txn

The pointer to the current 'latest' commited transaction. When a new read transaction is taken, this is the value that is returned (until a new write transaction is commited.)

uint64_t sds_bptree_cow_instance::txn_count

The number of active transactions in the system.

pthread_mutex_t* sds_bptree_cow_instance::write_lock

The transaction write lock. During a write transaction, this is held for the duration to allow only a single writer.


Generated automatically by Doxygen for dirsrv from the source code.


2.4.3" dirsrv