directory-dev mailing list archives

From Howard Chu <>
Subject Re: A proposal for MVCC+ optimistic concurrency control transactional system (Re: Thoughts about DIRSERVER-1663)
Date Tue, 04 Oct 2011 16:57:29 GMT
Selcuk AYA wrote:
> Hi All,
> this is a summary of what I have been working on and my approach to
> building a transactional system on top of partitions in ApacheDS. I
> have discussed part of what I summarize here with different people at
> different times.
> * For the transactional system, optimistic concurrency control is
> chosen as the workload is probably not update heavy. MVCC concurrency is
> chosen so that readers are not blocked.
> * We will enforce a data model where there is a master table and
> indices built from the master table. We can do a put(key) operation on
> the master table and add/remove(key) operations on the index tables.
> Say we call these operations actions. We require each partition to
> expose its master table and index tables and implement the actions in
> an atomic way. We also require scans of the master table and index
> tables to return data with read-committed consistency, where the
> execution of each modification action is a commit.
>              ** As an aside, I believe these consistency requirements
> are there in HBASE (Stefan, please correct me if this is not the case)
> and the recent JDBM changes were actually targeted at achieving a similar
> (but currently stronger) consistency.
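
A rough picture of the contract being described here - a partition just exposes its
master and index tables, and each action on them is atomic. All names below are
illustrative, not the actual ApacheDS API:

    // Illustrative only -- not the actual ApacheDS partition API; names are made up
    // to picture the "expose master + index tables, actions are atomic" contract.
    import java.util.UUID;

    interface MasterTable<E> {
        // atomic put of a master record; a scan sees it entirely or not at all
        void put(UUID id, E entry);
        E get(UUID id);
    }

    interface IndexTable<K> {
        void add(K key, UUID id);      // atomic insert of one index tuple
        void remove(K key, UUID id);   // atomic removal of one index tuple
    }

    interface PartitionTables<E> {
        MasterTable<E> masterTable();
        IndexTable<String> index(String attributeType);
    }
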
> * There will be an operation manager layer to handle operations using
> the master and index tables exposed by the partitions. Currently
> partitions have an interface which forces each partition to
> implement the modifications in its own way. With these changes,
> partitions will just expose master and index tables and modifications
> will be done at the operation manager layer.
> * For optimistic concurrency control, transactions keep an intention log
> as they execute. This intention log is built at the operation
> manager layer. When they are about to commit, they check whether they
> have any conflict with any of the committed transactions. If not, they
> commit; if so, they abort and retry. To detect conflicts, we will
> keep the set of DNs that the txn touched. Also, the intention log will
> be kept in memory.
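
The validation step essentially amounts to comparing DN sets against every txn that
committed after the committing txn started. A hypothetical sketch (made-up names,
not ApacheDS code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class Txn {
        final long startVersion;                              // committed version seen at start
        long commitVersion;                                   // assigned at commit time
        final Set<String> touchedDns = new HashSet<>();       // DNs this txn read or wrote
        final List<Object> intentionLog = new ArrayList<>();  // in-memory deferred actions

        Txn(long startVersion) { this.startVersion = startVersion; }
    }

    class TxnManager {
        private long committedVersion = 0;
        private final List<Txn> recentlyCommitted = new ArrayList<>();

        synchronized boolean tryCommit(Txn txn) {
            for (Txn other : recentlyCommitted) {
                // conflict: a txn that committed after we started touched one of our DNs
                if (other.commitVersion > txn.startVersion
                        && !Collections.disjoint(other.touchedDns, txn.touchedDns)) {
                    return false;   // the caller aborts and retries
                }
            }
            txn.commitVersion = ++committedVersion;
            recentlyCommitted.add(txn);
            return true;
        }
    }
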
> * When a txn commits, its intention log is appended to the logging
> subsystem. There will be a log management layer which will handle
> appending of data and syncing of log data to disk as txns commit. When
> a txn commits and appends its intention log, its changes are not
> immediately flushed to the underlying partitions. This is done later.
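
The log management layer described here boils down to an append-plus-fsync path. A
bare-bones sketch (a real implementation would batch the syncs across concurrently
committing txns):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class LogManager {
        private final FileChannel channel;

        LogManager(String path) throws IOException {
            channel = FileChannel.open(Paths.get(path),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND);
        }

        synchronized void append(byte[] serializedIntentionLog) throws IOException {
            channel.write(ByteBuffer.wrap(serializedIntentionLog));
            channel.force(false);   // sync log data to disk before the commit is acknowledged
        }
    }
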
> * Suppose we have 3 txns T1, T2 and T3 where T1 and T3 are "committed"
> write txns whose changes are not flushed to the partitions yet. T2 is
> a read-only txn. T2 started before T3 and after T1. Then we avoid
> flushing the changes of T3 to the partitions as long as T2 goes on, and
> to get a consistent view, T2 should merge what it reads from the
> partitions with the changes of T1. This handles MVCC. To make merging
> and flushing of data easier, an unflushed txn's changes are kept in
> memory.

This sounds strange - why are the changes not yet flushed, if T1 is already committed?
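
Whatever the reason for deferring the flush, the merge T2 has to do could look
roughly like this - the partition holds the flushed state, and the changes of txns
that committed before the reader started are overlaid from memory. All the types
below are made up:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.UUID;

    class ReadView<E> {
        private final long readVersion;                            // commit version at txn start
        private final NavigableMap<Long, Map<UUID, E>> unflushed;  // commit version -> changed entries
        private final Map<UUID, E> partitionState;                 // stands in for the on-disk master table

        ReadView(long readVersion,
                 NavigableMap<Long, Map<UUID, E>> unflushed,
                 Map<UUID, E> partitionState) {
            this.readVersion = readVersion;
            this.unflushed = unflushed;
            this.partitionState = partitionState;
        }

        E lookup(UUID id) {
            // newest change visible to us wins: e.g. T2 sees T1's change but not T3's
            for (Map<UUID, E> changes
                    : unflushed.headMap(readVersion, true).descendingMap().values()) {
                if (changes.containsKey(id)) {
                    return changes.get(id);   // null here would mean the entry was deleted
                }
            }
            return partitionState.get(id);
        }
    }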

> * To handle search as above, the search engine puts decorators around the
> index and master tables that the partitions expose. These decorators
> read data from the partitions and then consult the txn log to get a
> consistent view of the index and master tables. Again, search is
> implemented outside the partition; partitions just expose master
> and index tables.
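
A sketch of what such a decorator around an index scan could look like - the raw
partition scan is filtered and extended using the txn log before the search engine
sees it. Everything here is hypothetical, not the ApacheDS cursor API:

    import java.util.Iterator;
    import java.util.UUID;
    import java.util.function.Predicate;

    class TxnAwareIndexScan implements Iterable<UUID> {
        private final Iterable<UUID> partitionScan;  // raw scan exposed by the partition
        private final Predicate<UUID> stillVisible;  // consults the txn log: drops deleted IDs
        private final Iterable<UUID> addedByTxns;    // IDs added by committed-but-unflushed txns

        TxnAwareIndexScan(Iterable<UUID> partitionScan,
                          Predicate<UUID> stillVisible,
                          Iterable<UUID> addedByTxns) {
            this.partitionScan = partitionScan;
            this.stillVisible = stillVisible;
            this.addedByTxns = addedByTxns;
        }

        public Iterator<UUID> iterator() {
            Iterator<UUID> base = partitionScan.iterator();
            Iterator<UUID> extra = addedByTxns.iterator();
            return new Iterator<UUID>() {
                private UUID next = advance();

                private UUID advance() {
                    while (base.hasNext()) {
                        UUID id = base.next();
                        if (stillVisible.test(id)) {
                            return id;   // partition result still valid under the txn log
                        }
                    }
                    return extra.hasNext() ? extra.next() : null;
                }

                public boolean hasNext() { return next != null; }

                public UUID next() {
                    UUID current = next;
                    next = advance();
                    return current;
                }
            };
        }
    }
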
> please let me know if you have any questions or comments,
> regards,
> Selcuk
> On Mon, Oct 3, 2011 at 12:18 PM, Emmanuel Lecharny<>  wrote:
>> Hi guys,
>> this error is a pretty annoying one. We had a convo with Selcuk last Friday
>> about it, which is summed up here.
>> Basically, what happens is that when we have multiple threads doing a search
>> while some others are adding/deleting entries which are potentially part
>> of the returned results, we get some NPEs. This is due to the fact that we
>> use a cursor on an index which uses IDs of entries that may have been
>> removed by the time we try to read them.
>> The discussion we had led to the conclusion that we need to implement a
>> transaction system to protect the client from such problems. This can
>> probably be implemented on top of what we have, even if it kills
>> performance.
>> OTOH, at some point, what we really need is to implement an MVCC system on
>> top of the backend.
>> MVCC is a system which keeps old versions of elements until they aren't
>> needed anymore. For instance, when we do a search, we will browse some
>> entries using their IDs, provided by an index. When we start the search, we
>> select the best possible index to browse the entries, and we get back a set
>> of IDs. If we associate this operation with a unique transaction ID, we
>> must guarantee that all the IDs from the set will be present until the
>> cursor is totally read (or the search cancelled). If a modification is done
>> on one of the entries associated with one of those IDs, then we should still
>> be able to access the previous version of the entry. Such a modification must
>> create a copy of the entry itself, but also of all the tuples in the indexes,
>> associated with a revision number. The incoming transaction will use this
>> revision number to get an immutable set of IDs.

Sounds like you plan to do MVCC at the logical record level. In OpenLDAP's MDB 
it's done at the database page level. I think, particularly when you can have 
multiple records residing on the same DB page, that managing pages is easier.
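
For concreteness, record-level MVCC with revision-tagged index tuples, as described
in the quoted mail, could look roughly like the toy sketch below (made-up names;
deletions would need tombstones handled the same way, omitted here):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.UUID;

    class RevisionedIndex<K> {
        // key -> (revision -> IDs added at that revision)
        private final Map<K, NavigableMap<Long, Set<UUID>>> tuples = new HashMap<>();

        void add(K key, UUID id, long revision) {
            tuples.computeIfAbsent(key, k -> new TreeMap<>())
                  .computeIfAbsent(revision, r -> new HashSet<>())
                  .add(id);
        }

        // the immutable ID set as seen by a transaction that started at 'revision'
        Set<UUID> lookup(K key, long revision) {
            NavigableMap<Long, Set<UUID>> versions =
                    tuples.getOrDefault(key, Collections.emptyNavigableMap());
            Set<UUID> result = new HashSet<>();
            for (Set<UUID> ids : versions.headMap(revision, true).values()) {
                result.addAll(ids);
            }
            return Collections.unmodifiableSet(result);
        }
    }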

>> Now, at some point, that will create a hell of a lot of new entries and tuples
>> in the index tables. We must implement a system to clean up those duplicates
>> once they are not in use anymore. There are two ways to handle such a cleanup:
>> - keep all the duplicates in the backend, removing them when no operation is
>> associated with the old revision

MDB does this. Also by keeping all old data around, you eliminate the need for 
a write-ahead transaction log, so again, one less thing to manage.

>> - or create a rollback table, where the old elements are stored, with a
>> limited size
>> The second solution is what Oracle is using. It's efficient, as you don't
>> have to update the main database, except when you have to grab old revisions.
>> All the old elements are simply pushed into this rollback table (rollback
>> segment), and are available as long as they are not pushed out by newer
>> elements (the table has a limited size).
>> PostgreSQL has implemented the first solution. The biggest advantage is that
>> an old revision can never be pushed out while it is still needed, so you
>> can't get an error, but the database may be huge. You also need a
>> thread to do the cleanup.

Cleanup doesn't actually require a dedicated thread; a writer can simply check 
that there are unused pages older than the oldest reader, and start grabbing 
those pages to fulfill its write. This is what MDB does.
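
In code, that reclaim rule amounts to something like this toy sketch (not MDB code;
pages and txn ids are plain longs here):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.NavigableSet;
    import java.util.TreeMap;
    import java.util.TreeSet;

    class PageReclaimer {
        private final NavigableSet<Long> activeReaders = new TreeSet<>();          // reader txn ids
        private final NavigableMap<Long, List<Long>> freedPages = new TreeMap<>(); // freed-at txn id -> page numbers

        synchronized void beginRead(long txnId)  { activeReaders.add(txnId); }
        synchronized void endRead(long txnId)    { activeReaders.remove(txnId); }
        synchronized void freePages(long txnId, List<Long> pageNos) { freedPages.put(txnId, pageNos); }

        // called by a writer that needs pages: grab everything older than the oldest reader
        synchronized List<Long> reclaim() {
            long oldestReader = activeReaders.isEmpty() ? Long.MAX_VALUE : activeReaders.first();
            NavigableMap<Long, List<Long>> reusable = freedPages.headMap(oldestReader, false);
            List<Long> result = new ArrayList<>();
            for (List<Long> pages : reusable.values()) {
                result.addAll(pages);
            }
            reusable.clear();   // those pages are now owned by the writer again
            return result;
        }
    }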

>> In any case, I just wanted to initiate a discussion about this problem and
>> the potential solutions, so feel free to add your vision and knowledge in
>> your response. It would be valuable to define a roadmap for such an
>> implementation, and to discuss the different steps before diving into the
>> code...

This would be a good conversation for Heidelberg ;)  I can go into full detail 
after the MDB presentation.

   -- Howard Chu
   CTO, Symas Corp. 
   Director, Highland Sun
   Chief Architect, OpenLDAP
