directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Selcuk AYA <>
Subject Re: A proposal for MVCC+ optimistic concurrency control transactional system(Re: Thoughts about DIRSERVER-1663)
Date Wed, 05 Oct 2011 12:38:54 GMT
On Wed, Oct 5, 2011 at 12:18 PM, Emmanuel Lécharny <> wrote:
> On 10/4/11 6:57 PM, Howard Chu wrote:
>> Selcuk AYA wrote:
>>> Hi All,
>>> this is a summary of what I have been working on and my approach to
>>> building a transactional system on top of partitions in ApacheDS. I
>>> have discussed part of what I summarize here with different people at
>>> different times.
> Thanks for that !
>>> * For the transactional system, optimistic concurrency control is
>>> chosen as workload is probably not update heavy.  MVCC concurrecy is
>>> chosen so that readers are not blocked.
> Ok.
>>> *We will enforce a data model where there is a master table and
>>> indices built from the master table. We can do put(key) operation on
>>> the master table and add/remove(key) operation on the index tables.
> We should also be able to do some remove() from the master table...
>>> Say we call these operations actions. We require each partition to
>>> expose its master table and index tables and implement the actions in
>>> an atomic way. We also require the scan of the master table and index
>>> tables to return data with read committed consistency where execution
>>> of each modification action is a  commit.
>>>             ** As an aside, I believe these consistency requirements
>>> are there in  HBASE(Stefan please correct me if this is not the case)
>>> and recent JDBM changes actually were targeted achieving a similar(
>>> but currently stronger) consistency.
>>> *There will an operational manager layer to handle operations using
>>> the master and index tables exposed by the partitions. Currently
>>> partitions have an interface which enforces each partition to
>>> implement the modifications in its own way. With these changes,
>>> partitions will just expose master and index tables and modifications
>>> will be done at the operation manager layer.
> Sounds good.
>>> *For optimistic concurrecy control, transactions keep an intention log
>>> as they execute. This intention log is built at the operational
>>> manager layer. When they are about to commit, they check whether they
>>> have any conflict with any of the committed transactions. If no, they
>>> commit, if yes, they abort and retry. To detect conflicts, we will
>>> keep the DN set that the txn touched. Also, intention log will be kept
>>> in memory.
> You didn't introduced the term 'transaction' before. I assume it covers
> every LDAP operation doing a modification, excluding search requests or
> lookups.
it includes search as well. Search will be a transaction with MVCC
consistency guarantee.

> When you write "we will keep the DN set that the Txn touched", we need to be
> a bit more specific : what about MODDN operations that move potentially
> millions of entries ? Do we keep the original and modified DNs only, or will
> we store all the moved entry's DN ? (I guess not, otherwise the system will
> simply not scale)
Say we move dn1 to dn2. We will keep dn1.* and dn2.* as the set of
changed dns. So if another txn tries to modify entry with dn1.dn3,
then it will conflict with dn1.*.

That said, we might have a problem if a modification changes huge
number of indices. This is because I am planning to keep the changes
to the indices in the unflushed part of the WAL logs to make forward
merge with the logs easier. MODDN is the operation that might
currently cause this problem because it updates the sublevel index.

>>> *when a txn commits, its intention log is appended to the loggin
>>> subsystem.
> What is the 'loggin subsystem' ? How does it differ from the intention log,
> except that may be the loggin subsystem is stored on disk ?

Logging subsystem is the Write Ahead Log.

>>> There will be a log management layer which will handle
>>> appending of data and syncing of log data to disk as txns commit.When
>>> a txn commits and appends its intention log, its changes are not
>>> immediately flushed to the underlying partitions. This is done later.
> So the mechanism will be :
> -1- start a txn (a modification request)
> -2- store the txn in the intention log
> -3- process it
> -4- when done, check that no other txn has modified the same entry
> -5- if not, commit the txn, append the txn's intention log into the loggin
> subsystem
> -6- otherwise, rollback the txn and move back to step 2
> -7- under the hood, some log management will sync the data on disk
> Is that correct ?
yes, nice summary:)

> If so, I have a Q. Suppose we have 2 txns on the same entry, T1 and T2. T1
> is executed before T2. Suppose also that T1 takes longer to process than T2,
> and that T2 will try to commit the txn : at step 4, we now will check if
> another txn isn't already trying to modify the same entry, and as T1 has
> been started before, we will get a failure. Fine. But then, why can't we
> check that when T2 is submitted and injected in intention log (step 2) ?

actually who fails is determined by commit time. So the execution
history you describe is like this:

T1 starts, T2 starts,...., T2 tries to commit, T1 tries to commit

assuming no other txn tried to commit, then T2 would commit in this
case and T1 would be rejected.

There would be ways to detect the conflict early but detecting the
conflict at commit time works.

>>> * Suppose we have 3 txns T1, T2 and T3 where T1 and T3 are "commited"
>>> write txns and their changes are not flushed to partitions yet. T2 is
>>> a read only txn.
> Hmmm. Seems that teh definition of a transaction I gave above is not
> correct...
>>> T2 started before T3 and after T1. Then we avoid
>>> flushing changes of T3 to partitions as long as T2 goes on. Then to
>>> get a consistent view, T2 should merge what it reads from the
>>> partitions with the changes of T1. This handles MVCC. To make merging
>>> and flushing of data easier, an unflushed txn's changes are kept in
>>> memory.
>> This sounds strange - why are the changes not yet flushed, if T1 is
>> already committed?
> AFAIU, because the flush is a differed operation, and because there is a T2
> txn using the data. I guess that T3 will be flushed to the backend as soon
> as T2 terminate. Right ?
>>> * To handle search as above, search engine puts decorators around the
>>> index and master tables that the partitions expose. These decorators
>>> read data from the partitions and then consult the txn log to get a
>>> consistent view of the index and master tables. Again search is
>>> implemented outside the partition, partitions just implement master
>>> and index tables.
> Ok, fine.
> Continuing the discussion on Howard's comments on my initial mail below...
>>> <snip/>
>>>> MVCC is a system which keeps old versions of elements until they aren't
>>>> needed anymore. For instance, when we do a search, we will browse some
>>>> entries using their IDs, provided by an index. When we start the search,
>>>> we
>>>> select the best possible index to browse the entries, and we get back a
>>>> set
>>>> of IDs. If we associate this operation with an unique transaction ID, we
>>>> must guarantee that all the IDs from the set will be present until the
>>>> cursor is totally read (or the search cancelled). If a modification is
>>>> done
>>>> on one of the entry associated with one of those IDs, then we still
>>>> should
>>>> be able to access to the previous entry. Such a modification must create
>>>> a
>>>> copy of the entry itself, but also of all the tuples in the indexes,
>>>> associated with a revision number. The incoming transaction will use
>>>> this
>>>> revision number to get an immutable IDs set.
>> Sounds like you plan to do MVCC at the logical record level. In OpenLDAP's
>> MDB it's done at the database page level. I think, particularly when you can
>> have multiple records residing on the same DB page, that managing pages is
>> easier.
> The big advantage of implementing MVCC higher on the stack is that we can
> use non-MVCC base below. It's not perfect, and it would be great if we could
> have a MVCC native backend, but this is a first step.
>>>> Now, at some point, that will create a hell lots of new entries and
>>>> tuples
>>>> in index tables. We must implement a system to clean up those duplicates
>>>> once they are not in use. There are two ways to handle such a clean up :
>>>> - keep all the duplicates in the backend, removing them when no
>>>> operation is
>>>> associated with the old revision
>> MDB does this. Also by keeping all old data around, you eliminate the need
>> for a write-ahead transaction log, so again, one less thing to manage.
> But you trade this by needing a system that cleanup the dead revisions.
>>>> - or create a rollback table, where the old elements are stored, with a
>>>> limited size
>>>> The second solution is what Oracle is using. It's efficient, except when
>>>> you
>>>> have to grab old revisions, as you don't have to update the main
>>>> database.
>>>> All the old elements are simply pushed into this rollback table
>>>> (rollback
>>>> segment), and are available as long as they are not pushed out by newer
>>>> elements (the table has a limited size).
>>>> Postgresql has implemented the first solution. The biggest advantage is
>>>> that
>>>> you can't have an error, but the database may be huge. You also need a
>>>> thread to do the cleanup.
>> Cleanup doesn't actually require a dedicated thread; a writer can simply
>> check that there are unused pages older than the oldest reader, and start
>> grabbing those pages to fulfill its write. This is what MDB does.
> True. But in any case, you still have to do the cleanup, and it can be a
> costly process. Well, managing the log is also costly. Rahhh, nothing come
> for free those days :/
>>>> In any case, I just wanted to initiate a discussion about this problem
>>>> and
>>>> the potential solutions, so feel free to add your vision and knowledge
>>>> in
>>>> your response. It would be valuable to define a roadmap for such an
>>>> implementation, and to discuss the different steps before diving into
>>>> the
>>>> code...
>> This would be a good conversation for Heidelberg ;)  I can go into full
>> detail after the MDB presentation.
> Sure ! Btw, I'm arriving with Pierre-Arnaud on sunday afternoon ( around
> 4pm).
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny

View raw message