directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Selcuk AYA <ayasel...@gmail.com>
Subject A proposal for MVCC+ optimistic concurrency control transactional system(Re: Thoughts about DIRSERVER-1663)
Date Tue, 04 Oct 2011 09:40:24 GMT
Hi All,
this is a summary of what I have been working on and my approach to
building a transactional system on top of partitions in ApacheDS. I
have discussed part of what I summarize here with different people at
different times.

* For the transactional system, optimistic concurrency control is
chosen as workload is probably not update heavy.  MVCC concurrecy is
chosen so that readers are not blocked.

*We will enforce a data model where there is a master table and
indices built from the master table. We can do put(key) operation on
the master table and add/remove(key) operation on the index tables.
Say we call these operations actions. We require each partition to
expose its master table and index tables and implement the actions in
an atomic way. We also require the scan of the master table and index
tables to return data with read committed consistency where execution
of each modification action is a  commit.
            ** As an aside, I believe these consistency requirements
are there in  HBASE(Stefan please correct me if this is not the case)
and recent JDBM changes actually were targeted achieving a similar(
but currently stronger) consistency.

*There will an operational manager layer to handle operations using
the master and index tables exposed by the partitions. Currently
partitions have an interface which enforces each partition to
implement the modifications in its own way. With these changes,
partitions will just expose master and index tables and modifications
will be done at the operation manager layer.

*For optimistic concurrecy control, transactions keep an intention log
as they execute. This intention log is built at the operational
manager layer. When they are about to commit, they check whether they
have any conflict with any of the committed transactions. If no, they
commit, if yes, they abort and retry. To detect conflicts, we will
keep the DN set that the txn touched. Also, intention log will be kept
in memory.

*when a txn commits, its intention log is appended to the loggin
subsystem. There will be a log management layer which will handle
appending of data and syncing of log data to disk as txns commit.When
a txn commits and appends its intention log, its changes are not
immediately flushed to the underlying partitions. This is done later.

* Suppose we have 3 txns T1, T2 and T3 where T1 and T3 are "commited"
write txns and their changes are not flushed to partitions yet. T2 is
a read only txn. T2 started before T3 and after T1. Then we avoid
flushing changes of T3 to partitions as long as T2 goes on. Then to
get a consistent view, T2 should merge what it reads from the
partitions with the changes of T1. This handles MVCC. To make merging
and flushing of data easier, an unflushed txn's changes are kept in
memory.

* To handle search as above, search engine puts decorators around the
index and master tables that the partitions expose. These decorators
read data from the partitions and then consult the txn log to get a
consistent view of the index and master tables. Again search is
implemented outside the partition, partitions just implement master
and index tables.

please let me know if you have any questions or comments,
regards,
Selcuk

On Mon, Oct 3, 2011 at 12:18 PM, Emmanuel Lecharny <elecharny@gmail.com> wrote:
> Hi guys,
>
> this error is a pretty annoying one. We had a convo with Selcuk last friday
> about it, which is sum up here.
>
> Basically, what happens is that when we have multiple threads doing a search
> while some other are adding/deleting some entries which are potentially part
> of the returned results, we get some NPE. This is due to the fact that we
> use a cursor on an index which uses IDs of entries that can have been
> removed when we try to read them.
>
> The discussion we had led to the fact that we need to implement a
> transaction system to protect the client from such problem. This can
> probably be implemented on top of what we have, even if it kills the
> performances.
>
> OTOH, at some point, what we really need is to implement a MVCC system on
> top of the backend.
>
> MVCC is a system which keeps old versions of elements until they aren't
> needed anymore. For instance, when we do a search, we will browse some
> entries using their IDs, provided by an index. When we start the search, we
> select the best possible index to browse the entries, and we get back a set
> of IDs. If we associate this operation with an unique transaction ID, we
> must guarantee that all the IDs from the set will be present until the
> cursor is totally read (or the search cancelled). If a modification is done
> on one of the entry associated with one of those IDs, then we still should
> be able to access to the previous entry. Such a modification must create a
> copy of the entry itself, but also of all the tuples in the indexes,
> associated with a revision number. The incoming transaction will use this
> revision number to get an immutable IDs set.
>
> Now, at some point, that will create a hell lots of new entries and tuples
> in index tables. We must implement a system to clean up those duplicates
> once they are not in use. There are two ways to handle such a clean up :
> - keep all the duplicates in the backend, removing them when no operation is
> associated with the old revision
> - or create a rollback table, where the old elements are stored, with a
> limited size
>
> The second solution is what Oracle is using. It's efficient, except when you
> have to grab old revisions, as you don't have to update the main database.
> All the old elements are simply pushed into this rollback table (rollback
> segment), and are available as long as they are not pushed out by newer
> elements (the table has a limited size).
>
> Postgresql has implemented the first solution. The biggest advantage is that
> you can't have an error, but the database may be huge. You also need a
> thread to do the cleanup.
>
> In any case, I just wanted to initiate a discussion about this problem and
> the potential solutions, so feel free to add your vision and knowledge in
> your response. It would be valuable to define a roadmap for such an
> implementation, and to discuss the different steps before diving into the
> code...
>
> Thanks !
>
> --
> Regards,
> Cordialement,
> Emmanuel L├ęcharny
> www.iktek.com
>
>

Mime
View raw message