directory-dev mailing list archives

From Emmanuel Lécharny <>
Subject Re: Txn discussion
Date Sun, 10 Jun 2012 00:00:31 GMT
On 6/9/12 11:46 PM, Selcuk AYA wrote:
> Let's say we sacrifice cross-partition txns. I think that is OK.

It's not a sacrifice. I see it as a decision to postpone it for now.
> On Sat, Jun 9, 2012 at 10:45 PM, Howard Chu<>  wrote:
>> Emmanuel Lécharny wrote:
>>> Hi guys,
>>> independently from the ongoing work on the txn layer, I'd like to start
>>> a thread of discussion about the path we selected, and the other
>>> possible options.
>>> Feel free to express your opinion here, I'll create a few items I'd like
>>> to see debated.
>>> 1) Introduction
>>> We badly need to have a consistent system. The fact is that the current
>>> trunk (and I guess this is true for all the releases we have done so
>>> far) suffers from a serious issue when multiple modifications are
>>> done during searches. The reason is that we depend on a BTree
>>> implementation that exposes a data structure directly reading the pages
>>> containing the data, expecting those pages to remain unchanged in the
>>> long run. Obviously, when we browse more than one entry, we are likely to
>>> see a modification changing the data...
>>> 2) txn layer
>>> There are a few ways to get this problem solved:
>>> - we can have a MVCC backend, and a protection against concurrent
>>> modifications. Any read will always succeed, as each read will use a
>>> revision and only one.
> Let's say we want to implement a txn system within JDBM. We have to
> implement this not within a single B+ tree but across B+ trees.
Yes. But that does not really matter, as long as two modifications can't
occur concurrently.
> How
> will this be different from what we are trying to implement now? We
> still need a WAL log keeping track of txns on top of B+ trees, changes
> could be kept track of in terms of pages or entries and indices. Old
> version of data has to be copied over to some other location before
> newer version can overwrite it or newer version has to be kept at
> location X as long as readers need the old data. Any MVCC system has
> to do something like this.
No, we don't need all this machinery if we block all other modifications
while a modification is being processed. I agree that modifications will
be slower, but this is a price I am willing to pay if, at the same time, I
can guarantee consistent *and* concurrent reads.
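
As a rough illustration of the serialized-writer idea (this is not JDBM code; `SerializedWriterStore`, `Snapshot`, and `put` are names made up for the example, and a plain `TreeMap` stands in for the BTree), a single write lock plus immutable snapshots could look like this:

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one writer at a time, readers never block.
class SerializedWriterStore {

    /** An immutable view of the data at exactly one revision. */
    static final class Snapshot {
        final long revision;
        final Map<String, String> entries;

        Snapshot(long revision, Map<String, String> entries) {
            this.revision = revision;
            this.entries = Collections.unmodifiableMap(entries);
        }
    }

    // Readers pick up whatever snapshot is current and keep using it.
    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot(0L, new TreeMap<String, String>()));

    // Serializes modifications: a second writer waits for the first one.
    private final ReentrantLock writeLock = new ReentrantLock();

    /** A search calls this once and reads only from the returned snapshot. */
    Snapshot snapshot() {
        return current.get();
    }

    /** Applies one modification and publishes it as a new revision. */
    long put(String key, String value) {
        writeLock.lock();
        try {
            Snapshot old = current.get();
            // Assumption for brevity: copy the whole map. A real BTree
            // would copy only the pages touched by the modification.
            TreeMap<String, String> copy = new TreeMap<>(old.entries);
            copy.put(key, value);
            Snapshot next = new Snapshot(old.revision + 1, copy);
            current.set(next);
            return next.revision;
        } finally {
            writeLock.unlock();
        }
    }
}
```

A search that took its snapshot before a `put` keeps seeing the old revision for as long as it holds the reference, which is the "one revision and only one" property, at the cost of making writers wait for each other.
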
> For us, the newer version of the data is kept in the WAL as long as a
> reader needs the old version. As explained below, for simplicity we keep a
> copy of the WAL in memory in a format that makes merging data for readers
> easier and faster. More on this below.
> I think what we implement right now is not very different from what we
> would implement inside a single partition.
With a single partition, I don't need to keep anything in memory, 
assuming I serialize the modifications.
>>> - we can also quickly read the results and store them somewhere, blocking
>>> the modification until the read is finished.
>>> - or we can keep a copy of the modified elements within the original
>>> elements, until the searches that use those elements are finished.
>>> (there are probably some other solutions, but I don't know them)
>>> AFAICT, the transaction branch is implementing the third solution,
>>> keeping the copy of modified elements in memory, so that they can be
>>> sent back to the user.
> It is true that the current txn system makes use of in-memory copies
> for fast merging of data. However, what it really does is just keep
> a copy of the txn WAL in memory. This can be extended to discard the
> in-memory copy and read directly from the WAL when memory exceeds some
> threshold, for example. Implementing reads from memory was just easier.
> Also think of adding another partition tomorrow. Say an HBase partition
> is added which exposes atomic writes and atomic reads or
> consistent scans. If we plug that partition into what we are
> implementing right now, txns over HBase partitions would just work
> without much effort.
Yes. What you have written is also a way to keep partitions dumb. What
I'm suggesting forces you to have MVCC-capable partitions, which is a
real hassle. Now, let's face it: do we need anything else, atm? Plus,
HBase already implements a similar system to protect reads against
concurrent modifications, so we don't necessarily need to have it.
Also keep in mind that if we want to implement the solution I proposed, 
we still need to modify the code to protect the partitions against 
concurrent modifications, and to leverage the MVCC parts in JDBM (and 
probably write the versions on disk too).
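
For comparison, the WAL-based approach discussed above, where the partition stays dumb and readers merge the in-memory log on top of it, might be sketched as follows. This is only my reading of the idea, not the code from the txn branch: `WalOverlayReader`, `LogRecord`, `beginRead`, and `flushUpTo` are hypothetical names, and a `TreeMap` stands in for the partition.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: versioned reads over a partition that knows
// nothing about versions, using an in-memory copy of the log.
class WalOverlayReader {

    /** One committed change: a put, or a delete when value is null. */
    static final class LogRecord {
        final long txnId;
        final String key;
        final String value;

        LogRecord(long txnId, String key, String value) {
            this.txnId = txnId;
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> partition = new TreeMap<>();
    private final List<LogRecord> wal = new ArrayList<>(); // in txn order
    private long lastCommitted = 0L;

    /** Commits a change to the log only; the partition is untouched. */
    synchronized long commit(String key, String value) {
        long id = ++lastCommitted;
        wal.add(new LogRecord(id, key, value));
        return id;
    }

    /** A reader remembers the last txn committed when it started. */
    synchronized long beginRead() {
        return lastCommitted;
    }

    /** Merges the partition value with log records the reader may see. */
    synchronized String read(String key, long readVersion) {
        String result = partition.get(key);
        for (LogRecord r : wal) {
            if (r.txnId > readVersion) {
                break; // newer than this reader's view
            }
            if (r.key.equals(key)) {
                result = r.value;
            }
        }
        return result;
    }

    /**
     * Applies log records to the partition once no active reader needs
     * the older data. The caller must pass the oldest active read version.
     */
    synchronized void flushUpTo(long oldestActiveReadVersion) {
        Iterator<LogRecord> it = wal.iterator();
        while (it.hasNext()) {
            LogRecord r = it.next();
            if (r.txnId > oldestActiveReadVersion) {
                break;
            }
            if (r.value == null) {
                partition.remove(r.key);
            } else {
                partition.put(r.key, r.value);
            }
            it.remove();
        }
    }
}
```

The trade-off is visible in the two sketches: here the partition needs no MVCC support at all, but the log has to be kept (in memory, in this sketch) for as long as some reader still needs the older data.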

Emmanuel Lécharny
