directory-dev mailing list archives

From Emmanuel Lécharny <elecha...@gmail.com>
Subject Re: Txn discussion
Date Sun, 10 Jun 2012 00:00:31 GMT
On 6/9/12 11:46 PM, Selcuk AYA wrote:
> Let's say we sacrifice cross-partition txns. I think that is OK.

It's not a sacrifice. I see it as a decision to postpone it for now.
>
>
>
> On Sat, Jun 9, 2012 at 10:45 PM, Howard Chu<hyc@symas.com>  wrote:
>> Emmanuel Lécharny wrote:
>>> Hi guys,
>>>
>>> independently from the ongoing work on the txn layer, I'd like to start
>>> a thread of discussion about the path we selected, and the other
>>> possible options.
>>>
>>> Feel free to express your opinion here, I'll create a few items I'd like
>>> to see debated.
>>>
>>> 1) Introduction
>>>
>>> We badly need to have a consistent system. The fact is that the current
>>> trunk (and I guess this is true for all the releases we have done so
>>> far) suffers from serious issues when multiple modifications are
>>> done during searches. The reason is that we depend on a BTree
>>> implementation that exposes a data structure directly reading the pages
>>> containing the data, expecting those pages to remain unchanged in the
>>> long run. Obviously, when we browse more than one entry, we are likely to
>>> see a modification changing the data...
>>>
>>> 2) txn layer
>>>
>>> There are a few ways to get this problem solved:
>>> - we can have a MVCC backend, and a protection against concurrent
>>> modifications. Any read will always succeed, as each read will use a
>>> revision and only one.
> Let's say we want to implement a txn system within JDBM. We have to
> implement this not within a single B+ tree but across B+ trees.
Yes. But that does not really matter, as long as two modifications can't
occur concurrently.
> How
> will this be different from what we are trying to implement now? We
> still need a WAL log keeping track of txns on top of B+ trees, changes
> could be kept track of in terms of pages or entries and indices. Old
> version of data has to be copied over to some other location before
> newer version can overwrite it or newer version has to be kept at
> location X as long as readers need the old data. Any MVCC system has
> to do something like this.
No, we don't need all this mechanism if we block all other modifications
while one modification is being processed. I agree that modifications will
be slower, but this is a price I am willing to pay if, at the same time, I
can guarantee consistent *and* concurrent reads.
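
To make the idea concrete, here is a rough Java sketch of serialized
modifications combined with reads pinned to a single revision. It is only
an illustration under assumptions: the names SingleWriterStore, Revision,
beginRead and put are hypothetical and are not JDBM or ApacheDS APIs, and
copying the whole map on each write stands in for copying only the
modified B+ tree pages.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: one writer at a time, each reader pinned to one revision. */
class SingleWriterStore {
    /** An immutable snapshot of the data at a given revision number. */
    public static final class Revision {
        public final long number;
        public final Map<String, String> entries;

        Revision(long number, Map<String, String> entries) {
            this.number = number;
            this.entries = Collections.unmodifiableMap(entries);
        }
    }

    // Serializes modifications: only one can be processed at a time.
    private final ReentrantLock writeLock = new ReentrantLock();

    // The latest published revision; readers fetch it without taking any lock.
    private final AtomicReference<Revision> current =
        new AtomicReference<>(new Revision(0L, new HashMap<>()));

    /** Readers never block: they take the latest revision and keep using it. */
    public Revision beginRead() {
        return current.get();
    }

    /** Applies one modification and publishes it as a new revision. */
    public long put(String key, String value) {
        writeLock.lock();
        try {
            Revision old = current.get();
            // Assumption for brevity: copy everything instead of only the touched pages.
            Map<String, String> copy = new HashMap<>(old.entries);
            copy.put(key, value);
            Revision next = new Revision(old.number + 1, copy);
            current.set(next);
            return next.number;
        } finally {
            writeLock.unlock();
        }
    }
}
```

A search that called beginRead() before a put() keeps seeing the old
revision until it finishes, while later searches see the new one. Old
revisions simply become garbage once no reader holds them.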
>
> For us, newer version of data is kept at WAL as long as a reader needs
> the old version of data. As explained below, for simplicity we keep a
> copy of WAL in memory in a format that makes merging data for readers
> easier and faster. More on this below.
>
> I think what we implement right now is not very different from what we
> would implement inside a single partition.
With a single partition, I don't need to keep anything in memory, 
assuming I serialize the modifications.
>>> - we can also read the results quickly and store them somewhere, blocking
>>> the modification until the read is finished.
>>> - or we can keep a copy of the modified elements within the original
>>> elements, until the searches that use those elements are finished.
>>>
>>> (there are probably some other solutions, but I don't know them)
>>>
>>> AFAICT, the transaction branch is implementing the third solution,
>>> keeping the copy of modified elements in memory, so that they can be
>>> sent back to the user.
> It is true that the current txn system makes use of in-memory copies
> for fast merge of data. However, what it really does is just keep
> a copy of the txn WAL log in memory. This can be extended to discard the
> in-memory copy and read directly from the WAL when memory exceeds some
> threshold, for example. Implementing read from memory was just easier.
>
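
As I understand the description above, a read merges the data stored in
the partition with the logged changes that are visible to the reader. A
rough Java sketch of that merge follows. It is not the code from the txn
branch: WalMergeView, LogRecord, putBase, append and read are hypothetical
names, and the sketch assumes records are appended in increasing txn id
order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: readers merge partition data with an in-memory WAL copy. */
class WalMergeView {
    /** One logged change; a null value stands for a deletion. */
    static final class LogRecord {
        final long txnId;
        final String key;
        final String value;

        LogRecord(long txnId, String key, String value) {
            this.txnId = txnId;
            this.key = key;
            this.value = value;
        }
    }

    // Data as currently stored in the partition (the old versions).
    private final Map<String, String> base = new HashMap<>();

    // In-memory copy of the WAL, assumed to be appended in txn id order.
    private final List<LogRecord> log = new ArrayList<>();

    /** Loads a value into the simulated partition. */
    synchronized void putBase(String key, String value) {
        base.put(key, value);
    }

    /** Records a committed change in the log without touching the partition. */
    synchronized void append(long txnId, String key, String value) {
        log.add(new LogRecord(txnId, key, value));
    }

    /** Returns the value visible to a reader whose snapshot is readTxnId. */
    synchronized String read(String key, long readTxnId) {
        String result = base.get(key);
        for (LogRecord record : log) {
            if (record.txnId > readTxnId) {
                break; // later txns are invisible to this reader
            }
            if (record.key.equals(key)) {
                result = record.value; // a newer visible change wins
            }
        }
        return result;
    }
}
```

Reading from the WAL file instead of the list would change only where the
records come from, not the merge itself.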
> Also think of adding another partition tomorrow. Say an HBase partition
> is added which exposes atomic writes and atomic reads or consistent
> scans. If we plug that partition into what we are
> implementing right now, txns over HBase partitions would just work
> without much effort.
Yes. What you have written is also a way to keep partitions dumb. What
I'm suggesting forces you to have MVCC-capable partitions, which is a
real hassle. Now, let's face it: do we need anything else for now? Plus,
HBase already implements a similar system to protect reads against
concurrent modifications, so we don't necessarily need to have it.
Also keep in mind that if we want to implement the solution I proposed, 
we still need to modify the code to protect the partitions against 
concurrent modifications, and to leverage the MVCC parts in JDBM (and 
probably write the versions on disk too).

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

