directory-dev mailing list archives

From Selcuk AYA <>
Subject Re: Txn discussion
Date Sun, 10 Jun 2012 00:48:06 GMT
On Sun, Jun 10, 2012 at 3:00 AM, Emmanuel Lécharny <> wrote:
> Le 6/9/12 11:46 PM, Selcuk AYA a écrit :
>> Let's say we sacrifice cross-partition txns. I think that is OK.
> It's not a sacrifice. I see it as a decision to postpone it for now.
>> On Sat, Jun 9, 2012 at 10:45 PM, Howard Chu<>  wrote:
>>> Emmanuel Lécharny wrote:
>>>> Hi guys,
>>>> independently from the ongoing work on the txn layer, I'd like to start
>>>> a thread of discussion about the path we selected, and the other
>>>> possible options.
>>>> Feel free to express your opinion here, I'll create a few items I'd like
>>>> to see debated.
>>>> 1) Introduction
>>>> We badly need to have a consistent system. The fact is that the current
>>>> trunk (and I guess this is true for all the releases we have done so
>>>> far) suffers from some serious issues when multiple modifications are
>>>> done during searches. The reason is that we depend on a BTree
>>>> implementation that exposes a data structure directly reading the pages
>>>> containing the data, expecting those pages to remain unchanged in the
>>>> long run. Obviously, when we browse more than one entry, we are likely to
>>>> see a modification changing the data...
>>>> 2) txn layer
>>>> There are a few ways to get this problem solved:
>>>> - we can have a MVCC backend, and a protection against concurrent
>>>> modifications. Any read will always succeed, as each read will use a
>>>> revision and only one.
>> Let's say we want to implement a txn system within JDBM. We have to
>> implement this not within a single B+ tree but across B+ trees.
> Yes. But that does not really matter, as long as two modifications can't
> occur concurrently.
>> How
>> will this be different from what we are trying to implement now? We
>> still need a WAL keeping track of txns on top of the B+ trees; changes
>> could be tracked in terms of pages or of entries and indices. The old
>> version of the data has to be copied to some other location before a
>> newer version can overwrite it, or the newer version has to be kept at
>> location X as long as readers need the old data. Any MVCC system has
>> to do something like this.
> No, we don't need all this mechanism if we block all the modifications while
> a modification is being processed. I agree that modifications will be
> slower, but this is a price I want to pay if, at the same time, I can
> guarantee consistent *and* concurrent reads.

Say you have a single modification that touches a couple of entries and
indices. How will reads proceed concurrently if the ongoing
modification does not take care to avoid overwriting the versions
the reads are using?
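
A minimal sketch of the concern, in Java and not taken from the txn
branch or from JDBM: even when writers are serialized, a reader only
stays consistent if the writer leaves the reader's version untouched.
Here the writer copies the current state, changes the copy, and
publishes it in one step. `SnapshotStore`, `modify` and `snapshot` are
hypothetical names, and a plain sorted map stands in for the B+ trees.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration: one writer at a time, readers never blocked.
class SnapshotStore {
    // Readers always dereference an immutable version of the data.
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Collections.unmodifiableMap(new TreeMap<String, String>()));

    // Serialized writers: a modification touching several keys is applied
    // to a private copy and becomes visible in a single step.
    public synchronized void modify(Map<String, String> changes) {
        TreeMap<String, String> copy = new TreeMap<>(current.get());
        copy.putAll(changes);
        current.set(Collections.unmodifiableMap(copy));
    }

    // A search pins the version it started with and keeps reading it,
    // no matter how many modifications are published afterwards.
    public Map<String, String> snapshot() {
        return current.get();
    }
}
```

Copying everything on each write is only acceptable in a toy like this.
A real store would have to copy at page or entry granularity, which is
the bookkeeping the question above is about.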
>> For us, the newer version of the data is kept in the WAL as long as a
>> reader needs the old version. As explained below, for simplicity we keep a
>> copy of the WAL in memory in a format that makes merging data for readers
>> easier and faster. More on this below.
>> I think what we implement right now is not very different from what we
>> would implement inside a single partition.
> With a single partition, I don't need to keep anything in memory, assuming I
> serialize the modifications.
>>>> - we can also read the results fast and store them somewhere, blocking
>>>> the modification until the read is finished.
>>>> - or we can keep a copy of the modified elements within the original
>>>> elements, until the searches that use those elements are finished.
>>>> (there are probably some other solutions, but I don't know them)
>>>> AFAICT, the transaction branch is implementing the third solution,
>>>> keeping the copy of modified elements in memory, so that they can be
>>>> sent back to the user.
>> It is true that the current txn system makes use of in-memory copies
>> for fast merging of data. However, what it really does is just keep
>> a copy of the txn WAL in memory. This can be extended to discard the
>> in-memory copy and read directly from the WAL when memory exceeds some
>> threshold, for example. Implementing reads from memory was just easier.
>> Also think of adding another partition tomorrow. Say an HBASE partition
>> is added which exposes atomic writes and atomic reads or
>> consistent scans. If we plug that partition into what we are
>> implementing right now, txns over HBASE partitions would just work
>> without much effort.
> Yes. What you have written is also a way to keep partitions dumb. What I'm
> suggesting forces you to have MVCC-capable partitions, which is a real
> hassle. Now, let's face it: do we need anything else, atm? Plus HBase
> already implements a similar system to protect reads against concurrent
> modifications, so we don't necessarily need to have it.
> Also keep in mind that if we want to implement the solution I proposed, we
> still need to modify the code to protect the partitions against concurrent
> modifications, and to leverage the MVCC parts in JDBM (and probably write
> the versions on disk too).

No. HBASE is not transactional. You still need transactions to make
queries consistent.
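
For reference, the read path described earlier in this message (newer
versions stay in the WAL while a reader still needs the old ones, and a
reader merges what it is allowed to see) can be sketched roughly as
below. This is an illustration under assumptions, not the code in the
txn branch: `WalMergeSketch`, `LogEdit`, `commit`, `beginRead` and
`read` are made-up names, and the partition is reduced to a plain map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the partition holds the old data, committed changes
// wait in an in-memory copy of the log, and readers merge the two.
class WalMergeSketch {
    // One logged change; a null value stands for a delete.
    static final class LogEdit {
        final long version;
        final String key;
        final String value;

        LogEdit(long version, String key, String value) {
            this.version = version;
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> partition = new HashMap<>(); // not yet overwritten
    private final List<LogEdit> log = new ArrayList<>();            // ordered by version
    private long lastCommitted = 0;

    WalMergeSketch(Map<String, String> initial) {
        partition.putAll(initial);
    }

    // Commit a txn: all of its edits share one version number.
    synchronized long commit(Map<String, String> edits) {
        long version = ++lastCommitted;
        for (Map.Entry<String, String> e : edits.entrySet()) {
            log.add(new LogEdit(version, e.getKey(), e.getValue()));
        }
        return version;
    }

    // A reader records this once, when its search starts.
    synchronized long beginRead() {
        return lastCommitted;
    }

    // Merge: the newest logged edit visible to the reader wins,
    // otherwise the value comes from the partition.
    synchronized String read(String key, long readVersion) {
        for (int i = log.size() - 1; i >= 0; i--) {
            LogEdit edit = log.get(i);
            if (edit.version <= readVersion && edit.key.equals(key)) {
                return edit.value;
            }
        }
        return partition.get(key);
    }
}
```

The sketch leaves out flushing logged edits into the partition once no
reader holds an older version, as well as durability. The partition
itself only has to offer plain reads and writes here.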
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
