On Sat, Jun 9, 2012 at 6:22 PM, Emmanuel Lécharny <elecharny@gmail.com> wrote:
Hi guys,

independently from the ongoing work on the txn layer, I'd like to start a thread of discussion about the path we selected, and the other possible options.

Feel free to express your opinion here; I'll create a few items I'd like to see debated.

1) Introduction

We badly need to have a consistent system. The fact is that the current trunk (and I guess this is true for all the releases we have done so far) suffers from some serious issues when multiple modifications are done during searches. The reason is that we depend on a BTree implementation that exposes a data structure directly reading the pages containing the data, expecting those pages to remain unchanged in the long run. Obviously, when we browse more than one entry, we are likely to see a modification changing the data...

2) txn layer

There are a few ways to get this problem solved:
- we can have an MVCC backend, and a protection against concurrent modifications. Any read will always succeed, as each read will use one revision and only one.
- we can also read the results quickly and store them somewhere, blocking modifications until the read is finished.
- or we can keep a copy of the modified elements within the original elements, until the searches that use those elements are finished.

(there are probably some other solutions, but I don't know them)

AFAICT, the transaction branch is implementing the third solution, keeping the copy of modified elements in memory, so that they can be sent back to the user.

None of those solutions is free of drawbacks.


Right now we're adding the foundations, so of course there will be issues initially. There are several techniques we can use to mitigate the problems.
 
I think that the first approach, even if it implies we force a serialization of the writes, is the best solution. The rationale, AFAICT, is that we don't have to deal with the way the backend keeps versions of elements; this is not our business. Plus, keeping the writes serialized guarantees that we won't compromise the backend.
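
To make the first approach concrete, here is a minimal sketch of what "MVCC plus serialized writes" could look like. It is a toy in-memory illustration, not the BTree backend or the txn branch code: the `VersionedStore` class, its method names and the use of plain dicts for entries are all assumptions made for the example.

```python
import threading


class VersionedStore:
    """Toy MVCC store (hypothetical): every write publishes a new revision."""

    def __init__(self):
        self._write_lock = threading.Lock()  # serializes all writers
        self._current = (0, {})              # (revision number, data for that revision)

    def snapshot(self):
        """Return the (revision, data) pair a reader keeps for its whole search."""
        return self._current

    def write(self, dn, entry):
        """Store entry under dn (or delete dn when entry is None) as a new revision."""
        with self._write_lock:
            revision, data = self._current
            new_data = dict(data)            # copy-on-write: older revisions stay untouched
            if entry is None:
                new_data.pop(dn, None)
            else:
                new_data[dn] = entry
            self._current = (revision + 1, new_data)
            return revision + 1
```

A search that calls `snapshot()` once and iterates over the returned data keeps seeing that single revision, whatever writes are committed in the meantime. A real backend would share unchanged pages between revisions instead of copying everything.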


As Selcuk already pointed out you will need the same machinery to do this below inside the partition. It will lead to the same problems. 
 
At this point, I'd like us to discuss all those options, whatever we are currently working on.

3) cross-partition vs single partition protection

Atm, we are working on a cross partition system. That means we protect all the partitions at the same time : moving an entry from one partition to another one will be done completely, or reverted.
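
As a rough illustration of "done completely, or reverted", the sketch below moves an entry between two partitions and undoes the partial work if a step fails. The `Partition` class, `move_entry` and the `fail_on_delete` flag are hypothetical names invented for the example; the real txn layer is of course more involved.

```python
class Partition:
    """Hypothetical in-memory partition; fail_on_delete simulates a backend error."""

    def __init__(self, name, fail_on_delete=False):
        self.name = name
        self.entries = {}
        self.fail_on_delete = fail_on_delete

    def add(self, dn, entry):
        if dn in self.entries:
            raise ValueError(f"{dn} already exists in {self.name}")
        self.entries[dn] = entry

    def delete(self, dn):
        if self.fail_on_delete:
            raise OSError(f"{self.name}: delete failed")
        return self.entries.pop(dn)


def move_entry(source, target, dn):
    """Move dn from source to target; on any failure, revert and re-raise."""
    undo_log = []
    try:
        entry = source.entries[dn]
        target.add(dn, entry)
        # Record the undo step only once the add has really happened.
        undo_log.append(lambda: target.entries.pop(dn, None))
        source.delete(dn)
    except Exception:
        for undo in reversed(undo_log):
            undo()
        raise
```

If `source.delete` fails after the add has succeeded, the undo log removes the copy from the target, so the entry never ends up in both partitions.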

I'm not sure we need such a feature. I don't see what it brings, and even if it brings some advantages, I'm not sure we need such a feature now.


I'm in complete disagreement. There are several reasons why we need to do this across partitions:

* First, keeping partitions simple: handling these semantics inside the partitions will make new partitions way too difficult to write
* Aliases working across partitions
* Implementing views and being able to have editable views
* Centrally rooted partition
* Nestable partitions
* ACID across partitions
* Better means to integrate with HBase partition
* Better cache management
* Better means to handle snapshotting and rollback
* Clear transaction boundaries even if changes are across partitions which makes replication easier to handle.
 
Say goodbye to a lot of these factors if we do not do this.

Not having to add a txn layer above the partitions is way easier to implement.


Probably easier, but not that much easier. We will need the same machinery if this is to work at the partition level, and the machinery will have to be implemented separately for each partition.
 
Here, too, I'd like us to discuss our options, and the pros and cons of using a txn layer on top of single partitions instead of multiple partitions.



I'm completely against this move as I think it will cause us more problems than the ones we can fully solve right now. We just need patience. 

Emmanuel, if you don't have time to deal with this painful merge, perhaps Selcuk and I can handle it?
 

OK, this is probably enough elements for us to discuss. Your turn :)


I understand there are hairy issues. However, realize that this is an incomplete state and that we do have ways to handle all the problems. Selcuk provided some excellent solutions in this thread.

To back out now would be a massive mistake. It would also curtail the growth and progress of the server in the ways described in our application document. This single decision here would be one of the worst we've ever made if we decide to back out at this stage.

FYI I'm going to be on the road for the next 48-72 hours. Will still try to respond to this thread.


--
Best Regards,
-- Alex