zookeeper-dev mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: How to ensure transaction create-and-update
Date Tue, 30 Mar 2010 22:42:11 GMT
Sinfonia is pretty cool, but the commit mechanism is not simple and the 
ordering guarantees are different. i think we can do it simpler in 
zookeeper. basically, we would just need to be able to list a set of 
operations in a single zxid rather than just one operation. in some 
sense we do do this a little bit: close session is an atomic transaction 
that deletes a bunch of ephemeral nodes.
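
A minimal sketch of what "a set of operations in a single zxid" could look like, using a toy in-memory tree. `ToyTree`, `apply_multi`, and the tuple format for operations are hypothetical names made up for illustration; none of this is ZooKeeper code or a proposed API.

```python
class ToyTree:
    """Hypothetical in-memory stand-in for a data tree (not ZooKeeper code)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)
        self.zxid = 0    # id of the last applied transaction

    def apply_multi(self, ops):
        """Apply every op under one new zxid, or apply none of them.

        Each op is a tuple (assumed format):
          ("create", path, data)
          ("set", path, data, expected_version)
          ("delete", path)
        """
        # Stage changes on a copy so a failed batch leaves no trace.
        staged = dict(self.nodes)
        for op in ops:
            kind, path = op[0], op[1]
            if kind == "create":
                if path in staged:
                    return False
                staged[path] = (op[2], 0)
            elif kind == "set":
                if path not in staged or staged[path][1] != op[3]:
                    return False
                staged[path] = (op[2], op[3] + 1)
            elif kind == "delete":
                if path not in staged:
                    return False
                del staged[path]
            else:
                raise ValueError(f"unknown op {kind!r}")
        # The whole batch becomes visible at once and consumes one zxid.
        self.nodes = staged
        self.zxid += 1
        return True
```

In this sketch a failed batch changes neither the nodes nor the zxid, which is only one of several possible choices.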

to be honest here are the reservations i have:

1) ted's "non-blocking" observation is very good. right now we do 
throttling and balancing to give consistent response time for users, and 
it works pretty well because all operations are more or less equivalent. 
if you can make a compound operation made up of multiple sub operations 
(especially if they fall in the less equivalent case) this may not be 
the case and for practical purposes you get blocking.

2) returning errors starts getting a bit funky. what happens if some of 
the operations fail? like a create or conditional set. not a big problem 
to decide and implement, but i think it makes it harder to use. (there 
will be use cases to motivate all sorts of different choices: abort the 
whole thing on any failures, execute everything and return results, fail 
everything after first failure, etc)

3) if we ever get to partitioned namespace, it will be very hard to do 
transactions across partitions. we will probably relax ordering across 
partitions, so you could argue that we wouldn't support transactions 
either, but then the question comes back to how do you reflect this back 
to the user?
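
To make the error-handling choices in point 2 concrete, here is a rough sketch of the three policies applied to the same batch. Everything in it (`run_batch`, `OpFailed`, the policy names, the `create` and `set_data` helpers) is hypothetical and operates on a plain dict, not on ZooKeeper.

```python
class OpFailed(Exception):
    """Raised by a sub-operation whose precondition does not hold."""


def create(path, data):
    def op(nodes):
        if path in nodes:
            raise OpFailed(path)
        nodes[path] = data
    return op


def set_data(path, data):
    def op(nodes):
        if path not in nodes:
            raise OpFailed(path)
        nodes[path] = data
    return op


def run_batch(state, ops, policy):
    """Return (new_state, results) without modifying `state`.

    policy is one of (assumed names):
      "abort_all"     - any failure discards the whole batch
      "run_all"       - execute everything, report each result
      "stop_at_first" - keep earlier successes, skip the rest
    """
    if policy not in ("abort_all", "run_all", "stop_at_first"):
        raise ValueError(f"unknown policy {policy!r}")
    work = dict(state)
    results = []
    failed = False
    for op in ops:
        if failed and policy != "run_all":
            results.append("skipped")
            continue
        try:
            op(work)
            results.append("ok")
        except OpFailed:
            results.append("failed")
            failed = True
    if failed and policy == "abort_all":
        # Earlier successes are undone, so report them as rolled back.
        results = ["rolled_back" if r == "ok" else r for r in results]
        return dict(state), results
    return work, results
```

The same three operations leave three different end states depending on the policy, and the caller has to know which policy is in force to interpret the results list.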

perhaps, we may want to broach this in the future, but i would rather 
get things like ZOOKEEPER-22 in before we complicate things.


On 03/30/2010 01:29 PM, Henry Robinson wrote:
> [Moving to dev]
> Although I'm in total agreement with the idea of "no complexity until it's
> necessary" I don't see that there's a really strong technical reason not to
> include this primitive. It's very similar to the multi-get style API that,
> say, memcache gives you.
> zoo_multi_test_and_set(List<int> versions, List<string> znodes, List<byte[]> data)
> would be an example API, and seems to me like it could be implemented in the
> same way as a single set_data call. I definitely don't support any kind of
> multiple-call api (like transactions) because it doesn't fit with the
> ZooKeeper one method call = 1 linearization point model. I really do
> recommend the Sinfonia paper from SOSP '07 for those that haven't read it (
> http://www.hpl.hp.com/personal/Mehul_Shah/papers/sosp_2007_aguilera.pdf) for
> a nice implementation of these kinds of ideas.
> A supporting argument is this: if this *is* very hard to implement
> currently, I think we could expend some effort to make it easier. Decoupling
> operations on the data tree and voting for them further (and also decoupling
> session management and data tree updates) would be a worthwhile cleanup for
> 3.4.0. It would be really cool to be able to put a different storage engine
> behind ZK (I can think of many examples!) with a minimum of effort. At the
> same time, there are some API calls that I might find useful (get minimum
> sequential node, for example) whose prototyping and implementation would be
> made easier.
> cheers,
> Henry
> On 30 March 2010 13:00, Benjamin Reed <breed@yahoo-inc.com> wrote:
>> i agree with ted. i think he points out some disadvantages with trying to
>> do more. there is a slippery slope with these kinds of things. the
>> do more. there is a slippery slope with these kinds of things. the
>> implementation is complicated enough even with the simple model that we use.
>> ben
>> On 03/29/2010 08:34 PM, Ted Dunning wrote:
>>> I perhaps should not have said power, except insofar as ZK's strengths are
>>> in reliability which derives from simplicity.
>>> There are essentially two common ways to implement multi-node update.  The
>>> first is the traditional db style with begin-transaction paired with either
>>> a commit or a rollback after some number of updates.  This is clearly
>>> unacceptable in the ZK world if the updates are sent to the server because
>>> there can be an indefinite delay between the begin and commit.
>>> A second approach is to buffer all of the updates on the client side and
>>> transmit them in a batch to the server to succeed or fail as a group.  This
>>> allows updates to be arbitrarily complex which begins to eat away at the
>>> "no-blocking" guarantee a bit.
>>> On Mon, Mar 29, 2010 at 8:08 PM, Henry Robinson <henry@cloudera.com> wrote:
>>>> Could you say a bit about how you feel ZK would sacrifice power and
>>>> reliability through multi-node updates? My view is that it wouldn't: since
>>>> all operations are executed serially, there's no concurrency to be lost by
>>>> allowing multi-updates, and there doesn't need to be a 'start / end'
>>>> transactional style interface (which I do believe would be very bad).
>>>> I could see ZK implement a Sinfonia-style batch operation API which makes
>>>> all-or-none updates. The reason I can see that it doesn't already allow this
>>>> is the avowed intent of the original ZK team to keep the API as simple as it
>>>> can reasonably be, and to not introduce complexity without need.
