incubator-cassandra-user mailing list archives

From Boris Yen <>
Subject Re: best practices for simulating transactions in Cassandra
Date Thu, 15 Dec 2011 08:07:39 GMT
I am not sure if this is the right thread to ask about this.

I read that some people are using Cages + ZooKeeper. I was wondering if anyone
has evaluated it, as it seems to be a versatile

On Tue, Dec 13, 2011 at 6:06 AM, John Laban <> wrote:

> Ok, great.  I'll be sure to look into the virtualization-specific NTP
> guides.
> Another benefit of using Cassandra over ZooKeeper for locking is that you
> don't have to worry about losing your connection to ZooKeeper (and with it
> your locks) while hammering away at data in Cassandra.  If you use Cassandra
> for locks, then losing your locks means you've lost your connection to the
> datastore too.  (We're using long-ish session timeouts + connection listeners
> in ZK to mitigate that now.)
> John
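
A minimal sketch of the session-timeout plus connection-listener approach
mentioned above, using the kazoo Python client; the hosts, the timeout value,
and the reactions in the handler are illustrative assumptions, not the setup
actually used:

    from kazoo.client import KazooClient, KazooState

    # Long-ish session timeout (seconds) so a brief network blip doesn't
    # immediately expire the session and drop every lock held through it.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181", timeout=30.0)

    def connection_listener(state):
        # Called by kazoo (from its own thread) on connection state changes.
        if state == KazooState.LOST:
            # Session expired: all locks held through this session are gone,
            # so stop mutating the data that was protected by them.
            print("ZooKeeper session lost - abort in-flight work")
        elif state == KazooState.SUSPENDED:
            # Temporarily disconnected: pause writes until reconnected.
            print("ZooKeeper connection suspended - pause writes")

    zk.add_listener(connection_listener)
    zk.start()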
> On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams <
>> wrote:
>> Hi John,
>> On 12 December 2011 19:35, John Laban <> wrote:
>>> So I responded to your algorithm in another part of this thread (very
>>> interesting) but this part of the paper caught my attention:
>>> > When client application code releases a lock, that lock must not
>>> actually be
>>> > released for a period equal to one millisecond plus twice the maximum
>>> possible
>>> > drift of the clocks in the client computers accessing the Cassandra
>>> databases
>>> I've been worried about this, and added some arbitrary delay in the
>>> releasing of my locks.  But I don't like it as it's (A) an arbitrary value
>>> and (B) it will - perhaps greatly - reduce the throughput of the more
>>> high-contention areas of my system.
>>> To fix (B) I'll probably just have to try to get rid of locks
>>> altogether in these high-contention areas.
>>> To fix (A), I'd need to know what the maximum possible drift of my
>>> clocks will be.  How did you determine this?  What value do you use, out of
>>> curiosity?  What does the network layout of your client machines look like?
>>>  (Are any of your hosts geographically separated or all running in the same
>>> DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
>>> clock skew on an ongoing basis?  Am I worrying too much?
>> If you set up NTP carefully, no machine should drift more than, say, 4ms. I
>> forget where, but you'll find the best documentation on how to make a
>> bullet-proof NTP setup on vendor sites for virtualization software (because
>> virtualization software can cause drift, so the NTP setup has to be just so).
>> What this means is that, for example, to be really safe when a thread
>> releases a lock you should wait, say, 9ms. Some points:-
>> -- since the sleep is performed before release, an isolated operation
>> should not be delayed at all
>> -- only a waiting thread, or a thread requesting the lock immediately after
>> it is released, will be delayed, and no extra CPU or memory load is involved
>> -- in practice for the vast majority of "application layer" data
>> operations this restriction will have no effect on overall performance as
>> experienced by a user, because such operations nearly always read and write
>> to data with limited scope, for example the data of two users involved in
>> some transaction
>> -- the clocks issue does mean that you can't really serialize access to
>> more broadly shared data where more than 5 or 10 such requests are made per
>> second, say. But in reality, even if the extra 9ms sleep on release wasn't
>> necessary, variability in database operation execution time (say under
>> load, or when something goes wrong) means trouble might occur serializing
>> with that level of contention
>> So in summary, although this drift thing seems bad at first, partly
>> because it is a new consideration, in practice it's no big deal so long as
>> you look after your clocks (and the main issue to watch out for is when
>> application nodes running on virtualization software, hypervisors et al
>> have setup issues that make their clocks drift under load, and it is a good
>> idea to be wary of that)
>> Best, Dominic
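
A minimal sketch of the delayed-release idea described above, assuming the 4ms
drift bound and the resulting 9ms wait; the DriftSafeLock wrapper and its inner
lock interface are hypothetical, not Cages' actual implementation:

    import time

    MAX_CLOCK_DRIFT_MS = 4                         # what careful NTP should keep you under
    RELEASE_DELAY_MS = 2 * MAX_CLOCK_DRIFT_MS + 1  # the "9ms" mentioned above

    class DriftSafeLock(object):
        """Wraps any lock exposing acquire()/release() (interface is hypothetical)."""

        def __init__(self, inner_lock):
            self._inner = inner_lock

        def acquire(self):
            self._inner.acquire()

        def release(self):
            # Sleep *before* releasing, so an isolated operation is not delayed;
            # only a thread already waiting on this lock sees the extra 9ms.
            time.sleep(RELEASE_DELAY_MS / 1000.0)
            self._inner.release()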
>>> Sorry for all the questions but I'm very concerned about this particular
>>> problem :)
>>> Thanks,
>>> John
>>> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
>>>> wrote:
>>>> Hi guys, just thought I'd chip in...
>>>> Fight My Monster is still using Cages, which is working fine, but...
>>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are
>>>> 2 main reasons:-
>>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we
>>>> aren't getting anywhere near to capacity and we do a *lot*
>>>> of serialisation) at some point it will be necessary to start hashing lock
>>>> paths onto separate ZooKeeper clusters, and I tend to believe that these
>>>> days you should choose platforms that handle sharding themselves (e.g.
>>>> choose Cassandra rather than MySQL)
>>>> 2. Why have more components in your system when you can have fewer!!!
>>>> KISS
>>>> Recently I therefore tried to devise an algorithm which can be used to
>>>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>>>> etc.
>>>> There is a doc describing the algorithm, to which may be added an
>>>> appendix describing a protocol so that locking can be interoperable between
>>>> the clients. That could be extended to describe a protocol for
>>>> transactions. Word of warning: this is a *beta* algorithm that has only been
>>>> seen by a select group so far, and therefore I'm not even 100% sure it works,
>>>> but there is a useful general discussion regarding serialization of
>>>> reads/writes so I include it anyway (and since this algorithm is going to
>>>> be out there now, if there's anyone out there who fancies doing a Z proof
>>>> or disproof, that would be fantastic).
>>>> Final word on this re transactions: if/when transactions are added to the
>>>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>>>> performance than ZooKeeper for storing snapshots, especially as transaction
>>>> size increases.
>>>> Best, Dominic
>>>> On 11 December 2011 01:53, Guy Incognito <> wrote:
>>>>>  you could try writing with the clock of the initial replay entry?
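
A rough sketch of that suggestion, assuming a last-write-wins store keyed by
(value, timestamp) pairs in the way Cassandra resolves columns; the log entry
layout is hypothetical:

    data_cf = {}    # stand-in: row key -> {column: (value, timestamp)}

    def write_column(row_key, col, value, timestamp):
        row = data_cf.setdefault(row_key, {})
        old = row.get(col)
        if old is None or timestamp >= old[1]:   # last-write-wins, like Cassandra
            row[col] = (value, timestamp)

    def replay(log_entry):
        # The replay carries the clock of the *initial* attempt, not "now",
        # so it can never clobber data written after the original attempt.
        for col, value in log_entry["cols"].items():
            write_column(log_entry["row"], col, value, log_entry["timestamp"])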
>>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>>>>> transactions to Cages, but instead of snapshotting the data to be rolled
>>>>> back (the "before" data), you snapshot the data to be replayed (the "after"
>>>>> data).  And then later, if you find that the transaction didn't complete,
>>>>> you just keep replaying the transaction until it takes.
>>>>>  The part I don't understand with this approach though:  how do you
>>>>> ensure that someone else didn't change the data between your initial
>>>>> transaction and the later replaying of the transaction?  You could lose
>>>>> writes in that situation.
>>>>>  Dominic (in the Cages blog post) explained a workaround for that in
>>>>> his rollback proposal:  all subsequent readers or writers of that data
>>>>> would have to check for abandoned transactions and roll them back
>>>>> themselves before they could read the data.  I don't think this is possible
>>>>> with the XACT_LOG "replay" approach in these slides though, based on how
>>>>> the data is indexed (cassandra node token + timeUUID).
>>>>>  PS:  How are you liking Cages?
>>>>> 2011/12/6 Jérémy SEVELLEC <>
>>>>>> Hi John,
>>>>>>  I had exactly the same thoughts.
>>>>>>  I'm using ZooKeeper and Cages to lock and isolate.
>>>>>>  But how to rollback?
>>>>>> It's impossible, so try replay!
>>>>>>  The idea is explained in this presentation
>>>>>> (starting from slide 24):
>>>>>>  - insert your whole data into one column
>>>>>> - do the job
>>>>>> - remove (or expire) your column.
>>>>>>  If there is a problem while "doing the job", you keep the
>>>>>> possibility to replay and replay and replay (synchronously or in a batch).
>>>>>>  Regards
>>>>>>  Jérémy
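
A rough sketch of the replay recipe above, with plain dicts standing in for a
hypothetical XACT_LOG column family and the real data column family (a real
implementation would use a Cassandra client and a TTL instead of a delete):

    import json, uuid

    xact_log = {}   # stand-in for an XACT_LOG column family
    data_cf = {}    # stand-in for the real data column family

    def transactional_write(row_key, columns):
        log_key = str(uuid.uuid1())                       # timeUUID-style key
        xact_log[log_key] = json.dumps({"row": row_key, "cols": columns})
        apply_entry(log_key)                              # "do the job"
        del xact_log[log_key]                             # or expire via TTL

    def apply_entry(log_key):
        entry = json.loads(xact_log[log_key])
        data_cf.setdefault(entry["row"], {}).update(entry["cols"])

    def replay_abandoned():
        # Run synchronously before reads, or periodically as a batch job:
        # anything still in the log never finished, so just apply it again.
        for log_key in list(xact_log):
            apply_entry(log_key)
            del xact_log[log_key]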
>>>>>> 2011/12/5 John Laban <>
>>>>>>> Hello,
>>>>>>>  I'm building a system using Cassandra as a datastore and I have a
>>>>>>> few places where I am in need of transactions.
>>>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>>>> concurrency control or isolation, so that solves that half of the puzzle.
>>>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>>>> multiple writes by simulating the "begin/rollback/commit" abilities of a
>>>>>>> relational DB.  In other words, there are places where I need to perform
>>>>>>> multiple updates/inserts, and if I fail partway through, I would ideally be
>>>>>>> able to roll back the partially-applied updates.
>>>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm
>>>>>>> looking for are all the best practices, or at least tips and tricks, so
>>>>>>> that I can get around this limitation in Cassandra and still maintain a
>>>>>>> consistent datastore.  (I am using quorum reads/writes so that eventual
>>>>>>> consistency doesn't kick my ass here as well.)
>>>>>>>  Below are some ideas I've been able to dig up.  Please let me know
>>>>>>> if any of them don't make sense, or if there are better approaches:
>>>>>>>  1) Updates to a row in a column family are atomic.  So try to
>>>>>>> model your data so that you would only ever need to update a single row in
>>>>>>> a single CF at once.  Essentially, you model your data around your
>>>>>>> transactions.  This is tricky but can certainly be done in some situations.
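
A tiny sketch of (1), assuming a hypothetical Orders column family where
everything touched by one logical transaction lives in a single row, so a
single insert covers the whole "transaction":

    orders_cf = {}   # stand-in for a hypothetical Orders column family

    def place_order(order_id, user_id, items, total):
        # One insert, one row, one CF: the whole logical transaction is atomic.
        orders_cf[order_id] = {
            "user_id": user_id,
            "items": ",".join(items),
            "total": str(total),
            "status": "placed",
        }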
>>>>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>>>>> updates), have one of the rows act as a 'commit' by essentially validating
>>>>>>> the presence of the other rows.  For example, say you were performing an
>>>>>>> operation where you wanted to create an Account row and 5 User rows all at
>>>>>>> once (this is an unlikely example, but bear with me).  You could insert 5
>>>>>>> rows into the Users CF, and then the 1 row into the Accounts CF, which acts
>>>>>>> as the commit.  If something went wrong before the Account could be
>>>>>>> created, any Users that had been created so far would be orphaned and
>>>>>>> unusable, as your business logic can ensure that they can't exist without
>>>>>>> an Account.  You could also have an offline cleanup process that swept away
>>>>>>> orphans.
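
A sketch of the commit-row idea in (2), with dicts standing in for hypothetical
Users and Accounts column families:

    users_cf = {}
    accounts_cf = {}

    def create_account_with_users(account_id, user_ids):
        # 1. Insert the dependent rows first; they are unusable on their own.
        for uid in user_ids:
            users_cf[uid] = {"account_id": account_id}
        # 2. Insert the Account row last; its presence is the "commit".
        accounts_cf[account_id] = {"user_count": str(len(user_ids))}

    def get_user(uid):
        user = users_cf.get(uid)
        # Business logic treats Users without an Account as orphans.
        if user is None or user["account_id"] not in accounts_cf:
            return None
        return user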
>>>>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>>>>>  How do you model updates as inserts?  Instead of munging the value
>>>>>>> directly, you could insert a column containing the operation you want to
>>>>>>> perform (like "+5").  It would work kind of like the Consistent
>>>>>>> Counting implementation: ( ).  How
>>>>>>> do you make the inserts idempotent?  Make sure the column names correspond
>>>>>>> to a request ID or some other identifier that would be identical across
>>>>>>> re-drives of a given (perhaps originally failed) request.  This could leave
>>>>>>> your datastore in a temporarily inconsistent state, but would eventually
>>>>>>> become consistent after a successful re-drive of the original request.
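
A sketch of (3), with a dict standing in for a hypothetical column family that
stores one operation column per request ID:

    counters_cf = {}   # stand-in: row key -> {request_id: operation column}

    def record_increment(row_key, request_id, delta):
        # Idempotent: re-driving the same request rewrites the same column
        # instead of applying the operation a second time.
        counters_cf.setdefault(row_key, {})[request_id] = "%+d" % delta

    def read_value(row_key):
        # Readers fold the operation columns together to get the current value.
        return sum(int(op) for op in counters_cf.get(row_key, {}).values())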
>>>>>>>  4) You could take an approach like Dominic Williams proposed for
>>>>>>> Cages:
>>>>>>> The gist is that you snapshot all the original values that you're about
>>>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>>>>> snapshot data was never deleted, then subsequent accessors (even readers)
>>>>>>> of the data rows need to do the rollback of the previous transaction
>>>>>>> themselves before they can read/write this data.  They do the rollback by
>>>>>>> just overwriting the current values with what is in the snapshot.  This
>>>>>>> offloads the work of the rollback to the next worker that accesses the
>>>>>>> data.  This approach probably needs a generic/high-level programming layer
>>>>>>> to handle all of the details and complexity, and it doesn't seem like it
>>>>>>> was ever added to Cages.
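
A sketch of the snapshot/rollback scheme in (4); the dicts stand in for the
data column family and for the snapshot store (ZooKeeper in the Cages
proposal), and the function names are made up:

    snapshots = {}   # stand-in for the snapshot store (ZooKeeper in Cages)
    data_cf = {}     # stand-in for the data column family

    def update_with_snapshot(txn_id, row_key, new_columns):
        before = dict(data_cf.get(row_key, {}))
        snapshots[txn_id] = (row_key, before)                 # 1. snapshot originals
        data_cf.setdefault(row_key, {}).update(new_columns)   # 2. make the updates
        del snapshots[txn_id]                                 # 3. atomic delete = commit

    def read_with_rollback(row_key):
        # Any later reader/writer that still sees a snapshot rolls the
        # abandoned transaction back before touching the data.
        for txn_id, (key, before) in list(snapshots.items()):
            if key == row_key:
                data_cf[row_key] = dict(before)   # overwrite with snapshot values
                del snapshots[txn_id]
        return data_cf.get(row_key)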
>>>>>>>  Are there other approaches or best practices that I missed?  I
>>>>>>> would be very interested in hearing any opinions from those who have
>>>>>>> tackled these problems before.
>>>>>>>  Thanks!
>>>>>>>  John
>>>>>>   --
>>>>>> Jérémy
