incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Laban <j...@pagerduty.com>
Subject Re: best practices for simulating transactions in Cassandra
Date Mon, 12 Dec 2011 21:54:09 GMT
Hey Jake,

So I guess my problem is that I've never really relied on NTP before to try
to guarantee consistency in my application.  Does it tend to work really
well in practice?  What's the maximum clock skew you can see even when
running NTP (especially if you're using more than one DC where you might
have some latency between clients)?

Reading your code, it looks like you sleep 100ms in between the column
insert and the subsequent read in order to mitigate clock skew.  That might
be enough to fix the problem there, and I think the algorithm makes sense
under the assumption that there won't be more than 100ms clock skew.

Is it possible for clients' clocks to deviate more than ~100ms when using
NTP?  Dominic mentioned elsewhere in this thread that clock skew might
often be an order of magnitude less, so 100ms seems safe.


> It inspects each column, that represents a different acquire attempt
> and compares those timestamps.  so if client A is skewed in the past
> but encounters a non-expiring column it knows the lock is taken.

If you *didn't* have that 100ms guard, there are ways in your algorithm for
two clients to both think they won the lock.  I've labelled the steps of
the interesting part:

   1) client writes to that row @ QUORUM a column name of it's ID with a
TTL of N seconds
   2) client instantly reads back the entire row @ QUORUM
   3) if client encounters a column that is non-expiring then the lock is
already acquired.
   4) if client encounters a non-deleted but expiring column with a
timestamp < the one it wrote then it sleeps and tries again.
   5) if clients own timestamp was the earliest then it has won the lock
and writes a non-expiring column of the same name to mark it as officially
locked.

Lets assume client A's clock is ahead of client B:

A does step 1 and 2
B does step 1 and 2
A does steps 3, 4, and 5 (it has the only column it read back in, so it
assumes it wins)
B does steps 3, 4, and 5 (it has the youngest column because of clock skew,
so it assumes it wins)

(Or did I made a mistake somewhere?)

Anyway, like I said, it should work if NTP keeps the clocks nice and tight.

John




On Mon, Dec 12, 2011 at 11:54 AM, Jake Luciani <jakers@gmail.com> wrote:

>
>> Jake:  The algorithm you've outlined is pretty similar to how Zookeeper
>> clients implement locking.  The potential only issue that I see with it
>> implemented in Cassandra is that it uses the timestamps of the inserted
>> columns to determine the winner of the lock.  The column timestamps are
>> generated by the clients (whose clocks can drift from each other), so its
>> possible for a client (whose clock is skewed to some time in the near past)
>> to accidentally "steal" a lock from another client who presently thinks
>> that it is the winner of the lock.  At least it seems that way to me.
>>
>>
> I don't see that. if a client wants to abuse the system or doesn't run NTP
> then it can grab all the locks. but each lock is guaranteed to be owned by
> one person. since the client timestamps are used to pick a winner, see
> point 4 and 5
>
> It inspects each column, that represents a different acquire attempt and
> compares those timestamps.  so if client A is skewed in the past but
> encounters a non-expiring column it knows the lock is taken.
>
> -Jake
>
>
>
>> Dominic:  I'll have to read-read your paper a few times (while furrowing
>> my brow and scratching my head) before I can convince myself that the
>> proposed algorithm doesn't have the possibility of deadlock or livelock.
>>  It does seem that you have covered a lot of the bases though.
>>
>> Thanks for sharing guys :)
>> John
>>
>>
>> On Mon, Dec 12, 2011 at 6:21 AM, Jake Luciani <jakers@gmail.com> wrote:
>>
>>> I've written a locking mechanism for Solandra  (I refer to it as a
>>> reservation system) which basically allows you to acquire a lock.  This is
>>> used to ensure a node is service unique sequential IDs for lucene.
>>>
>>> It sounds a bit similar to Dominic's description but I'll explain how
>>> the Solandra one works.
>>>
>>> The code is at
>>> https://github.com/tjake/Solandra/blob/solandra/src/lucandra/cluster/CassandraIndexManager.java#L714
>>>
>>> The algorithm is basically:
>>>
>>>    - each node has a unique id.
>>>    - a lock name is a row key
>>>    - client writes to that row @ QUORUM a column name of it's ID with a
>>> TTL of N seconds
>>>    - client instantly reads back the entire row @ QUORUM
>>>    - if client encounters a column that is non-expiring then the lock is
>>> already acquired.
>>>    - if client encounters a non-deleted but expiring column with a
>>> timestamp < the one it wrote then it sleeps and tries again.
>>>    - if clients own timestamp was the earliest then it has won the lock
>>> and writes a non-expiring column of the same name to mark it as officially
>>> locked.
>>>    - in the case of a tie (2 columns with same ts the uuids are sorted
>>> and the lesser one wins)
>>>    - once finished, node with the lock deletes the column and frees the
>>> lock.
>>>
>>> This algorithm allows for deadlocks because the client has a huge number
>>> of locks to work with.  It would be fairly simple to use a TTL again to
>>> make locks auto expire after N seconds, this would make it more like google
>>> chubby.
>>>
>>> It also allows for bad clients to game the system but that's not
>>> something that could be dealt with using authorization apis.
>>>
>>> For legacy reasons the linked code uses super columns but a regular
>>> column family will work just fine.
>>>
>>> -Jake
>>>
>>>
>>> On Mon, Dec 12, 2011 at 7:36 AM, Dominic Williams <
>>> dwilliams@fightmymonster.com> wrote:
>>>
>>>> Hi guys, just thought I'd chip in...
>>>>
>>>> Fight My Monster is still using Cages, which is working fine, but...
>>>>
>>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are
>>>> 2 main reasons:-
>>>>
>>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we
>>>> aren't getting anywhere near to capacity and we do a *lot*
>>>> of serialisation) at some point it will be necessary to start hashing lock
>>>> paths onto separate ZooKeeper clusters, and I tend to believe that these
>>>> days you should choose platforms that handle sharding themselves (e.g.
>>>> choose Cassandra rather than MySQL)
>>>>
>>>> 2. Why have more components in your system when you can have less!!!
>>>> KISS
>>>>
>>>> Recently I therefore tried to devise an algorithm which can be used to
>>>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>>>> etc.
>>>>
>>>> There is a doc describing the algorithm, to which may be added an
>>>> appendix describing a protocol so that locking can be interoperable between
>>>> the clients. That could be extended to describe a protocol for
>>>> transactions. Word of warning this is a *beta* algorithm that has only been
>>>> seen by a select group so far, and therefore not even 100% sure it works
>>>> but there is a useful general discussion regarding serialization of
>>>> reads/writes so I include it anyway (and since this algorithm is going to
>>>> be out there now, if there's anyone out there who fancies doing a Z proof
>>>> or disproof, that would be fantastic).
>>>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>>>
>>>> Final word on this re transactions: if/when transactions are added to
>>>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>>>> performance than ZooKeeper for storing snapshots, especially as transaction
>>>> size increases
>>>>
>>>> Best, Dominic
>>>>
>>>> On 11 December 2011 01:53, Guy Incognito <dnd1066@gmail.com> wrote:
>>>>
>>>>>  you could try writing with the clock of the initial replay entry?
>>>>>
>>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>>
>>>>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>>>>> transactions to Cages, but instead of snapshotting the data to be rolled
>>>>> back (the "before" data), you snapshot the data to be replayed (the "after"
>>>>> data).  And then later, if you find that the transaction didn't complete,
>>>>> you just keep replaying the transaction until it takes.
>>>>>
>>>>>  The part I don't understand with this approach though:  how do you
>>>>> ensure that someone else didn't change the data between your initial
failed
>>>>> transaction and the later replaying of the transaction?  You could get
lost
>>>>> writes in that situation.
>>>>>
>>>>>  Dominic (in the Cages blog post) explained a workaround with that
>>>>> for his rollback proposal:  all subsequent readers or writers of that
data
>>>>> would have to check for abandoned transactions and roll them back
>>>>> themselves before they could read the data.  I don't think this is possible
>>>>> with the XACT_LOG "replay" approach in these slides though, based on
how
>>>>> the data is indexed (cassandra node token + timeUUID).
>>>>>
>>>>>
>>>>>  PS:  How are you liking Cages?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2011/12/6 Jérémy SEVELLEC <jsevellec@gmail.com>
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>>  I had exactly the same reflexions.
>>>>>>
>>>>>>  I'm using zookeeper and cage to lock et isolate.
>>>>>>
>>>>>>  but how to rollback?
>>>>>> It's impossible so try replay!
>>>>>>
>>>>>>  the idea is explained in this presentation
>>>>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>>>>> from slide 24)
>>>>>>
>>>>>>  - insert your whole data into one column
>>>>>> - make the job
>>>>>> - remove (or expire) your column.
>>>>>>
>>>>>>  if there is a problem during "making the job", you keep the
>>>>>> possibility to replay and replay and replay (synchronously or in
a batch).
>>>>>>
>>>>>>  Regards
>>>>>>
>>>>>>  Jérémy
>>>>>>
>>>>>>
>>>>>> 2011/12/5 John Laban <john@pagerduty.com>
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>>  I'm building a system using Cassandra as a datastore and I have
a
>>>>>>> few places where I am need of transactions.
>>>>>>>
>>>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>>>> concurrency control or isolation, so that solves that half of
the puzzle.
>>>>>>>
>>>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>>>> multiple writes by simulating the "begin/rollback/commit" abilities
of a
>>>>>>> relational DB.  In other words, there are places where I need
to perform
>>>>>>> multiple updates/inserts, and if I fail partway through, I would
ideally be
>>>>>>> able to rollback the partially-applied updates.
>>>>>>>
>>>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm
>>>>>>> looking for are all the best practices, or at least tips and
tricks, so
>>>>>>> that I can get around this limitation in Cassandra and still
maintain a
>>>>>>> consistent datastore.  (I am using quorum reads/writes so that
eventual
>>>>>>> consistency doesn't kick my ass here as well.)
>>>>>>>
>>>>>>>  Below are some ideas I've been able to dig up.  Please let me
know
>>>>>>> if any of them don't make sense, or if there are better approaches:
>>>>>>>
>>>>>>>
>>>>>>>  1) Updates to a row in a column family are atomic.  So try to
>>>>>>> model your data so that you would only ever need to update a
single row in
>>>>>>> a single CF at once.  Essentially, you model your data around
transactions.
>>>>>>>  This is tricky but can certainly be done in some situations.
>>>>>>>
>>>>>>>  2) If you are only dealing with multiple row *inserts* (and
not
>>>>>>> updates), have one of the rows act as a 'commit' by essentially
validating
>>>>>>> the presence of the other rows.  For example, say you were performing
an
>>>>>>> operation where you wanted to create an Account row and 5 User
rows all at
>>>>>>> once (this is an unlikely example, but bear with me).  You could
insert 5
>>>>>>> rows into the Users CF, and then the 1 row into the Accounts
CF, which acts
>>>>>>> as the commit.  If something went wrong before the Account could
be
>>>>>>> created, any Users that had been created so far would be orphaned
and
>>>>>>> unusable, as your business logic can ensure that they can't exist
without
>>>>>>> an Account.  You could also have an offline cleanup process that
swept away
>>>>>>> orphans.
>>>>>>>
>>>>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>>>>>  How do you model updates as inserts?  Instead of munging the
value
>>>>>>> directly, you could insert a column containing the operation
you want to
>>>>>>> perform (like "+5").  It would work kind of like the Consistent
Vote
>>>>>>> Counting implementation: ( https://gist.github.com/416666 ).
 How
>>>>>>> do you make the inserts idempotent?  Make sure the column names
correspond
>>>>>>> to a request ID or some other identifier that would be identical
across
>>>>>>> re-drives of a given (perhaps originally failed) request.  This
could leave
>>>>>>> your datastore in a temporarily inconsistent state, but would
eventually
>>>>>>> become consistent after a successful re-drive of the original
request.
>>>>>>>
>>>>>>>  4) You could take an approach like Dominic Williams proposed
with
>>>>>>> Cages:
>>>>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
  The gist is that you snapshot all the original values that you're about
>>>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates,
and
>>>>>>> then delete the snapshot (and that delete needs to be atomic).
 If the
>>>>>>> snapshot data was never deleted, then subsequent accessors (even
readers)
>>>>>>> of the data rows need to do the rollback of the previous transaction
>>>>>>> themselves before they can read/write this data.  They do the
rollback by
>>>>>>> just overwriting the current values with what is in the snapshot.
 It
>>>>>>> offloads the work of the rollback to the next worker that accesses
the
>>>>>>> data.  This approach probably needs an generic/high-level programming
layer
>>>>>>> to handle all of the details and complexity, and it doesn't seem
like it
>>>>>>> was ever added to Cages.
>>>>>>>
>>>>>>>
>>>>>>>  Are there other approaches or best practices that I missed?
 I
>>>>>>> would be very interested in hearing any opinions from those who
have
>>>>>>> tackled these problems before.
>>>>>>>
>>>>>>>  Thanks!
>>>>>>>  John
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>   --
>>>>>> Jérémy
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> http://twitter.com/tjake
>>>
>>
>>
>
>
> --
> http://twitter.com/tjake
>

Mime
View raw message