cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1072) Increment counters
Date Fri, 13 Aug 2010 14:16:23 GMT


Sylvain Lebresne commented on CASSANDRA-1072:

Having given this some thought, I see a few important difficulties/problems (in
the context of Cassandra) with a version vector (or vector clock, even though I
believe it is ill-suited for counters) approach:

* as said, it involves a read-before-write. But I think it's worse than that
  in that it involves an atomic 'read-then-write' (but maybe you already included
  that in your comment, in which case I disagree on the 'pretty low price').
  I, at least, don't see how, without atomicity between the read and the
  write, you can ensure that you don't fall into one of:
  *# ending up with two different values having the exact same vector
  *# missing some increments
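
To make the two failure modes above concrete, here is a minimal sketch in
Python. It is only an illustration under assumptions that are not in the
patch: a single in-memory dict stands in for a replica, and the helper names
(read, write, bump) are hypothetical.

```python
# Toy model: one stored (value, clock) pair, no atomicity between read and write.
store = {"value": 0, "clock": [0, 0, 0]}

def read(s):
    return s["value"], list(s["clock"])

def write(s, value, clock):
    s["value"], s["clock"] = value, clock

def bump(clock, node):
    # Return a copy of the clock with this node's entry advanced by one.
    new = list(clock)
    new[node] += 1
    return new

# Two increments (+1 and +5) both go through node 0, and both read
# before either of them writes.
va, ca = read(store)
vb, cb = read(store)
update_a = (va + 1, bump(ca, 0))
update_b = (vb + 5, bump(cb, 0))

# Failure 1: two different values carrying the exact same vector.
same_clock_different_values = (
    update_a[1] == update_b[1] and update_a[0] != update_b[0]
)

# Failure 2: the second write overwrites the first, so an increment is lost.
write(store, *update_a)
write(store, *update_b)
lost_increment = store["value"] != 1 + 5
```

Here both updates carry the clock [1, 0, 0] with values 1 and 5, and the
stored total ends at 5 instead of 6.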

* explains clearly why, in such an approach,
  you'll have to keep some history to be able to resolve conflicts later. But
  it only shows the beginning of the problems. Take, as in this post, the
  example of 3 nodes (and RF=3). Because things can get lost, it could very
  well be that on node 1 you end up receiving, say, both:
    ** the value 3 with a version clock [0, 1, 2]
    ** the value 3 with a version clock [0, 2, 1]
  and nothing before that (because node 1 was dead during those first updates).
  Then you're kinda screwed. You could get back on your feet if you had the
  update history of nodes 2 and 3, but that means that not only do you
  have to keep a history of updates for some time, you also have to send this
  history between nodes. Pretty sure that'll get messy.
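
As a small illustration of why the two clocks above cannot be ordered, here
is a sketch of the usual element-wise comparison of version vectors. The
function names are hypothetical and this is not code from the patch.

```python
def dominates(a, b):
    # a dominates b if every entry of a is >= the matching entry of b
    # and the two clocks are not identical.
    return a != b and all(x >= y for x, y in zip(a, b))

def compare(a, b):
    if a == b:
        return "equal"
    if dominates(a, b):
        return "a newer"
    if dominates(b, a):
        return "b newer"
    return "concurrent"

# The two updates node 1 receives in the example above.
result = compare([0, 1, 2], [0, 2, 1])
```

Here result is "concurrent": neither clock dominates the other, so the clocks
alone give node 1 no way to pick one value or to merge the two.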

If I'm wrong here, please feel free to correct me, I'd love to be wrong.

Now, about the idea currently implemented, I want to believe that we could
partly lift (what I believe to be) the main limitation, namely that we have to
read at CL.ALL, by using some kind of repair-on-write (btw, the idea is not
mine, but Kelvin's).
That is, a write would be:
  # send the write to one chosen replica.
  # this replica writes the increment locally.
  # it then reads locally (which ensures that it has the current correct
  count for the part of the counter it is responsible for)
  # it sends what it has read to the other replicas and waits for a number of
  acks that depends on CL.

By choosing how many replicas we wait for in step 4 before answering the whole
write query, we get the usual consistency level guarantees (that is, a
write then a read at QUORUM ensures consistency).
Sure, a write will require a local read (but only 1 btw), but in that context
I do believe it's a low price to pay (plus CL.ONE doesn't have to wait for it to
succeed, if high write throughput counters are what you really need).
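
The four write steps above could be sketched roughly as follows, in Python.
This is a toy in-memory model, not the patch: the Replica class, the function
names, counting the chosen replica itself towards the CL, and merging parts
with max (which assumes non-negative increments) are all assumptions made for
illustration.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.parts = {}  # owner name -> that owner's partial count as seen here

    def local_increment(self, delta):
        # Steps 2 and 3: write the increment locally, then read it back.
        self.parts[self.name] = self.parts.get(self.name, 0) + delta
        return self.parts[self.name]

    def apply_repair(self, owner, count):
        # Assumption: increments are non-negative, so the larger count is newer.
        self.parts[owner] = max(self.parts.get(owner, 0), count)
        return True  # ack

def required_acks(cl, rf):
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def write(replicas, chosen, delta, cl, unreachable=()):
    count = chosen.local_increment(delta)
    acks = 1  # assumption: the chosen replica counts towards the CL
    for r in replicas:  # step 4: send what was read to the other replicas
        if r is chosen or r.name in unreachable:
            continue
        if r.apply_repair(chosen.name, count):
            acks += 1
    return acks >= required_acks(cl, len(replicas))

def read(queried):
    # Merge the parts seen by the queried replicas, then sum them.
    owners = {o for r in queried for o in r.parts}
    return sum(max(r.parts.get(o, 0) for r in queried) for o in owners)

a, b, c = Replica("a"), Replica("b"), Replica("c")
nodes = [a, b, c]
ok1 = write(nodes, a, 3, "QUORUM", unreachable={"c"})
ok2 = write(nodes, b, 2, "QUORUM", unreachable={"c"})
quorum_read = read([b, c])  # any two replicas include one that acked
stale_read = read([c])      # a single replica may have missed both writes
```

In this toy run both QUORUM writes succeed with c unreachable, a read over
any two replicas sees the full count of 5, and a read from c alone sees 0.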

It is true that this will require a special verb handler for the operation.
But there is nothing incredibly new in the operations it will perform, so I
believe that it could be written without much new specific code.

It is also true that point 1 makes the approach slightly more fragile than
usual writes, as the chosen replica may time out on us even though other nodes
wouldn't have, but after all it makes it only as fragile as our reads (since
we only ever ask a single chosen node for the actual value).

Sorry for the long post, especially if that turns out to be stupid.

> Increment counters
> ------------------
>                 Key: CASSANDRA-1072
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072-2.patch, CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
> Break out the increment counters out of CASSANDRA-580. Classes are shared between the
two features but without the plain version vector code the changeset becomes smaller and more

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
