cassandra-user mailing list archives

From Sylvain Lebresne <>
Subject Re: one idea on counters: simpler protocol due to FIFO
Date Tue, 31 May 2011 09:49:33 GMT
One more thing: if you keep one sub-count per coordinator,
that won't be fun in a 400-node cluster. Or, to put it another
way, you push onto the client the burden of making sure that
each counter only ever uses a reasonably small set of coordinators,
which can be a major headache for clients.
The choice of one sub-count per replica alleviates this problem
almost entirely ("almost" because changes in the ring topology may
still make the vector grow in size, but that growth is slow, much
more contained, and there are ideas to solve it).
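The footprint difference between the two schemes can be made concrete with a back-of-the-envelope sketch. The shard layout and byte sizes below are assumptions chosen purely for illustration, not Cassandra's actual on-disk format:

```python
# Rough sketch (hypothetical shard layout) comparing the size of one
# counter column under the two sharding schemes discussed above.

SHARD_BYTES = 16 + 8 + 8   # assumed: node id + serial/clock + count

def counter_size(num_shards):
    return num_shards * SHARD_BYTES

# One sub-count per coordinator: any node may coordinate a write, so a
# hot counter in a 400-node cluster can accumulate up to 400 shards.
print(counter_size(400))   # 12800 bytes for a single counter column

# One sub-count per replica: the vector is bounded by the replication
# factor (plus slow growth on topology changes), e.g. RF = 3.
print(counter_size(3))     # 96 bytes
```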


On Tue, May 31, 2011 at 11:27 AM, Sylvain Lebresne <> wrote:
>> we claim that, because of the FIFO assumption, node 2 here has seen
>> all messages from coord 10 with a serial <= 9, and node 3 has seen
>> all messages with serial <= 14, so that node 2's history is a prefix of
>> node 3's. i.e. what we read out immediately from node 2
>> represents a value that we could have read out from node 3 at a past
>> instant; the current value of node 2 is thus a valid value,
>> although slightly stale.
> The FIFO assumption doesn't hold because of node failures.
> Let's take an example. You have three nodes, A, B and C, and RF=2. You do 10
> increments on a counter (+1) using A as coordinator (so the serial for A
> at the end will be 9), and B and C are the replicas for that counter.
> Now say that B was dead while the increments with serials 3 to 7 were emitted.
> In other words, it sees serials 0 to 2, then 8 and 9. At the end of
> this, B has, for node A, the serial number 9 and the value 5.
> And suppose that C was dead for different increments, say during
> increments 0, 1 and 2, so it has seen only the serials from 3 to 9. In
> the end, it has for A the serial 9 and the value 7.
> At this point, we may have done all increments successfully at CL.ONE,
> but good luck returning the value 10 for a read at CL.ALL.
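This failure scenario is easy to reproduce in a few lines. The sketch below is a simplified model of the proposed per-coordinator scheme (the `Replica` class and its fields are illustrative assumptions, not Cassandra code): each replica tracks only the highest serial seen per coordinator plus a running sum, so missed increments are silently lost:

```python
# Simplified model of a counter replica under the per-coordinator
# serial scheme proposed in the thread (illustrative, not Cassandra).
class Replica:
    def __init__(self, name):
        self.name = name
        self.serial = -1   # highest serial seen from coordinator A
        self.value = 0     # sum of the increments actually received

    def receive(self, serial, delta):
        self.serial = max(self.serial, serial)
        self.value += delta

b, c = Replica("B"), Replica("C")

for serial in range(10):           # 10 increments of +1 via coordinator A
    if not (3 <= serial <= 7):     # B is down for serials 3..7
        b.receive(serial, 1)
    if serial >= 3:                # C is down for serials 0..2
        c.receive(serial, 1)

# Both replicas end with serial 9, yet neither holds the true total
# of 10, and nothing in the (serial, value) pair tells a CL.ALL read
# which increments each replica missed.
print(b.serial, b.value)  # 9 5
print(c.serial, c.value)  # 9 7
```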
> Making the FIFO assumption hold in the face of node failures is possible,
> but it's a complicated problem by itself. You basically have to make
> sure that when a node is down and comes back up, it catches up on
> everything it missed while offline before accepting anything new
> (which, btw, systems implementing Paxos for instance have to do). But this
> requires a much deeper change in design than what we want to do in
> Cassandra (and it has drawbacks anyway).
>> one slight technical difficulty is that when repair happens, let's say
>> we use node 3's value to repair node 2: then node 3 sends over
>> total_count=140, serial=14, and node 2 sees that its serial 9 < 14, so
>> it updates its current record to count=140, serial=14. moreover, node
>> 2 should wipe out old records. this can be done by placing an
>> "invalidator" record, so that when node 2 tallies up later, it
>> ignores all records with a serial smaller than the "invalidator"'s; or we
>> can place a fake record with count = -100, effectively wiping out
>> the effects of previous deltas.
> For the record, this is not "one slight technical difficulty". The
> "invalidator" record won't work because one important thing to notice is
> that we never know, when we compact, in which order we merge the different
> records. So say in the sstables you have, for a given node:
>  - serial 0: 6   <--- these are 'deltas'
>  - serial 1: 3   <-|
>  - invalidate < 2
>  - serial 2: 13  <--- this one is supposed to be the total count
>  - serial 3: 4   <--- this one is a 'delta'
> If you start by merging 0, 1 and 3, which you can, you're screwed.
> Inserting a fake record with -100 may work, but it requires a
> synchronized read before a write (to find which value the fake record
> should have) every time. The 'synchronized read before write' is
> probably not so easy to do, and you'll have quite a bit of code clutter
> to distinguish between regular writes and repair writes, which you would
> have to do but Cassandra doesn't so far.
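The merge-order problem can be demonstrated directly. The record model below (tuples of kind, serial, value, plus a toy `merge_deltas` compaction step) is a hypothetical sketch of the scheme from the thread, not Cassandra internals. The key point it shows: once deltas are merged, the result carries only the highest serial, so a later-applied invalidator can no longer drop the stale components hidden inside it:

```python
# Hypothetical record model: (kind, serial, value) tuples.

def merge_deltas(records):
    # Compaction merges several deltas into one record, keeping the
    # highest serial and summing the values.
    return ("delta", max(r[1] for r in records), sum(r[2] for r in records))

def tally(records, invalidate_below):
    # Final read: the invalidator drops anything with a smaller serial.
    return sum(v for _, serial, v in records if serial >= invalidate_below)

deltas = [("delta", 0, 6), ("delta", 1, 3), ("delta", 3, 4)]
total  = ("total", 2, 13)   # serial 2 carries the repaired total count

# Order the scheme assumes: apply the invalidator, then tally.
correct = tally(deltas + [total], invalidate_below=2)
print(correct)   # 17  (13 + 4: serials 0 and 1 correctly dropped)

# Order compaction may actually use: merge serials 0, 1 and 3 first.
merged = merge_deltas(deltas)         # ("delta", 3, 13)
wrong  = tally([merged, total], invalidate_below=2)
print(wrong)     # 26  (the stale 6 and 3 sneak in under serial 3)
```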
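To see why the fake-record alternative forces a synchronized read before every repair write, consider this sketch. The `CounterShard` class and its locking are illustrative assumptions: the compensating delta depends on the replica's current tally, so the repairing node must read that tally atomically with the write:

```python
import threading

# Illustrative shard holding (serial, delta) records for one node.
class CounterShard:
    def __init__(self):
        self.records = []          # list of (serial, delta)
        self.lock = threading.Lock()

    def tally(self):
        return sum(d for _, d in self.records)

    def repair_to(self, serial, total):
        # The read and the write must be one atomic step: an increment
        # landing between them would make the fake delta wrong.
        with self.lock:
            fake_delta = total - self.tally()   # e.g. 140 - 40 = +100
            self.records.append((serial, fake_delta))

shard = CounterShard()
shard.records = [(s, 10) for s in range(4)]   # current tally: 40
shard.repair_to(serial=14, total=140)         # repair from the 140 total
print(shard.tally())   # 140
```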
> --
> Sylvain
