cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yang <>
Subject Re: one idea on counters: simpler protocol due to FIFO
Date Tue, 31 May 2011 19:06:57 GMT
Thanks for the comments. response inline and marked in blue, for easier

On Tue, May 31, 2011 at 2:27 AM, Sylvain Lebresne <>wrote:

> Making the FIFO assumption stand in face of node failure is possible,
> but it's a complicated problem by itself. You basically have to make
> sure that when a node is down and comes back up, it will catch up on
> everything it has missed while off-line before accepting anything new
> (which btw, system implementing Paxos for instance have to do). But this
> require much deeper change in design that what we want to do in
> Cassandra (and it has drawbacks anyway).
"catch up on everything it has missed", not really, the case of counters is
different from Paxos in that our
state (equivalent to the Paxos concept of "law" ) is very small: just a
single count, while Paxos state is the entire law, so instead of
transferring the huge
law, Paxos transmits only the decrees (delta increments) that a node missed.
in our case, it makes more sense to transmit the state (counter value) in
the case of
reconnect/wake-up. -------- HH also seems to provide a way to "catch up on
everything it has missed" but some work needs to be done to ensure that the
messages from the HH replaying box are inserted in the same FIFO pipe as the
original coordinator.

in fact, this re-sync of state is simply what the current implementation
does for every add request, the FIFO approach only does it in abnormal
cases, and resorts
to the cheaper local updates in normal cases.  this way the normal write
path is kept same as regular writes, and the rare connection lost/node wake
up case is taken care of by localized code not having to do with normal
write path.

( the part that is  tricky to do is to "ramp up " enough new updates to
start from the state that a woken up node recently learnt: for example node
A wakes up, it learns from node B that coord 10's latest count=140,
serial=14, but node A needs to accumulate some individual add requests
serial = 12, 13, 14,...maybe  up to 20, to make sure that it can work from
state serial=14. I can look at how Zookeeper achieves FIFO assumption
despite node failures, it should be possible to copy their approach )

> > one slight technical difficulty is that when repair happens, let's say
> > we use node 3's value to repair node 2: , then node 3 sends over the
> > total_count=140, serial=14, and node 2 sees that its serial 9<14, so
> > it updates its current record to count=140, serial=14. moreover, node
> > 2 should wipe out old records, this can be done by placing an
> > "invalidator" record , so so when node 2 does tally up later, it
> > ignores all records with serial smaller than the "invalidator", or we
> > can place a fake record with count= - 100, and effectively wipe out
> > the effects of previous deltas.
> For the record, this is not "one slight technical difficulty". The
> "invalidator" record won't work because one important thing to notice is
> that we never now when we compact in which order we merge the different
> record. So say in the sstables you have for a given node:
>  - serial 0: 6   <--- those are 'delta'
>  - serial 1: 3   <-|
>  - invalidate < 2
>  - serial 2: 13  <--- this one is suppose to be the total count
>  - serial 3: 4   <--- this one is a 'delta'
> if you start merging 0, 1 and 3, which you can, you're screwed.
you are right, "invalidator" is just "delete", the above problem has been
described for delete

> Inserting a fake record with -100 may work but it requires a
> synchronized read before a write (to find which value the fake record
> should have) every time. The 'synchronized read before write' is
> probably not so easy to do and you'll have quite a bit of code clutter
> to distinguish between regular writes and repair writes, which you would
> have to but Cassandra doesn't so far.

--- actually doesn't current counter update also have the requirement to do
a read-update-write pattern? (otherwise you could lose updates if
you have 2 deltas arriving at the same time)
if current counter update works, the insertion of fake -100 will also work.
in fact, it seems that the
Table.apply() {
                synchronized (indexLockFor(mutation.key()))
takes care of this. the sstable tally up work can be inside

on the other hand,
you are right that the repair write and regular write behavior of
CounterColumn become different now, but it seems that
the main code path does not need a special-case: you only need to attach a
repair flag to the mutation when sending out the repair command,
it's up to the column to interpret this. anyway it's probably too inflexible
to strictly mandate that all repair behaviors should be same as write, or
replays of write. reconcile() behavior is already Column-specific, repair
could also be .

regarding the 400 long coordinator list issue, since Column sizes are
currently allowed to grow to megabytes , is this really a big issue? it
seems to me that the work is simply summing up 400 numbers in memory, so it
won't cause a big issue.


> --
> Sylvain

View raw message