cassandra-user mailing list archives

From Sylvain Lebresne <>
Subject Re: one idea on counters: simpler protocol due to FIFO
Date Tue, 31 May 2011 09:27:24 GMT
> we claim that, because of the FIFO assumption, here node 2 has seen
> all messages from coord 10 with a serial <=9, and node 3 has seen
> all messages with serial <=14, so that node 2's history is a prefix of
> that of node 3. i.e. what we read out immediately from node 2
> represents a value that we could have read out from a past instant
> from node 3, i.e. the current value of node 2 is a valid value,
> although slightly stale.

The FIFO assumption doesn't stand because of node failures.

Let's take an example. You have three nodes, A, B, C and RF=2. You do 10
increments on a counter (+1) using A as coordinator (so the serial for A
at the end will be 9), and B and C are the replicas for that counter.

Now say that B was dead while the increments with serials 3 to 7 were
emitted. In other words, it saw serials 0 to 2 and then 8 and 9. At the
end of this, B has for node A the serial number 9 and the value 5.

And suppose that C was dead for different increments, say during
increments 0, 1 and 2. So it has seen only serials 3 to 9, and in the
end it has for A the serial 9 and the value 7.

At this point, we may have done all increments successfully at CL.ONE,
but good luck returning the value 10 for a read at CL.ALL.
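The scenario above can be sketched as a toy simulation (not Cassandra code; the `replicate` helper and its names are made up for illustration):

```python
# Toy simulation of the scenario above: each replica keeps, per
# coordinator, the highest serial it has seen and the sum of the
# increments it actually received. The (broken) FIFO assumption would
# read "serial 9 seen" as "serials 0..8 seen too".
def replicate(missed):
    seen = [s for s in range(10) if s not in missed]  # 10 increments, serials 0..9
    top_serial = max(seen)
    value = len(seen)  # each increment is +1
    return top_serial, value

b = replicate(missed={3, 4, 5, 6, 7})  # B was down for serials 3..7
c = replicate(missed={0, 1, 2})        # C was down for serials 0..2

print(b)  # (9, 5)
print(c)  # (9, 7)
```

Both replicas report serial 9 yet disagree on the value, and neither holds the true count of 10, so the serial alone cannot reconcile a CL.ALL read.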

Making the FIFO assumption hold in the face of node failures is
possible, but it's a complicated problem by itself. You basically have
to make sure that when a node is down and comes back up, it catches up
on everything it missed while off-line before accepting anything new
(which, btw, systems implementing Paxos have to do). But this requires
a much deeper change in design than what we want to do in Cassandra
(and it has drawbacks anyway).

> one slight technical difficulty is that when repair happens, let's say
> we use node 3's value to repair node 2: then node 3 sends over the
> total_count=140, serial=14, and node 2 sees that its serial 9<14, so
> it updates its current record to count=140, serial=14. moreover, node
> 2 should wipe out old records, this can be done by placing an
> "invalidator" record, so when node 2 does tally up later, it
> ignores all records with serial smaller than the "invalidator", or we
> can place a fake record with count=-100, and effectively wipe out
> the effects of previous deltas.

For the record, this is not "one slight technical difficulty". The
"invalidator" record won't work because one important thing to notice
is that we never know, when we compact, in which order we merge the
different records. So say the sstables contain, for a given node:
  - serial 0: 6   <--- those are 'delta'
  - serial 1: 3   <-|
  - invalidate < 2
  - serial 2: 13  <--- this one is supposed to be the total count
  - serial 3: 4   <--- this one is a 'delta'
If compaction starts by merging serials 0, 1 and 3, which it can,
you're screwed.
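A minimal sketch of that failure mode, using the numbers from the list above (hypothetical record layout, not Cassandra's actual storage format):

```python
# serial -> count; serial 2 is meant to be a repaired total,
# the other records are deltas.
records = {0: 6, 1: 3, 2: 13, 3: 4}

# Intended semantics: "invalidate < 2" is applied first, so only
# records with serial >= 2 are tallied.
correct = records[2] + records[3]  # 13 + 4 = 17

# But compaction may merge serials 0, 1 and 3 first, producing one
# summed record carrying the top serial, 3. The invalidator "< 2"
# no longer matches that record, so the old deltas survive inside it:
premerged = records[0] + records[1] + records[3]  # 13, tagged serial 3
wrong = premerged + records[2]  # 26: serials 0 and 1 counted anyway

print(correct, wrong)  # 17 26
```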

Inserting a fake record with -100 may work, but it requires a
synchronized read before a write (to find which value the fake record
should have) every time. The 'synchronized read before write' is
probably not so easy to do, and you'd have quite a bit of code clutter
to distinguish between regular writes and repair writes, a distinction
Cassandra doesn't make so far.
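To illustrate why the synchronized read is unavoidable in that scheme, a sketch (the `repair` helper is hypothetical, not Cassandra's API):

```python
# Repairing with a fake cancelling record: to build the record that
# wipes out the old deltas, the repairer must first read the replica's
# current local sum -- that read is the costly synchronized step.
def repair(local_deltas, total_count):
    current = sum(local_deltas)       # synchronized read before the write
    local_deltas.append(-current)     # fake record cancelling old deltas
    local_deltas.append(total_count)  # repaired total from the peer
    return sum(local_deltas)

print(repair([6, 3, 4], 140))  # 140
```

The value of the fake record depends entirely on what the replica currently holds, which is why it cannot be written blindly like a regular increment.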
