cassandra-user mailing list archives

From Jonathan Ellis <>
Subject Re: one way to make counter delete work better
Date Mon, 13 Jun 2011 16:04:48 GMT
I don't think that's bulletproof either.  For instance, what if the
two adds go to replica 1 but the delete to replica 2?

Bottom line (and this was discussed on the original
delete-for-counters ticket): counter deletes are not fully
commutative, which makes them fragile.

On Mon, Jun 13, 2011 at 10:54 AM, Yang <> wrote:
> as
> indicates, the problem with counter delete is  in scenarios like the
> following:
> add 1, clock 100
> delete , clock 200
> add  2 , clock 300
> if the 1st and 3rd operations are merged in SStable compaction, then we
> have
> delete  clock 200
> add 3,  clock 300
> which shows wrong result.
> I think a relatively simple extension can be used to completely fix this
> issue: similar to ZooKeeper, we can prefix an "Epoch" number to the clock,
> so that
>    1) a delete operation increases future epoch number by 1
>    2) merging of delta adds can be between only deltas of the same epoch,
> deltas of older epoch are simply ignored during merging. merged result keeps
> the epoch number of the newest seen.
> other operations remain the same as current. note that the above 2 rules are
> only concerned with merging within the deltas on the leader, and not related
> to the replicated count, which is a simple final state, and observes the
> rule of "larger clock trumps". naturally the ordering rule is:
>    epoch1.clock1 > epoch2.clock2  iff  epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
> intuitively "epoch" can be seen as the serial number on a new "incarnation"
> of a counter.
> code change should be mostly localized to CounterColumn.reconcile(),
>  although, if an update does not find existing entry in memtable, we need to
> go to sstable to fetch any possible epoch number, so
> compared to current write path, in the "no replicate-on-write" case, we need
> to add a read to sstable. but in the "replicate-on-write" case, we already
> read that, so it's no extra time cost.  "no replicate-on-write" is not a
> very useful setup in reality anyway.
> does this sound like a feasible way?   if this works, expiring counters should
> also naturally work.
> Thanks
> Yang
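
For readers following the thread, the quoted proposal can be sketched roughly as below. This is only an illustration of the two rules and the ordering as described in the quoted message, not Cassandra code: `EpochClock`, `Delta`, and `merge_deltas` are hypothetical names, and the sketch assumes all operations are seen by a single node in order.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True, order=True)
class EpochClock:
    """Hypothetical epoch-prefixed clock; compares by epoch first, then clock."""
    epoch: int
    clock: int


@dataclass(frozen=True)
class Delta:
    """Hypothetical counter delta tagged with the epoch it was written in."""
    value: int
    stamp: EpochClock


def merge_deltas(deltas: Iterable[Delta]) -> Delta:
    """Rule 2 of the proposal: only deltas of the newest epoch are summed;
    deltas of older epochs are ignored. The result keeps the newest stamp."""
    deltas = list(deltas)
    newest_epoch = max(d.stamp.epoch for d in deltas)
    live = [d for d in deltas if d.stamp.epoch == newest_epoch]
    return Delta(sum(d.value for d in live), max(d.stamp for d in live))


# The scenario from the message: add 1 @100, delete @200, add 2 @300.

# Without epochs: compaction merges the two adds into "add 3, clock 300",
# which is newer than the delete at clock 200, so the delete is lost.
merged_value, merged_clock, delete_clock = 1 + 2, max(100, 300), 200
naive_result = merged_value if merged_clock > delete_clock else 0

# With epochs: the delete bumps the epoch (rule 1), so the second add is
# written in epoch 1 and the first add (epoch 0) is ignored when merging.
first_add = Delta(1, EpochClock(epoch=0, clock=100))
second_add = Delta(2, EpochClock(epoch=1, clock=300))
epoch_result = merge_deltas([first_add, second_add])

print(naive_result)        # 3, the wrong result described in the message
print(epoch_result.value)  # 2, the first add was discarded with its epoch
```

Note that the sketch does not cover the objection raised in the reply above: if the two adds go to one replica and the delete to another, the replica taking the adds never sees the epoch bump.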

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
