cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: one way to make counter delete work better
Date Mon, 13 Jun 2011 16:04:48 GMT
I don't think that's bulletproof either.  For instance, what if the
two adds go to replica 1 but the delete to replica 2?

Bottom line (and this was discussed on the original delete-for-counters
ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101), counter
deletes are not fully commutative, which makes them fragile.
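
A toy sketch of one possible reading of that counter-example, applied to the epoch scheme quoted below. Everything here is an assumption made for illustration: the `State` tuple, the helper names, and the idea that a replica which never saw the delete keeps stamping its adds with the old epoch. It is not Cassandra code.

```python
from typing import NamedTuple


class State(NamedTuple):
    """Hypothetical per-replica counter state: (epoch, clock, value)."""
    epoch: int
    clock: int
    value: int


def local_add(state: State, amount: int, clock: int) -> State:
    # An add keeps whatever epoch this replica currently knows about.
    return State(state.epoch, clock, state.value + amount)


def local_delete(state: State, clock: int) -> State:
    # A delete starts a new epoch and resets the count.
    return State(state.epoch + 1, clock, 0)


def reconcile(a: State, b: State) -> State:
    # "Larger epoch.clock trumps": compare (epoch, clock) lexicographically.
    return a if (a.epoch, a.clock) >= (b.epoch, b.clock) else b


empty = State(epoch=0, clock=0, value=0)

# Replica 1 receives both adds (clocks 100 and 300) but never the delete.
replica1 = local_add(local_add(empty, 1, 100), 2, 300)

# Replica 2 receives only the delete (clock 200).
replica2 = local_delete(empty, 200)

result = reconcile(replica1, replica2)
print(result)
```

In this model replica 2's state wins because its epoch is higher, so the reconciled value is 0, while applying the three operations in clock order would give 2: the add at clock 300 was stamped with the old epoch and is lost.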

On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy123@gmail.com> wrote:
> as https://issues.apache.org/jira/browse/CASSANDRA-2101
> indicates, the problem with counter delete is  in scenarios like the
> following:
> add 1, clock 100
> delete , clock 200
> add  2 , clock 300
> if the 1st and 3rd operations are merged in SSTable compaction, then we
> have
> delete  clock 200
> add 3,  clock 300
> which shows the wrong result (3 instead of 2).
>
> I think a relatively simple extension can be used to completely fix this
> issue: similar to ZooKeeper, we can prefix an "epoch" number to the clock,
> so that
>    1) a delete operation increases the future epoch number by 1
>    2) merging of delta adds can only happen between deltas of the same epoch;
> deltas of an older epoch are simply ignored during merging. the merged result
> keeps the newest epoch number seen.
> other operations remain the same as they are now. note that the above 2 rules
> are only concerned with merging the deltas on the leader, and are not related
> to the replicated count, which is a simple final state and observes the
> rule of "larger clock trumps". naturally the ordering rule is:
>    epoch1.clock1 > epoch2.clock2  iff
>    epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
> intuitively "epoch" can be seen as the serial number on a new "incarnation"
> of a counter.
>
> the code change should be mostly localized to CounterColumn.reconcile(),
> although, if an update does not find an existing entry in the memtable, we
> need to go to the sstable to fetch any possible epoch number. so compared to
> the current write path, in the "no replicate-on-write" case we need to add a
> read from the sstable, but in the "replicate-on-write" case we already do
> that read, so there is no extra time cost.  "no replicate-on-write" is not a
> very useful setup in reality anyway.
>
> does this sound like a feasible approach?   if this works, expiring counters
> should also naturally work.
>
> Thanks
> Yang
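
The two rules and the ordering in the quoted proposal can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `CounterColumn.reconcile()` logic: `EpochClock`, `Delta`, `stamp_ops`, and `merge_deltas` are hypothetical names, and all operations are assumed to arrive at a single leader in clock order.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True, order=True)
class EpochClock:
    """Epoch-prefixed clock; order=True compares (epoch, clock) lexicographically."""
    epoch: int
    clock: int


@dataclass(frozen=True)
class Delta:
    stamp: EpochClock
    value: int


def stamp_ops(ops: Iterable[Tuple[str, Optional[int], int]]) -> List[Delta]:
    """Rule 1: each delete bumps the epoch used by all later adds.

    ops are ("add", value, clock) or ("delete", None, clock), in leader order.
    """
    epoch = 0
    deltas = []
    for kind, value, clock in ops:
        if kind == "delete":
            epoch += 1
        else:
            deltas.append(Delta(EpochClock(epoch, clock), value))
    return deltas


def merge_deltas(deltas: Iterable[Delta]) -> Delta:
    """Rule 2: sum only deltas of the newest epoch; older epochs are ignored.

    Expects a non-empty collection of deltas.
    """
    deltas = list(deltas)
    newest = max(d.stamp.epoch for d in deltas)
    live = [d for d in deltas if d.stamp.epoch == newest]
    newest_clock = max(d.stamp.clock for d in live)
    return Delta(EpochClock(newest, newest_clock), sum(d.value for d in live))


# The scenario from the message: add 1 @100, delete @200, add 2 @300.
merged = merge_deltas(stamp_ops([
    ("add", 1, 100),
    ("delete", None, 200),
    ("add", 2, 300),
]))
print(merged.value)  # 2, not 3: the add from the older epoch is dropped
```

Note that this only models merging on one node that has seen every operation; it says nothing about what happens when the delete and the adds reach different replicas.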



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
