ok, I think it's better to understand it this way, then it is really simple
and intuitive:
my proposed way of counter update can be simply seen as a combination of
regular columns + current counter columns:
regular column : [ value: "wipes out every bucket to nil" , clock: epoch
number]
then within each epoch, counter updates work as currently implemented
On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy123@gmail.com> wrote:
> I think this approach also works for your scenario:
>
> I thought that the issue is only concerned with merging within the same
> leader; but you pointed out
> that a similar merging happens between leaders too, now I see that the same
> rules on epoch number
> also applies to inter-leader data merging, specifically in your case:
>
>
> everyone starts with epoch of 0, ( they should be same, if not, it also
> works, we just consider them to be representing diffferent time snapshots of
> the same counter state)
>
> node A add 1 clock: 0.100 (epoch = 0, clock number = 100)
>
> node A delete clock: 0.200
>
> node B add 2 clock: 0.300
>
> node A gets B's state: add 2 clock 0.300, but rejects it because A has
> already produced a delete, with epoch of 0, so A considers epoch 0 already
> ended, it won't accept any replicated state with epoch < 1.
>
> node B gets A's delete 0.200, it zeros its own count of "2", and
> updates its future expected epoch to 1.
>
> at this time, the state of system is:
> node A expected epoch =1 [A:nil] [B:nil]
> same for node B
>
>
>
> let's say we have following further writes:
>
> node B add 3 clock 1.400
>
> node A adds 4 clock 1.500
>
> node B receives A's add 4, node B updates its copy of A
> node A receives B's add 3, updates its copy of B
>
>
> then state is:
> node A , expected epoch == 1 [A:4 clock=400] [B:3 clock=500]
> node B same
>
>
>
> generally I think it should be complete if we add the following rule for
> inter-leader replication:
>
> each leader keeps a var in memory (and also persist to sstable when
> flushing) expected_epoch , initially set to 0
>
> node P does:
> on receiving updates from node Q
> if Q.expected_epoch > P.expected_epoch
> /** an epoch bump inherently means a previous delete, which
> we probably missed , so we need to apply the delete
> a delete is global to all leaders, so apply it on all my
> replicas **/
> for all leaders in my vector
> count = nil
>
> P.expected_epoch = Q.expected_epoch
> if Q.expected_epoch == P.expected_epoch
> update P's copy of Q according to standard rules
> /** if Q.expected_epoch < P.expected_epoch , that means Q is less
> up to date than us, just ignore
>
>
> replicate_on_write(to Q):
> if P.operation == delete
> P.expected_epoch ++
> set all my copies of all leaders to nil
> send to Q ( P.total , P.expected_epoch)
>
>
>
>
> overall I don't think delete being not commutative is a fundamental blocker
> : regular columns are also not commutative, yet we achieve stable result no
> matter what order they are applied, because of the ordering rule used in
> reconciliation; here we just need to find a similar ordering rule. the epoch
> thing could be a step on this direction.
>
>
> Thanks
> Yang
>
>
>
>
> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>
>> I don't think that's bulletproof either. For instance, what if the
>> two adds go to replica 1 but the delete to replica 2?
>>
>> Bottom line (and this was discussed on the original
>> delete-for-counters ticket,
>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
>> are not fully commutative which makes them fragile.
>>
>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy123@gmail.com> wrote:
>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
>> > indicates, the problem with counter delete is in scenarios like the
>> > following:
>> > add 1, clock 100
>> > delete , clock 200
>> > add 2 , clock 300
>> > if the 1st and 3rd operations are merged in SStable compaction, then we
>> > have
>> > delete clock 200
>> > add 3, clock 300
>> > which shows wrong result.
>> >
>> > I think a relatively simple extension can be used to complete fix this
>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
>> clock,
>> > so that
>> > 1) a delete operation increases future epoch number by 1
>> > 2) merging of delta adds can be between only deltas of the same
>> epoch,
>> > deltas of older epoch are simply ignored during merging. merged result
>> keeps
>> > the epoch number of the newest seen.
>> > other operations remain the same as current. note that the above 2 rules
>> are
>> > only concerned with merging within the deltas on the leader, and not
>> related
>> > to the replicated count, which is a simple final state, and observes the
>> > rule of "larger clock trumps". naturally the ordering rule is:
>> epoch1.clock1
>> >> epoch2.clock2 iff epoch1 > epoch2 || epoch1 == epoch2 && clock1
>
>> clock2
>> > intuitively "epoch" can be seen as the serial number on a new
>> "incarnation"
>> > of a counter.
>> >
>> > code change should be mostly localized to CounterColumn.reconcile(),
>> > although, if an update does not find existing entry in memtable, we
>> need to
>> > go to sstable to fetch any possible epoch number, so
>> > compared to current write path, in the "no replicate-on-write" case, we
>> need
>> > to add a read to sstable. but in the "replicate-on-write" case, we
>> already
>> > read that, so it's no extra time cost. "no replicate-on-write" is not a
>> > very useful setup in reality anyway.
>> >
>> > does this sound a feasible way? if this works, expiring counter should
>> > also naturally work.
>> >
>> > Thanks
>> > Yang
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>
|