incubator-cassandra-user mailing list archives

From Yang <teddyyyy...@gmail.com>
Subject Re: one way to make counter delete work better
Date Wed, 15 Jun 2011 02:02:56 GMT
I've almost got the code done; I should release it in a bit.



Your scenario is not a problem with the implementation, but really with the
definition of "same time". Remember that in a distributed system there is no
absolute physical time; time is just another way of saying "before or after".
In your scenario, since DCA and DCB are cut off and there are no messages
between them, you can NOT determine logically whether the delete is before
the +3 or after it. You may say "hey, the timestamp I gave the +3 is higher",
but DCA may say: "your timestamp is just drifted; actually my delete
happened later".

In fact, here is a stronger reason that you have to let go of the +3: it
might already have been merged with a +1 that happened, in physical time,
earlier than DCA's delete, and a +2 that happened after the delete. Now what
would you say about whether that merged +3 is before or after DCA's delete?
The only consistent way to order them is to say: "sorry, DCB: you missed the
delete; all your later +2 operations were applied to an earlier snapshot in
time, and the eventual result is the delete. In other words, it is futile to
update a dead epoch while others have started a new one." This is the same
dilemma you face during sstable merging.

Overall, I think it's easier to understand if we realize that once you
delete, all further edits on the counter are futile. An epoch is another way
of saying we create a completely new counter; the counter name we are using
is just a kind of alias.
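To make the "new incarnation" intuition concrete, here is a minimal sketch in Python (the EpochCounter class and its names are made up for illustration; this is not the actual Cassandra code):

```python
# A minimal sketch of the idea above: a delete starts a new epoch (a new
# "incarnation" of the counter), and any update tagged with an older
# epoch is simply futile and dropped.

class EpochCounter:
    def __init__(self):
        self.epoch = 0   # incarnation number; bumped by every delete
        self.clock = 0   # ordinary timestamp within the current epoch
        self.total = 0

    def add(self, delta, epoch, clock):
        if epoch < self.epoch:
            return                      # update on a dead epoch: ignored
        if epoch > self.epoch:
            self.epoch, self.total = epoch, 0   # implied delete we missed
        self.total += delta
        self.clock = max(self.clock, clock)

    def delete(self):
        self.epoch += 1                 # start a fresh incarnation
        self.total = 0
        self.clock = 0

c = EpochCounter()
c.add(3, epoch=0, clock=100)
c.delete()                    # epoch is now 1
c.add(3, epoch=0, clock=300)  # a late +3 from the dead epoch is dropped
print(c.total)                # 0
```

The late +3 is ignored even though its clock is higher, which is exactly the "futile to update a dead epoch" rule.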


yang


On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal, but let me consider a
> scenario that we have to be able to handle: consider two nodes A and B
> (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose
> you do a +2 increment on c that both nodes get. Now let's say you have a
> network split and the connection between your 2 data centers fails. In
> DCA you delete c; only A gets it. In DCB, you do more increments on c
> (say +3); only B gets it. The partition can last for hours.
> For deletion to work, we would need that whenever the network partition
> is resolved, both nodes eventually agree on the value 3 (i.e., only the
> second increment). I don't see how you could assign epoch numbers or
> anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy123@gmail.com> wrote:
> > ok, I think it's better to understand it this way, then it is really
> > simple and intuitive:
> > my proposed way of counter update can simply be seen as a combination
> > of regular columns + current counter columns:
> > regular column:  [ value: "wipes out every bucket to nil",
> >                    clock: epoch number ]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy123@gmail.com> wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue was only concerned with merging within the
> >> same leader, but you pointed out that a similar merging happens
> >> between leaders too. now I see that the same rules on epoch numbers
> >> also apply to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with an epoch of 0 (they should be the same; if not,
> >> it also works, we just consider them to be representing different
> >> time snapshots of the same counter state)
> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
> >> node A      delete    clock:  0.200
> >> node B     add 2     clock:  0.300
> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because
> >> A has already produced a delete with epoch 0, so A considers epoch 0
> >> already ended; it won't accept any replicated state with epoch < 1.
> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A     expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,    updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1    [A:4  clock=500] [B:3   clock=400]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule
> >> for inter-leader replication:
> >> each leader keeps a var in memory (and also persists it to the
> >> sstable when flushing): expected_epoch, initially set to 0.
> >>
> >> node P does:
> >> on receiving updates from node Q:
> >>     if Q.expected_epoch > P.expected_epoch:
> >>         /** an epoch bump inherently means a previous delete, which
> >>             we probably missed, so we need to apply the delete; a
> >>             delete is global to all leaders, so apply it on all my
> >>             replicas **/
> >>         for all leaders in my vector:
> >>             count = nil
> >>         P.expected_epoch = Q.expected_epoch
> >>     if Q.expected_epoch == P.expected_epoch:
> >>         update P's copy of Q according to standard rules
> >>     /** if Q.expected_epoch < P.expected_epoch, that means Q is less
> >>         up to date than us; just ignore **/
> >>
> >> replicate_on_write(to Q):
> >>     if P.operation == delete:
> >>         P.expected_epoch++
> >>         set all my copies of all leaders to nil
> >>     send to Q (P.total, P.expected_epoch)
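For illustration, the quoted rules can be turned into runnable form. This is a sketch under the pseudocode's assumptions: the field expected_epoch comes from the pseudocode, while the Leader class and the remaining names are made up for this example.

```python
# Illustrative sketch of the inter-leader replication rules, replaying
# the two-node partition scenario discussed above.

class Leader:
    def __init__(self, name, peers):
        self.name = name
        self.expected_epoch = 0
        # per-leader replica state: leader name -> (count, clock)
        self.replicas = {p: (0, 0) for p in peers}

    def receive(self, q_name, q_epoch, q_count, q_clock):
        if q_epoch > self.expected_epoch:
            # an epoch bump implies a delete we missed: apply it to
            # every leader's replica, then adopt the newer epoch
            for p in self.replicas:
                self.replicas[p] = (0, 0)
            self.expected_epoch = q_epoch
        if q_epoch == self.expected_epoch:
            _, clock = self.replicas[q_name]
            if q_clock > clock:            # standard "larger clock wins"
                self.replicas[q_name] = (q_count, q_clock)
        # q_epoch < self.expected_epoch: Q is stale, ignore it

    def delete(self):
        self.expected_epoch += 1           # the replicate_on_write delete path
        for p in self.replicas:
            self.replicas[p] = (0, 0)

    def total(self):
        return sum(count for count, _ in self.replicas.values())

A, B = Leader("A", ["A", "B"]), Leader("B", ["A", "B"])
A.replicas["A"] = (1, 100)     # node A: add 1, clock 0.100
A.delete()                     # node A: delete, epoch becomes 1
B.replicas["B"] = (2, 300)     # node B: add 2, clock 0.300 (partitioned)
A.receive("B", 0, 2, 300)      # rejected: A already ended epoch 0
B.receive("A", 1, 0, 0)        # B learns of the delete, zeroes everything
B.replicas["B"] = (3, 400)     # node B: add 3, clock 1.400
A.replicas["A"] = (4, 500)     # node A: add 4, clock 1.500
B.receive("A", 1, 4, 500)      # B updates its copy of A
A.receive("B", 1, 3, 400)      # A updates its copy of B
print(A.total(), B.total())    # 7 7
```

Both leaders converge to the same epoch and the same per-leader vector, matching the walkthrough above.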
> >>
> >>
> >>
> >> overall I don't think delete being non-commutative is a fundamental
> >> blocker: regular columns are also not commutative, yet we achieve a
> >> stable result no matter what order they are applied in, because of
> >> the ordering rule used in reconciliation; here we just need to find a
> >> similar ordering rule. the epoch scheme could be a step in that
> >> direction.
> >>
> >> Thanks
> >> Yang
> >>
> >>
> >>
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
> >>>
> >>> I don't think that's bulletproof either.  For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy123@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> >>> > indicates, the problem with counter delete is  in scenarios like the
> >>> > following:
> >>> > add 1, clock 100
> >>> > delete , clock 200
> >>> > add  2 , clock 300
> >>> > if the 1st and 3rd operations are merged in SStable compaction,
> >>> > then we have
> >>> > delete  clock 200
> >>> > add 3,  clock 300
> >>> > which shows the wrong result.
> >>> >
> >>> > I think a relatively simple extension can completely fix this
> >>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to
> >>> > the clock, so that
> >>> >    1) a delete operation increases the future epoch number by 1
> >>> >    2) merging of delta adds can happen only between deltas of the
> >>> > same epoch; deltas of an older epoch are simply ignored during
> >>> > merging. the merged result keeps the newest epoch number seen.
> >>> > other operations remain the same as current. note that the above
> >>> > 2 rules are only concerned with merging within the deltas on the
> >>> > leader, and not related to the replicated count, which is a simple
> >>> > final state and observes the rule of "larger clock trumps".
> >>> > naturally the ordering rule is:
> >>> >     epoch1.clock1 > epoch2.clock2  iff  epoch1 > epoch2 ||
> >>> >     (epoch1 == epoch2 && clock1 > clock2)
> >>> > intuitively "epoch" can be seen as the serial number on a new
> >>> > "incarnation" of a counter.
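As an illustrative sketch of the proposed ordering and merging rules (the function names are made up for the example; this is not the proposed patch itself):

```python
# Sketch of the epoch.clock ordering rule and the epoch-aware merge
# applied to the CASSANDRA-2101 sequence.

def newer(a, b):
    """True if version a = (epoch, clock) supersedes version b."""
    return a > b        # tuple comparison: epoch first, then clock

def merge(deltas):
    """Merge counter deltas, ignoring deltas from older epochs."""
    top = max(epoch for epoch, _, _ in deltas)
    return top, sum(d for epoch, _, d in deltas if epoch == top)

# The CASSANDRA-2101 sequence: add 1, delete (which bumps the future
# epoch), add 2. The post-delete add carries epoch 1, so compaction can
# no longer resurrect the pre-delete +1:
deltas = [(0, 100, 1),   # add 1: epoch 0, clock 100
          (1, 300, 2)]   # add 2: epoch 1, clock 300 (after the delete)
print(merge(deltas))     # (1, 2): the counter reads 2, as intended
assert newer((1, 200), (0, 300))   # a newer epoch trumps a larger clock
```

Without epochs, merging the two adds would yield +3 at clock 300 and survive the delete at clock 200; with epochs, the old-epoch delta drops out.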
> >>> >
> >>> > code change should be mostly localized to CounterColumn.reconcile(),
> >>> > although, if an update does not find an existing entry in the
> >>> > memtable, we need to go to the sstable to fetch any possible epoch
> >>> > number. so, compared to the current write path, in the "no
> >>> > replicate-on-write" case we need to add a read of the sstable; but
> >>> > in the "replicate-on-write" case we already read that, so there is
> >>> > no extra time cost. "no replicate-on-write" is not a very useful
> >>> > setup in reality anyway.
> >>> >
> >>> > does this sound like a feasible way?  if this works, expiring
> >>> > counters should also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com
> >>
> >
> >
>
