cassandra-commits mailing list archives

From "Yang Yang (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better
Date Mon, 20 Jun 2011 15:16:48 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052007#comment-13052007 ]

Yang Yang edited comment on CASSANDRA-2774 at 6/20/11 3:15 PM:
---------------------------------------------------------------

Thanks a lot, Sylvain, for looking through this.


Two answers to your last comment:

1) "__it is impossible for other nodes to have a version for A's shard that is greater than
what A has__": actually, it is possible. D can hold a higher version (clock) for the A-shard
because A's own A-shard clock was decremented at some earlier point, when the delete wiped it out.
D *did not increment* the clock of the A-shard. In other words, the result "D has a higher
A-shard clock than A" can come about either because "D increments the A-shard clock" (which is
impossible) *or* because "A decrements its A-shard clock", which is possible through the deletion.

2) Regarding your example:
if the 3 operations are in 3 sstables, which are merged in the following order:
+1
+1
Delete

the second +1 will already have an epoch of 1 (through inheritance, as you pointed out),
while the first one has an epoch of 0.
Under the new reconcile() rules (lines 172--178 in the patch), because the second +1 has a different,
higher epoch (timestampOfLastDelete) than the first one, the first +1 will be thrown away.
So in this example, the final result will always be +1.

The central point of the change is that we never look at timestamp() again; we only look
at timestampOfLastDelete (timestamp() is used only
to assign timestampOfLastDelete for delete operations).
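
To make the rule above concrete, here is a minimal sketch of an epoch-based reconcile. It is not the code from counter_delete.diff: the class EpochCounterSketch and its methods add(), delete() and mergeAll() are hypothetical names, and the model assumes a counter fragment is just a (value, timestampOfLastDelete) pair, ignoring shards and node clocks entirely. A fragment with a higher epoch discards one with a lower epoch, and fragments in the same epoch are summed.

```java
/** Hypothetical model of one counter fragment; not Cassandra's actual counter classes. */
final class EpochCounterSketch {
    final long value;                 // count accumulated by this fragment
    final long timestampOfLastDelete; // the "epoch" this fragment belongs to

    private EpochCounterSketch(long value, long timestampOfLastDelete) {
        this.value = value;
        this.timestampOfLastDelete = timestampOfLastDelete;
    }

    /** An increment written while the counter is in the given epoch. */
    static EpochCounterSketch add(long delta, long epoch) {
        return new EpochCounterSketch(delta, epoch);
    }

    /** A delete opens a new epoch with a zero count (assumption: epoch = delete timestamp). */
    static EpochCounterSketch delete(long deleteTimestamp) {
        return new EpochCounterSketch(0, deleteTimestamp);
    }

    /** Higher epoch wins outright; within one epoch the values are summed. */
    static EpochCounterSketch reconcile(EpochCounterSketch a, EpochCounterSketch b) {
        if (a.timestampOfLastDelete != b.timestampOfLastDelete) {
            return a.timestampOfLastDelete > b.timestampOfLastDelete ? a : b;
        }
        return new EpochCounterSketch(a.value + b.value, a.timestampOfLastDelete);
    }

    /** Folds the fragments left to right, as a sequence of sstable merges would. */
    static long mergeAll(EpochCounterSketch... fragments) {
        EpochCounterSketch acc = fragments[0];
        for (int i = 1; i < fragments.length; i++) {
            acc = reconcile(acc, fragments[i]);
        }
        return acc.value;
    }
}
```

With the three operations from the example modelled as add(1, 0), add(1, 1) and delete(1), mergeAll gives the same count in whichever order the fragments are passed, because the epoch-0 fragment loses to any epoch-1 fragment.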


Thanks
Yang

> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> The current Counter does not work with delete, because different merge orders of the sstables
> produce different results. For example:
> add 1
> delete
> add 2
> If the merging happens in the order 1--2, (1,2)--3, the result we see will be 2.
> If the merging is 1--3, (1,3)--2, the result will be 3.
> The issue is that a delete currently cannot separate the adds made before it from the adds made
> after it. A delete is supposed to create a completely new incarnation of the counter, or a
> new "lifetime", or "epoch". The new approach uses the concept of an "epoch number", so that
> each delete bumps up the epoch number. Since each write is replicated (replicate on write
> is almost always enabled in practice; if this is a concern, we could further force ROW in
> the case of a delete), the epoch number is global to a replica set.
> The changes are attached. Existing tests pass; some tests were modified because the semantics
> changed a bit. Some CQL tests do not pass in the original 0.8.0 source; that is not the
> fault of this change.
> See details at http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3CBANLkTikQcgLSNwtT-9HvqpSeoo7SF58SnA@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior; yes, in the case
> of a long network partition the behavior is not ideal, but it is consistent with the definition
> of a logical clock), so that we could have expiring Counters.
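
The order dependence described in the quoted issue can be reproduced with a small timestamp-only model. This is a sketch under assumptions, not Cassandra 0.8.0 code: the class TimestampMergeSketch is hypothetical, a tombstone is compared with the other side purely by timestamp, and two live fragments are summed while keeping the newer timestamp.

```java
/** Hypothetical timestamp-only merge model; not Cassandra's actual classes. */
final class TimestampMergeSketch {
    final long value;        // count carried by this fragment (0 for a tombstone)
    final long timestamp;    // write time of the fragment
    final boolean tombstone; // true if this fragment is a delete

    private TimestampMergeSketch(long value, long timestamp, boolean tombstone) {
        this.value = value;
        this.timestamp = timestamp;
        this.tombstone = tombstone;
    }

    static TimestampMergeSketch add(long delta, long timestamp) {
        return new TimestampMergeSketch(delta, timestamp, false);
    }

    static TimestampMergeSketch delete(long timestamp) {
        return new TimestampMergeSketch(0, timestamp, true);
    }

    static TimestampMergeSketch reconcile(TimestampMergeSketch a, TimestampMergeSketch b) {
        if (a.tombstone || b.tombstone) {
            // Assumption: when a delete is involved, the newer fragment simply wins.
            return a.timestamp >= b.timestamp ? a : b;
        }
        // Two live fragments: sum the counts and keep the newer timestamp.
        return new TimestampMergeSketch(a.value + b.value,
                                        Math.max(a.timestamp, b.timestamp), false);
    }
}
```

Taking add(1, 1), delete(2) and add(2, 3) as the three sstables, merging 1 with 2 first and then with 3 yields 2, while merging 1 with 3 first and then with 2 yields 3: once the two adds are summed, the delete can no longer tell which part of the sum preceded it.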

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
