Date: Thu, 18 Aug 2011 03:39:27 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Issue Comment Edited] (CASSANDRA-2699) continuous incremental anti-entropy

    [ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086772#comment-13086772 ]

Jonathan Ellis edited comment on CASSANDRA-2699 at 8/18/11 3:39 AM:
--------------------------------------------------------------------

bq. one way to incorporate terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair

I'm not sure this works, because as compaction does its work you will have a lot of sstable turnover. Even if you preserve repaired state where possible (two repaired sstables merging to form an sstable that is also marked repaired), new sstables will "fragment" out as they're compacted under leveldb, "contaminating" whatever they are merged with.

(Note that the problem gets worse if a peer is down, because then (as Peter notes) un-repaired sstables start to pile up. Maybe we can fall back to merkle trees if more than N% of sstables are un-repaired.)

You could create separate sets of levels for "repaired" and "unrepaired" sstables, I suppose. That feels ugly.

You could also keep a "repaired" bloom filter at the row level for partially-repaired sstables. That feels more reasonable to me. (But that brings us back to doing repair by CL.ALL reads rather than by trees of ranges.)

bq. one problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node is repaired that has the key

I don't think this is a blocker -- it just means you still have to run repair against each node, which has always been the case.
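A minimal sketch of the bookkeeping being discussed, in Java with invented names (nothing here is an actual Cassandra class): a compaction result can only keep the "repaired" flag when every input sstable carried it, so a single un-repaired input "contaminates" the output, and a node could fall back to ordinary merkle-tree repair once too large a fraction of its sstables is un-repaired.

{code:java}
import java.util.List;

// Hypothetical per-sstable metadata; names are illustrative only.
final class SSTableRepairMeta
{
    final String name;
    final boolean repaired;

    SSTableRepairMeta(String name, boolean repaired)
    {
        this.name = name;
        this.repaired = repaired;
    }

    // A compaction result stays "repaired" only if every input was repaired;
    // one un-repaired input contaminates the output.
    static boolean compactionResultIsRepaired(List<SSTableRepairMeta> inputs)
    {
        return !inputs.isEmpty() && inputs.stream().allMatch(s -> s.repaired);
    }

    // Fallback heuristic from the comment above: if more than maxUnrepairedPct
    // percent of sstables are un-repaired (e.g. because a peer was down for a
    // while), give up on incremental bookkeeping and run a merkle-tree repair.
    static boolean shouldFallBackToMerkleTrees(List<SSTableRepairMeta> all, double maxUnrepairedPct)
    {
        if (all.isEmpty())
            return false;
        long unrepaired = all.stream().filter(s -> !s.repaired).count();
        return 100.0 * unrepaired / all.size() > maxUnrepairedPct;
    }
}
{code}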
was (Author: jbellis):
bq. one way to incorporate terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair

I'm not sure this works, because as compaction does its work you will have a lot of sstable turnover. Even if you preserve repaired state where possible (two repaired sstables merging to form an sstable that is also marked repaired), new sstables will "fragment" out as they're compacted under leveldb, "contaminating" whatever they are merged with. Note that the problem gets worse if a peer is down, because then un-repaired sstables start to pile up.

You could create separate sets of levels for "repaired" and "unrepaired" sstables, I suppose. That feels ugly.

You could also keep a "repaired" bloom filter at the row level for partially-repaired sstables. That feels more reasonable to me. (But that brings us back to doing repair by CL.ALL reads rather than by trees of ranges.)

bq. one problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node is repaired that has the key

I don't think this is a blocker -- it just means you still have to run repair against each node, which has always been the case.

> continuous incremental anti-entropy
> -----------------------------------
>
>                 Key: CASSANDRA-2699
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2699
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>
> Currently, repair works by periodically running "bulk" jobs that (1)
> perform a validating compaction building up an in-memory merkle tree,
> and (2) stream ring segments as needed according to differences
> indicated by the merkle tree.
> There are some disadvantages to this approach:
> * There is a trade-off between memory usage and the precision of the
>   merkle tree. Less precision means more data streamed relative to
>   what is strictly required.
> * Repair is a periodic "bulk" process that runs for a significant
>   period and, although possibly rate limited like compaction (if on 0.8
>   or with the throttling patch backported), is a divergence in terms of
>   performance characteristics from "normal" operation of the cluster.
> * The impact of imprecision can be huge on a workload dominated by I/O
>   and with cache locality being critical, since you will suddenly
>   transfer lots of data to the target node.
> I propose a process whereby anti-entropy is instead
> incremental and continuous over time. In order to avoid being
> seek-bound one still wants to do work in some form of bursty fashion,
> but the amount of data processed at a time could be sufficiently small
> that the impact on the cluster feels a lot more continuous, and that
> the page cache allows us to avoid re-reading differing data twice.
> Consider a process whereby a node is constantly performing a per-CF
> repair operation for each CF. The current state of the repair process
> is defined by:
> * A starting timestamp of the current iteration through the token
>   range the node is responsible for.
> * A "finger" indicating the current position along the token ring to
>   which iteration has completed.
> This information, other than being in-memory, could periodically (every
> few minutes or so) be stored persistently on disk.
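The quoted description defines the state of this continuous repair process by a starting timestamp plus a "finger" token. A rough sketch, again with invented names rather than anything in Cassandra's code base, of how that per-CF state might be persisted every few minutes so that a restart resumes the sweep instead of cancelling it:

{code:java}
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical persisted repair state for one column family.
final class ContinuousRepairState
{
    final String columnFamily;
    final long iterationStartedAtMillis; // when the current sweep of the ring began
    final BigInteger finger;             // token up to which the sweep has completed

    ContinuousRepairState(String columnFamily, long iterationStartedAtMillis, BigInteger finger)
    {
        this.columnFamily = columnFamily;
        this.iterationStartedAtMillis = iterationStartedAtMillis;
        this.finger = finger;
    }

    // Write the state to disk; called every few minutes.
    void save(Path file) throws IOException
    {
        String line = columnFamily + "," + iterationStartedAtMillis + "," + finger;
        Files.write(file, line.getBytes(StandardCharsets.UTF_8));
    }

    // Read the state back on startup so the sweep resumes from the saved finger.
    static ContinuousRepairState load(Path file) throws IOException
    {
        String[] parts = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split(",");
        return new ContinuousRepairState(parts[0], Long.parseLong(parts[1]), new BigInteger(parts[2]));
    }
}
{code}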
> The finger advances by the node selecting the next small "bit" of the
> ring and doing whatever merkling/hashing/checksumming is necessary on
> that small part, then asking neighbors to do the same, and
> arranging for neighbors to send the node data for mismatching
> ranges. The data would be sent either by way of mutations, like with
> read repair, or by streaming sstables. But it would be small amounts
> of data that will act roughly the same as regular writes from the
> perspective of compaction.
> Some nice properties of this approach:
> * It's "always on"; no periodic sudden effects on cluster performance.
> * Restarting nodes never cancels or breaks anti-entropy.
> * Huge compactions of entire CFs never clog up the compaction queue
>   (not necessarily a non-issue even with concurrent compactions in
>   0.8).
> * Because we're always operating on small chunks, there is never the
>   same kind of trade-off for memory use. A merkle tree or similar
>   could potentially be calculated at a very detailed level. Although
>   the precision from the perspective of reading from disk would likely
>   not matter much if we are in the page cache anyway, very high precision
>   could be *very* useful when doing anti-entropy across data centers
>   on slow links.
> There are devils in the details, like how to select an appropriate ring
> segment given that you don't have knowledge of the data density on
> other nodes. But I feel that the overall idea/process seems very
> promising.
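To make the "finger advances" step concrete, here is a rough sketch of one advance over a small token slice, using entirely hypothetical interfaces (none of this exists in Cassandra): hash the slice locally, ask each neighbor replica for its hash of the same slice, and request the data for any slice that differs.

{code:java}
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical view of a neighbor replica; illustrative only.
interface Neighbor
{
    byte[] hashRange(BigInteger start, BigInteger end);                // neighbor's hash of the slice
    void streamRangeTo(BigInteger start, BigInteger end, String node); // ask it to send that slice
}

final class FingerAdvance
{
    // Process one small slice [finger, finger + step) and return the new finger.
    static BigInteger advance(BigInteger finger,
                              BigInteger step,
                              List<Neighbor> neighbors,
                              String localNode,
                              BiFunction<BigInteger, BigInteger, byte[]> localHash)
    {
        BigInteger end = finger.add(step);
        byte[] mine = localHash.apply(finger, end);
        for (Neighbor neighbor : neighbors)
        {
            // A mismatch means the slice differs between us and the neighbor;
            // for simplicity we just ask the neighbor to send its version, which
            // then acts roughly like regular writes from compaction's perspective.
            if (!Arrays.equals(mine, neighbor.hashRange(finger, end)))
                neighbor.streamRangeTo(finger, end, localNode);
        }
        return end; // new finger position; wrap-around at the end of the ring omitted
    }
}
{code}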