Subject: Re: Re: Repair question - why is so much data transferred?
From: Yang
To: user@cassandra.apache.org
Date: Thu, 21 Jul 2011 12:31:29 -0700
In-Reply-To: <00151748e728155e0d04a89636fb@google.com>

I have been thinking about the problem of repair for a while.

If we do not consider the need for partition tolerance, then the eventual-consistency approach is probably the ultimate reason for needing repairs. Compared to ZooKeeper/Spinnaker (recent VLDB paper)/Chubby/HBase, those systems only need to bring a node up to date at the *end* of the write history, because everyone's write history forms a prefix of the real history. Dynamo-style systems, by contrast, unnecessarily create many "holes" in history, because any write can be missed; as a result you have to do the expensive scan for repair. In other words, by design, those other systems can find the discrepancies at zero cost, while Dynamo systems need to regenerate the expensive Merkle tree.

I've been thinking about implementing the ZooKeeper protocol for some optional CFs that want HBase-style replication (a single write point/master within each replica set, with the master being leader-elected). This would be similar to Spinnaker, except that we would not actually use ZK: relying on external disconnection notification leaves some rare chance of master conflict, plus the extra component dependency. Given the sending/acking traffic patterns already in Cassandra, it's actually easier to add the ZAB protocol directly. This way, no repair would be needed for such CFs.
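To make the cost asymmetry concrete, here is a minimal sketch of Merkle-tree comparison between two replicas. This is a hypothetical illustration, not Cassandra's actual validation-compaction code: `merkle_levels` and the `replica_a`/`replica_b` dictionaries are invented names. The point it shows is that even when two replicas agree, detecting that fact requires hashing every row to build each tree, whereas a log-prefix system only has to compare log positions.

```python
import hashlib

def merkle_levels(items):
    """Build a Merkle tree over sorted (key, value) pairs.

    Returns the tree as a list of levels, from the leaf hashes
    up to the single root hash at the end.
    """
    level = [hashlib.sha256(f"{k}:{v}".encode()).digest()
             for k, v in sorted(items.items())]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                # pad odd levels by repeating the tail
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

replica_a = {f"key{i}": f"val{i}" for i in range(8)}
replica_b = dict(replica_a)

# In-sync replicas: a single root comparison proves agreement...
assert merkle_levels(replica_a)[-1] == merkle_levels(replica_b)[-1]

# ...but building each tree required hashing every single row first.
replica_b["key3"] = "divergent"           # one missed write: a "hole" in history
assert merkle_levels(replica_a)[-1] != merkle_levels(replica_b)[-1]
```

Once the roots differ, the trees are walked downward to localize the mismatching range, which is why a single divergent row can still implicate a whole range for streaming.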
yang

On Thu, Jul 21, 2011 at 8:43 AM, wrote:
> from ticket 2818:
> "One (reasonably simple) proposition to fix this would be to have repair
> schedule validation compactions across nodes one by one (i.e., one CF/range
> at a time), waiting for all nodes to return their tree before submitting the
> next request. Then on each node, we should make sure that the node will
> start the validation compaction as soon as requested. For that, we probably
> want to have a specific executor for validation compaction"
>
> ... This was the way I thought repair worked.
>
> Anyway, in our case, we only have one CF, so I'm not sure if both issues
> apply to my situation.
>
> Thanks. Looking forward to the release where these 2 things are fixed.
>
> On , Jonathan Ellis wrote:
>> On Thu, Jul 21, 2011 at 9:14 AM, Jonathan Colby
>> <jonathan.colby@gmail.com> wrote:
>>> I regularly run repair on my cassandra cluster. However, I have often seen
>>> that during the repair operation very large amounts of data are transferred
>>> to other nodes.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-2280
>> https://issues.apache.org/jira/browse/CASSANDRA-2816
>>
>>> My question is, if only some data is out of sync, why are entire Data
>>> files being transferred?
>>
>> Repair streams ranges of files as a unit (which becomes a new file on
>> the target node) rather than using the normal write path.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com