Subject: Re: Re: Repair question - why is so much data transferred?
From: Yang
To: user@cassandra.apache.org
Date: Thu, 21 Jul 2011 12:31:29 -0700
In-Reply-To: <00151748e728155e0d04a89636fb@google.com>

I have been thinking about the problem of repair for a while.

If we do not consider the need for partition tolerance, then the eventual-consistency approach is probably the ultimate reason for needing repairs. Compared to ZooKeeper/Spinnaker (recent VLDB paper)/Chubby/HBase, those systems only need to bring a node up to date at the *end* of the write history, because everyone's write history forms a prefix of the real history. Dynamo-style systems, by contrast, unnecessarily create many "holes" in history, because any write can be missed; as a result you have to do the expensive scan for repair. In other words, by design, those other systems can find the discrepancies at zero cost, while Dynamo systems need to regenerate the expensive Merkle tree.

I've been thinking about implementing the ZooKeeper protocol for some optional CFs that want HBase-style replication (a single write point/master within each replica set, with the master being leader-elected). This would be similar to Spinnaker, except that we would not actually use ZK: relying on external disconnection notification leaves some rare chance of master conflict, plus the extra component dependency. Given the sending/acking traffic patterns already in Cassandra, it's actually easier to add the ZAB protocol directly. This way, no repair would be needed for such CFs.
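To make the cost asymmetry concrete, here is a minimal sketch of Merkle-tree comparison between two replicas. This is a hypothetical illustration, not Cassandra's actual validation-compaction code: `merkle_levels` and the `replica_a`/`replica_b` dictionaries are invented names. The point it shows is that even when two replicas agree, detecting that fact requires hashing every row to build each tree, whereas a log-prefix system only has to compare log positions.

```python
import hashlib

def merkle_levels(items):
    """Build a Merkle tree over sorted (key, value) pairs.

    Returns the tree as a list of levels, from the leaf hashes
    up to the single root hash at the end.
    """
    level = [hashlib.sha256(f"{k}:{v}".encode()).digest()
             for k, v in sorted(items.items())]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                # pad odd levels by repeating the tail
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

replica_a = {f"key{i}": f"val{i}" for i in range(8)}
replica_b = dict(replica_a)

# In-sync replicas: a single root comparison proves agreement...
assert merkle_levels(replica_a)[-1] == merkle_levels(replica_b)[-1]

# ...but building each tree required hashing every single row first.
replica_b["key3"] = "divergent"           # one missed write: a "hole" in history
assert merkle_levels(replica_a)[-1] != merkle_levels(replica_b)[-1]
```

Once the roots differ, the trees are walked downward to localize the mismatching range, which is why a single divergent row can still implicate a whole range for streaming.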
yang

On Thu, Jul 21, 2011 at 8:43 AM, wrote:
> from ticket 2818:
> "One (reasonably simple) proposition to fix this would be to have repair
> schedule validation compactions across nodes one by one (i.e., one CF/range
> at a time), waiting for all nodes to return their tree before submitting the
> next request. Then on each node, we should make sure that the node will
> start the validation compaction as soon as requested. For that, we probably
> want to have a specific executor for validation compaction"
>
> ... This was the way I thought repair worked.
>
> Anyway, in our case, we only have one CF, so I'm not sure if both issues
> apply to my situation.
>
> Thanks. Looking forward to the release where these 2 things are fixed.
>
> On , Jonathan Ellis wrote:
>> On Thu, Jul 21, 2011 at 9:14 AM, Jonathan Colby
>> <jonathan.colby@gmail.com> wrote:
>>> I regularly run repair on my cassandra cluster. However, I have often seen
>>> that during the repair operation very large amounts of data are transferred
>>> to other nodes.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-2280
>> https://issues.apache.org/jira/browse/CASSANDRA-2816
>>
>>> My question is, if only some data is out of sync, why are entire Data
>>> files being transferred?
>>
>> Repair streams ranges of files as a unit (which becomes a new file on
>> the target node) rather than using the normal write path.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com