Subject: Re: Repair Process Taking too long
From: Zhu Han <schumi.han@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 14 Apr 2012 14:54:13 +0800

On Sat, Apr 14, 2012 at 1:57 PM, Igor wrote:

> Hi!
>
> What is the difference between 'repair' and 'repair -pr'? Does a simple
> repair touch all token ranges (for all nodes), while -pr touches only the
> range for which the given node is responsible?
>

-pr only touches the primary range of the node. If you execute -pr against
every node in the replica groups, then all ranges are repaired.
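As a rough, untested sketch (host names and the keyspace are only
placeholders), covering every primary range from an admin box would look
something like:

    # repair only the primary range owned by each node, one node at a time
    for host in cass-node1 cass-node2 cass-node3; do
        nodetool -h "$host" repair -pr my_keyspace
    done

You can check which token (and therefore which primary range) each node owns
with 'nodetool ring' first.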
>
> On 04/12/2012 05:59 PM, Sylvain Lebresne wrote:
>
>> On Thu, Apr 12, 2012 at 4:06 PM, Frank Ng wrote:
>>
>>> I also noticed that if I use the -pr option, the repair process went down
>>> from 30 hours to 9 hours. Is the -pr option safe to use if I want to run
>>> repair processes in parallel on nodes that are not replication peers?
>>
>> There are pretty much two use cases for repair:
>> 1) rebuilding a node: if, say, a node has lost some data due to a hard
>> drive corruption or the like and you want to rebuild what's missing;
>> 2) the periodic repairs to avoid problems with deleted data coming back
>> from the dead (basically:
>> http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair).
>>
>> In case 1) you want to run 'nodetool repair' (without -pr) against the
>> node to rebuild.
>> In case 2) (which I suspect is the case you're talking about now), you
>> *want* to use 'nodetool repair -pr' on *every* node of the cluster; that's
>> the most efficient way to do it. The only reason not to use -pr in this
>> case would be that it's not available because you're using an old version
>> of Cassandra. And yes, it is safe to run with -pr in parallel on nodes
>> that are not replication peers.
>>
>> --
>> Sylvain
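For the periodic case, the usual guideline is that every node gets a
'repair -pr' at least once per gc_grace_seconds (10 days by default). As a
purely illustrative crontab sketch, with a placeholder keyspace and schedule:

    # weekly primary-range repair on this node, Sundays at 02:00
    0 2 * * 0  nodetool repair -pr my_keyspace

Stagger the day or hour per node so the validation compactions do not all
run at the same time.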
>>
>>> thanks
>>>
>>> On Thu, Apr 12, 2012 at 12:06 AM, Frank Ng wrote:
>>>
>>>> Thank you for confirming that the per-node data size is most likely
>>>> causing the long repair process. I have tried a repair on smaller column
>>>> families and it was significantly faster.
>>>>
>>>> On Wed, Apr 11, 2012 at 9:55 PM, aaron morton wrote:
>>>>
>>>>> If you have 1TB of data it will take a long time to repair. Every bit
>>>>> of data has to be read and a hash generated. This is one of the reasons
>>>>> we often suggest that around 300 to 400GB per node is a good load in
>>>>> the general case.
>>>>>
>>>>> Look at nodetool compactionstats. Is there a validation compaction
>>>>> running? If so, it is still building the Merkle hash tree.
>>>>>
>>>>> Look at nodetool netstats. Is it streaming data? If so, all hash trees
>>>>> have been calculated.
>>>>>
>>>>> Cheers
>>>>>
>>>>> -----------------
>>>>> Aaron Morton
>>>>> Freelance Developer
>>>>> @aaronmorton
>>>>> http://www.thelastpickle.com
>>>>>
>>>>> On 12/04/2012, at 2:16 AM, Frank Ng wrote:
>>>>>
>>>>> Can you expand further on your issue? Were you using the Random
>>>>> Partitioner?
>>>>>
>>>>> thanks
>>>>>
>>>>> On Tue, Apr 10, 2012 at 5:35 PM, David Leimbach wrote:
>>>>>
>>>>>> I had this happen when I had really poorly generated tokens for the
>>>>>> ring. Cassandra seems to accept numbers that are too big. You get hot
>>>>>> spots when you think you should be balanced, and repair never ends (I
>>>>>> think there is a 48-hour timeout).
>>>>>>
>>>>>> On Tuesday, April 10, 2012, Frank Ng wrote:
>>>>>>
>>>>>>> I am not using size-tiered compaction.
>>>>>>>
>>>>>>> On Tue, Apr 10, 2012 at 12:56 PM, Jonathan Rhone wrote:
>>>>>>>
>>>>>>>> Data size, number of nodes, RF?
>>>>>>>>
>>>>>>>> Are you using size-tiered compaction on any of the column families
>>>>>>>> that hold a lot of your data?
>>>>>>>>
>>>>>>>> Do your cassandra logs say you are streaming a lot of ranges?
>>>>>>>> zgrep -E "(Performing streaming repair|out of sync)"
>>>>>>>>
>>>>>>>> On Tue, Apr 10, 2012 at 9:45 AM, Igor wrote:
>>>>>>>>
>>>>>>>>> On 04/10/2012 07:16 PM, Frank Ng wrote:
>>>>>>>>>
>>>>>>>>> Short answer - yes.
>>>>>>>>> But you are asking the wrong question.
>>>>>>>>>
>>>>>>>>> I think both processes are taking a while. When it starts up,
>>>>>>>>> netstats and compactionstats show nothing. Anyone out there
>>>>>>>>> successfully using ext3 whose repair processes are faster than this?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 10, 2012 at 10:42 AM, Igor wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> You can check with nodetool which part of the repair process is
>>>>>>>>>> slow - network streams or validation compactions. Use nodetool
>>>>>>>>>> netstats or compactionstats.
>>>>>>>>>>
>>>>>>>>>> On 04/10/2012 05:16 PM, Frank Ng wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I am on Cassandra 1.0.7. My repair processes are taking over 30
>>>>>>>>>>> hours to complete. Is it normal for the repair process to take
>>>>>>>>>>> this long? I wonder if it's because I am using the ext3 file
>>>>>>>>>>> system.
>>>>>>>>>>>
>>>>>>>>>>> thanks
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jonathan Rhone
>>>>>>>> Software Engineer
>>>>>>>>
>>>>>>>> TinyCo
>>>>>>>> 800 Market St., Fl 6
>>>>>>>> San Francisco, CA 94102
>>>>>>>> www.tinyco.com
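To tell which phase a long repair is actually in (per Aaron's and Igor's
suggestions above), something along these lines is usually enough; the host
name and log path are only placeholders and depend on your install:

    # validation compactions still running => Merkle trees still being built
    nodetool -h cass-node1 compactionstats

    # active streams => trees are done and data is being exchanged
    nodetool -h cass-node1 netstats

    # how many ranges were found out of sync / streamed
    zgrep -c -E "(Performing streaming repair|out of sync)" /var/log/cassandra/system.log*

If compactionstats and netstats both show nothing for a long time, check the
logs for the repair session before assuming ext3 is the problem.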