Subject: Re: getting status of long running repair
From: Bill Au <bill.w.au@gmail.com>
To: user@cassandra.apache.org
Date: Wed, 9 May 2012 08:49:45 -0400

I am running 1.0.8. Two data centers with 8 machines in each DC. Nodes are
all up while the repair is running. No dropped Mutations/Messages. I do see
HintedHandoff messages.

Bill

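[Not part of the original message: a minimal shell sketch of the checks
discussed in this thread (node status, dropped message counts, hinted
handoff activity). It assumes nodetool is on the PATH and pointed at the
default JMX port; the log path and grep pattern below are placeholders and
vary by install and Cassandra version.]

#!/bin/sh
# Quick health check: ring status, thread pools / dropped counts, and
# recent hinted handoff log lines. HOST and LOG defaults are illustrative.
HOST=${1:-localhost}
LOG=${2:-/var/log/cassandra/system.log}

echo "== nodetool ring (are all nodes Up?) =="
nodetool -h "$HOST" ring

echo "== nodetool tpstats (thread pools and dropped message counts) =="
nodetool -h "$HOST" tpstats

echo "== recent hinted handoff log lines (pattern is a guess) =="
grep -i "hinted handoff" "$LOG" | tail -20
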
On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2win@gmail.com> wrote:

> What is the version you are using? Is it a multi-DC setup? Are you seeing
> a lot of dropped Mutations/Messages? Are the nodes going up and down all
> the time while the repair is running?
>
> Regards,
> </VJ>
>
>
> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w.au@gmail.com> wrote:
>
>> There are no error messages in my log.
>>
>> I ended up restarting all the nodes in my cluster. After that I was able
>> to run repair successfully on one of the nodes. It took about 40 minutes.
>> Feeling lucky, I ran repair on another node and it is stuck again.
>>
>> tpstats shows 1 active and 1 pending AntiEntropySessions. netstats and
>> compactionstats show no activity. I took a close look at the log file; it
>> shows that the node requested Merkle trees from 4 nodes (including
>> itself). It actually received 3 of those Merkle trees. It looks like it
>> is stuck waiting for that last one. I checked the node the request was
>> sent to, and there isn't anything in its log about repair. So it looks
>> like the Merkle tree request has gotten lost somehow. It has been 8 hours
>> since the repair was issued and it is still stuck. I am going to let it
>> run a bit longer to see if it will eventually finish.
>>
>> I have observed that if I restart all the nodes, I would be able to run
>> repair successfully on a single node. I have done that twice already.
>> But after that, all repairs hang. Since we are supposed to run repair
>> periodically, having to restart all nodes before running repair on each
>> node isn't really viable for us.
>>
>> Bill
>>
>>
>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aaron@thelastpickle.com> wrote:
>>
>>> When you look in the logs, please let me know if you see this error…
>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>
>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>> nodetool netstats for the streaming, and this to check for streaming
>>> progress:
>>>
>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 &&
>>> nodetool -h localhost netstats); done
>>>
>>> Or use DataStax OpsCenter where possible:
>>> http://www.datastax.com/products/opscenter
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>
>>> Check the log files for warnings or errors. They may indicate why your
>>> repair failed.
>>>
>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w.au@gmail.com> wrote:
>>>
>>>> I restarted the nodes and then restarted the repair. It is still
>>>> hanging like before. Do I keep repeating until the repair actually
>>>> finishes?
>>>>
>>>> Bill
>>>>
>>>>
>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rcoli@palominodb.com> wrote:
>>>>
>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w.au@gmail.com> wrote:
>>>>> > I know repair may take a long time to run. I am running repair on a
>>>>> > node with about 15 GB of data and it is taking more than 24 hours.
>>>>> > Is that normal? Is there any way to get the status of the repair?
>>>>> > tpstats does show 2 active and 2 pending AntiEntropySessions. But
>>>>> > netstats and compactionstats show no activity.
>>>>>
>>>>> As indicated by various recent threads to this effect, many versions
>>>>> of Cassandra (including the current 1.0.x release) contain bugs which
>>>>> sometimes prevent repair from completing. The other threads suggest
>>>>> that some of these bugs result in the state you are in now, where you
>>>>> do not see anything that looks like appropriate activity.
>>>>> Unfortunately the only solution offered on these other threads is the
>>>>> one I will now offer, which is to restart the participating nodes and
>>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>>> bugs (which doesn't mean they don't exist, of course), so you might
>>>>> want to file one. :)
>>>>>
>>>>> =Rob
>>>>>
>>>>> --
>>>>> =Robert Coli
>>>>> AIM&GTALK - rcoli@palominodb.com
>>>>> YAHOO - rcoli.palominob
>>>>> SKYPE - rcoli_palominodb
>>>
>>> --
>>> Ben Coverston
>>> DataStax -- The Apache Cassandra Company
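
[Not part of the original thread: a slightly expanded sketch of the
streaming-progress check Aaron posted above, polling compactionstats for
the Merkle tree (validation) phase and netstats for the streaming phase.
It assumes nodetool is on the PATH; the host and interval defaults are
illustrative.]

#!/bin/sh
# Poll repair progress with timestamps so stalls are easy to spot.
HOST=${1:-localhost}
INTERVAL=${2:-30}

while true; do
    date
    echo "--- compactionstats (Merkle tree / validation phase) ---"
    nodetool -h "$HOST" compactionstats
    echo "--- netstats (streaming phase) ---"
    nodetool -h "$HOST" netstats
    sleep "$INTERVAL"
done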
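
[Also not part of the original thread: a rough sketch of the log check Bill
describes above, looking for repair session and Merkle tree messages on a
node and confirming whether AntiEntropySessions is still active in tpstats.
The log path and grep patterns are assumptions; the exact log wording
differs between Cassandra versions.]

#!/bin/sh
# Look for repair-related log activity and AntiEntropySessions on one node.
LOG=${1:-/var/log/cassandra/system.log}
HOST=${2:-localhost}

echo "== repair / Merkle tree log lines (patterns are guesses) =="
grep -iE "repair|merkle|antientropy" "$LOG" | tail -40

echo "== active/pending AntiEntropySessions =="
nodetool -h "$HOST" tpstats | grep -i antientropy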