Subject: Re: getting status of long running repair
From: Bill Au <bill.w.au@gmail.com>
To: user@cassandra.apache.org
Date: Wed, 9 May 2012 08:49:45 -0400

I am running 1.0.8. Two data centers with 8 machines in each DC. Nodes are
all up while the repair is running. No dropped Mutations/Messages. I do see
HintedHandoff messages.

Bill

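[Not part of the original message: a minimal shell sketch of the checks
discussed in this thread (node status, dropped message counts, hinted
handoff activity). It assumes nodetool is on the PATH and pointed at the
default JMX port; the log path and grep pattern below are placeholders and
vary by install and Cassandra version.]

#!/bin/sh
# Quick health check: ring status, thread pools / dropped counts, and
# recent hinted handoff log lines. HOST and LOG defaults are illustrative.
HOST=${1:-localhost}
LOG=${2:-/var/log/cassandra/system.log}

echo "== nodetool ring (are all nodes Up?) =="
nodetool -h "$HOST" ring

echo "== nodetool tpstats (thread pools and dropped message counts) =="
nodetool -h "$HOST" tpstats

echo "== recent hinted handoff log lines (pattern is a guess) =="
grep -i "hinted handoff" "$LOG" | tail -20
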
On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2win@gmail.com> wrote:

> What is the version you are using? Is it a multi-DC setup? Are you seeing
> a lot of dropped Mutations/Messages? Are the nodes going up and down all
> the time while the repair is running?
>
> Regards,
> </VJ>
>
>
> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w.au@gmail.com> wrote:
>
>> There are no error messages in my log.
>>
>> I ended up restarting all the nodes in my cluster. After that I was able
>> to run repair successfully on one of the nodes. It took about 40 minutes.
>> Feeling lucky, I ran repair on another node and it is stuck again.
>>
>> tpstats shows 1 active and 1 pending AntiEntropySessions. netstats and
>> compactionstats show no activity. I took a close look at the log file; it
>> shows that the node requested Merkle trees from 4 nodes (including
>> itself). It actually received 3 of those Merkle trees. It looks like it
>> is stuck waiting for that last one. I checked the node the request was
>> sent to, and there isn't anything in its log about repair. So it looks
>> like the Merkle tree request has gotten lost somehow. It has been 8 hours
>> since the repair was issued and it is still stuck. I am going to let it
>> run a bit longer to see if it will eventually finish.
>>
>> I have observed that if I restart all the nodes, I would be able to run
>> repair successfully on a single node. I have done that twice already.
>> But after that, all repairs hang. Since we are supposed to run repair
>> periodically, having to restart all nodes before running repair on each
>> node isn't really viable for us.
>>
>> Bill
>>
>>
>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aaron@thelastpickle.com> wrote:
>>
>>> When you look in the logs, please let me know if you see this error…
>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>
>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>> nodetool netstats for the streaming, and this to check for streaming
>>> progress:
>>>
>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 &&
>>> nodetool -h localhost netstats); done
>>>
>>> Or use DataStax OpsCenter where possible:
>>> http://www.datastax.com/products/opscenter
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>
>>> Check the log files for warnings or errors. They may indicate why your
>>> repair failed.
>>>
>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w.au@gmail.com> wrote:
>>>
>>>> I restarted the nodes and then restarted the repair. It is still
>>>> hanging like before. Do I keep repeating until the repair actually
>>>> finishes?
>>>>
>>>> Bill
>>>>
>>>>
>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rcoli@palominodb.com> wrote:
>>>>
>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w.au@gmail.com> wrote:
>>>>> > I know repair may take a long time to run. I am running repair on a
>>>>> > node with about 15 GB of data and it is taking more than 24 hours.
>>>>> > Is that normal? Is there any way to get the status of the repair?
>>>>> > tpstats does show 2 active and 2 pending AntiEntropySessions. But
>>>>> > netstats and compactionstats show no activity.
>>>>>
>>>>> As indicated by various recent threads to this effect, many versions
>>>>> of Cassandra (including the current 1.0.x release) contain bugs which
>>>>> sometimes prevent repair from completing. The other threads suggest
>>>>> that some of these bugs result in the state you are in now, where you
>>>>> do not see anything that looks like appropriate activity.
>>>>> Unfortunately the only solution offered on these other threads is the
>>>>> one I will now offer, which is to restart the participating nodes and
>>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>>> bugs (which doesn't mean they don't exist, of course), so you might
>>>>> want to file one. :)
>>>>>
>>>>> =Rob
>>>>>
>>>>> --
>>>>> =Robert Coli
>>>>> AIM&GTALK - rcoli@palominodb.com
>>>>> YAHOO - rcoli.palominob
>>>>> SKYPE - rcoli_palominodb
>>>
>>> --
>>> Ben Coverston
>>> DataStax -- The Apache Cassandra Company
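
[Not part of the original thread: a slightly expanded sketch of the
streaming-progress check Aaron posted above, polling compactionstats for
the Merkle tree (validation) phase and netstats for the streaming phase.
It assumes nodetool is on the PATH; the host and interval defaults are
illustrative.]

#!/bin/sh
# Poll repair progress with timestamps so stalls are easy to spot.
HOST=${1:-localhost}
INTERVAL=${2:-30}

while true; do
    date
    echo "--- compactionstats (Merkle tree / validation phase) ---"
    nodetool -h "$HOST" compactionstats
    echo "--- netstats (streaming phase) ---"
    nodetool -h "$HOST" netstats
    sleep "$INTERVAL"
done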
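
[Also not part of the original thread: a rough sketch of the log check Bill
describes above, looking for repair session and Merkle tree messages on a
node and confirming whether AntiEntropySessions is still active in tpstats.
The log path and grep patterns are assumptions; the exact log wording
differs between Cassandra versions.]

#!/bin/sh
# Look for repair-related log activity and AntiEntropySessions on one node.
LOG=${1:-/var/log/cassandra/system.log}
HOST=${2:-localhost}

echo "== repair / Merkle tree log lines (patterns are guesses) =="
grep -iE "repair|merkle|antientropy" "$LOG" | tail -40

echo "== active/pending AntiEntropySessions =="
nodetool -h "$HOST" tpstats | grep -i antientropy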