Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of watcherfr@gmail.com designates
 209.85.210.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <8BC12F8F-4259-4767-BD5F-EB2FC0539090@thelastpickle.com>
References: 
 <CAHwsXYn5rfBKQbJKxFA5Ae=ciPU3GSBHvgnfC8wviZJU2ah1uw@mail.gmail.com>
	<CAO5xsd2wa1DF1CsKjPyU+1N6pQD9ny_sGi7HWYmFfNkQ=VCS4A@mail.gmail.com>
	<CAHwsXY=zaw9NpQKaUnKxDa1cU0bReYNC88ZeKvq0eLFRNYz7pg@mail.gmail.com>
	<CAHwsXYkMjH5JPdrR-K1KFtj2yeCWmQayCTbjN2+N9At8XMUakA@mail.gmail.com>
	<CAHwsXYmCdxYKWCCWjDvoP7Qs7ahur1fd8aDbW-gA3_ecj38W9Q@mail.gmail.com>
	<CAHwsXYnAniLMqoLnJGWK35xeH3kUa6seoSwsXYDT+YOgmO3r9A@mail.gmail.com>
	<8BC12F8F-4259-4767-BD5F-EB2FC0539090@thelastpickle.com>
Date: Wed, 17 Aug 2011 08:49:15 +0200
Message-ID: 
 <CAHwsXYn37Qu3nTrctP8Z+ow6pazm1o7fGThJnVpg=w3bBR0fBQ@mail.gmail.com>
Subject: Re: Unable to repair a node
From: Philippe <watcherfr@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=000e0cd242e606f1b704aaade541

--000e0cd242e606f1b704aaade541
Content-Type: text/plain; charset=ISO-8859-1

>
> ctrl-c will not stop the repair.
>
Ok, so that's why I've been seeing logs of repairs on other CFs

> That's probably the 2280 issue. Data from all CF's is streamed over
>
Ah, I get it now.

Thanks


>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2011, at 10:09 AM, Philippe wrote:
>
> One last thought : what happens when you ctrl-c a nodetool repair ? Does it
> stop the repair on the server ? If not, then I think I have multiple repairs
> still running. Is there any way to check this ?
>
> Thanks
>
> 2011/8/16 Philippe <watcherfr@gmail.com>
>
>> Even more interesting behavior : a repair on a CF has consequences on
>> other CFs. I didn't expect that.
>>
>> There are no writes being issued to the cluster yet the logs indicate
>> that
>>
>>    - SSTableReader has opened dozens and dozens of files, most of them
>>    unrelated to the CF being repaired
>>    - compactions are taking place continuously on CFs other than the one
>>    being repaired, even CFs in other keyspaces
>>    - I see "Sending AEService tree" messages for CF not being repaired.
>>
>>
>> After a very long time, I got some AES messages indicating that streaming
>> from node C had finished and then many minutes after that node B. And yet
>> the pending stream count on node B hasn't changed.
>>
>> The *-data.db files for the CF being repaired are about 70MB on-disk.
>>
>> Maybe when a stream is fully received on node B, netstats indicates that
>> no streams are pending but since they are not acknowledged, node A doesn't ?
>>
>>
>> 2011/8/16 Philippe <watcherfr@gmail.com>
>>
>>> I'm still trying different stuff. Here are my latest findings, maybe
>>> someone will find them useful:
>>>
>>>    - I have been able to repair some small column families by issuing a
>>>    repair [KS] [CF]. When testing on the ring with no writes at all, it still
>>>    takes about 2 repairs to get "consistent" logs for all AES requests.
>>>    - Launching a repair on the smallest CF of the biggest KS has
>>>    triggered a flurry of compactions and streams. Some of those streams are for
>>>    other CFs in that keyspace !?
>>>    - During repairs (one at a time cluster-wide), I get 25-50% io waits
>>>    & 35%-50% cpu usage on a 6 core SATA-disk setup
>>>
>>> What is surprising to me (bug?) is that netstats shows me streams going
>>> from node A to node B at 0% progress. But netstats on node B doesn't show me
>>> any streams coming in. I'm thinking that repairs may be never ending and
>>> that may be messing up my compactions hence the huge pile up of compactions
>>> until the disk fills up.
>>> I know there's an issue related to failed streams & repairs, could I be
>>> hitting it ?
>>>
>>> Thanks
>>>
>>> 2011/8/14 Philippe <watcherfr@gmail.com>
>>>
>>>> @Teijo : thanks for the procedure, I hope I won't have to do that
>>>>
>>>> Peter, I'll answer inline. Thanks for the detailed answer.
>>>>
>>>>
>>>>> > the number of SSTables for some keyspaces goes dramatically up (from
>>>>> > 3 or 4 to several dozens).
>>>>>
>>>>> Typically with a long running compaction, such as that triggered by
>>>>> repair, that's what happens as flushed memtables accumulate. In
>>>>> particular for memtables with frequent flushes.
>>>>>
>>>>> Are you running with concurrent compaction enabled?
>>>>>
>>>> Yes, it is enabled. On my 0.8 cluster, cassandra.yaml has this (it's
>>>> commented). BTW, I have 6 cores on each server.
>>>> #concurrent_compactors: 1
>>>>
>>>>> > the commit log keeps increasing in size, I'm at 4.3G now, it went up
>>>>> > to 40G when the compaction was throttled at 16MB/s. On the other nodes
>>>>> > it's around 1GB at most
>>>>> Hmmmm. The Commit Log should not be retained longer than what is
>>>>> required for memtables to be flushed. Is it possible you have had an
>>>>> out-of-disk condition and flushing has stalled? Are you seeing flushes
>>>>> happening in the log?
>>>>>
>>>> No I don't believe there was ever an out of disk.  Yes it is flushing
>>>> for the first couple of hours.
>>>> Then, when repair seems locked up, my log is mostly filled with lines
>>>> such as this
>>>> INFO [ScheduledTasks:1] 2011-08-14 23:15:47,267 StatusLogger.java (line 88) [My_Keyspace].[My_Columnfamily]           45,105541               50/50               20/20
>>>>  Why is that ?
>>>>
>>>>> > the data directory is bigger than on the other nodes. I've seen it go
>>>>> > up to 480GB when the compaction was throttled at 16MB/s
>>>>> How much data are you writing? Is it at all plausible that the huge
>>>>> spike is a reflection of lots of overwriting writes that aren't being
>>>>> compacted?
>>>>>
>>>> No, there's no bulk loading going on at the moment and I'm pretty sure
>>>> there wasn't when it spiked up to that load.
>>>> I've never measured the load because it's a mix of counter increments
>>>> and new counters all the time. It's not that much though.
>>>>
>>>>
>>>>> Normally when disk space spikes with repair it's due to other nodes
>>>>> streaming huge amounts (maybe all of their data) to the node, leading
>>>>> to a temporary spike. But if your "real" size is expected to be 60,
>>>>> 480 sounds excessive. Are you sure other nodes aren't running repairs
>>>>> at the same time and magnifying each other's data load spikes?
>>>>>
>>>> Yes, the two other nodes were running repairs. I had them scheduled at 8
>>>> hour intervals but they must have started.
>>>> When data is streamed from one to another, does that data go into the
>>>> commit log as a regular write ?
>>>>  How much of a negative impact can that have on the repair going on on
>>>> this node ?
>>>>
>>>>> > What's even weirder is that currently I have 9 compactions running but
>>>>> > CPU is throttled at 1/number of cores half the time (while > 80% the
>>>>> > rest of the time). Could this be because other repairs are happening
>>>>> > in the ring ?
>>>>> You mean compaction is taking less CPU than it "should"?
>>>>>
>>>> Yes
>>>>
>>>>
>>>>> No, this should not be due to other nodes repairing. However it sounds
>>>>> to me like you are bottlenecking on I/O and the repairs and
>>>>>
>>>> Yes, I/O is really high on the node right now. Around 50% I/O waits.
>>>>
>>>>
>>>>> compactions are probably proceeding extremely slowly, probably being
>>>>> completely drowned out by live traffic (which is probably having an
>>>>> abnormally high performance impact due to data size spike).
>>>>>
>>>> Yes, the live traffic is 3 to 10 times slower during repair. Ouch... I
>>>> hope I won't have to do this too often while in production !
>>>>
>>>>
>>>>>
>>>>> What's your read concurrency configured on the node? What does "iostat
>>>>> -x -k 1" show in the average queue size column?
>>>>
>>>> Average queue size on the disk (RAID-1 + separate LVM volumes for data,
>>>> commit log, caches, logs) varies between 2 and 90. I'd say the average is
>>>> around 30-40. Very high variation.
>>>>
>>>>
>>>>> Is "nodetool -h
>>>>> localhost tpstats" showing that ReadStage is usually "full" (@ your
>>>>> limit)?
>>>>>
>>>> No backlog at all in tpstats
>>>>
>>>> I've figured out how AES is logging its actions and it looks like it
>>>> really is going through every CF in every keyspace and doing a tree request
>>>> for every token range
>>>> So it really looks like it's just taking forever to compact stuff as
>>>> it's repairing.
>>>> I saw in another email that repairing was taking 2-3 min/GB... it looks
>>>> like a lot more for my ring. Does anybody else have numbers ?
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>
>
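
A rough back-of-the-envelope check on the 2-3 min/GB repair figure mentioned
in the quoted thread above. This is only a minimal sketch: it assumes repair
time scales linearly with on-disk data, and that the "real" size of 60 that
Peter mentions is in GB (the thread does not state the unit). The function
name and its defaults are illustrative, not part of any Cassandra tooling.

```python
def estimate_repair_minutes(data_gb, min_per_gb=(2.0, 3.0)):
    """Return a (low, high) estimate of repair duration in minutes.

    data_gb    -- on-disk data size in GB (assumed unit)
    min_per_gb -- (low, high) repair rate in minutes per GB; the default is
                  the 2-3 min/GB figure reported elsewhere on the list
    """
    low_rate, high_rate = min_per_gb
    return data_gb * low_rate, data_gb * high_rate


if __name__ == "__main__":
    # 60 is the expected "real" data size quoted in the thread (assumed GB).
    low, high = estimate_repair_minutes(60)
    print(f"Estimated repair time: {low / 60:.1f} to {high / 60:.1f} hours")
```

At that rate a 60 GB node would take roughly 2 to 3 hours, which gives a
baseline to compare against the much longer repairs described in this thread.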

--000e0cd242e606f1b704aaade541--