From: Yan Chunlu <springrider@gmail.com>
Date: Fri, 29 Jul 2011 12:01:17 +0800
Subject: Re: how to solve one node is in heavy load in unbalanced cluster
To: user@cassandra.apache.org

Adding new nodes seems to have added more pressure to the cluster? How about your data size?
>> >> >> problem is the node3 is in heavy-load currently, and the entire >> cluster slow down if I start doing node repair. I have to >> disablegossip and disablethrift to stop the repair. >> >> only cassandra running on that server and I have no idea what it was >> doing. the cpu load is about 20+ currently. compcationstats and >> netstats shows it was not doing anything. >> >> I have change client to not to connect to node3, but still, it seems >> in heavy load and io utils is 100%. >> >> >> the log seems normal(although not sure what about the "Dropped read >> message" thing): >> >> INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving >> 2563726360 used; max is 4248829952 >> WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms >> INFO 13:21:38,560 Pool Name Active Pending >> INFO 13:21:38,560 ReadStage 8 7555 >> INFO 13:21:38,561 RequestResponseStage 0 0 >> INFO 13:21:38,561 ReadRepairStage 0 0 >> >> >> >> is there anyway to tell what node3 was doing? or at least is there any >> way to make it not slowdown the whole cluster? >> > > > > -- > Frank Duan > aiMatch > frank@aimatch.com > c: 703.869.9951 > www.aiMatch.com > > --0015174bef2e88520204a92d56f6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable add new nodes seems added more pressure =A0to the cluster? =A0how about you= r data size?

On Fri, Jul 29, 2011 at 4:16 AM, Frank Duan <frank@aimatch.com> wrote:

"Dropped read message" might be an indicator of a capacity issue. We experienced a similar issue with 0.7.6.

We ended up adding two extra nodes and physically rebooting the offending node(s).

The entire cluster then calmed down.

On Thu, Jul 28, 2011 at 2:24 PM, Yan Chunlu <springrider@gmail.com> wrote:

I have three nodes and RF=3. Here is the current ring:


Address  Status  State   Load      Owns    Token
                                           84944475733633104818662955375549269696
node1    Up      Normal  15.32 GB  81.09%  52773518586096316348543097376923124102
node2    Up      Normal  22.51 GB  10.48%  70597222385644499881390884416714081360
node3    Up      Normal  56.1 GB    8.43%  84944475733633104818662955375549269696
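(A rough sketch of the arithmetic behind the "Owns" column, assuming the default RandomPartitioner with a token space of 0 to 2**127: each node owns the arc between the previous node's token and its own, wrapping around the ring.)

    # Recompute the "Owns" column from the tokens in the ring output above.
    # Assumes the default RandomPartitioner (token space 0..2**127).
    RING = 2 ** 127
    tokens = {
        "node1": 52773518586096316348543097376923124102,
        "node2": 70597222385644499881390884416714081360,
        "node3": 84944475733633104818662955375549269696,
    }

    ordered = sorted(tokens.items(), key=lambda kv: kv[1])
    for i, (name, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]   # wraps to the last node when i == 0
        owned = (token - prev_token) % RING
        print("%s owns %.2f%%" % (name, 100.0 * owned / RING))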


It is very unbalanced and I would like to re-balance it using "nodetool move" ASAP. Unfortunately I haven't run node repair for a long time.

Aaron suggested it's better to run node repair on every node before re-balancing.
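(A minimal sketch of how evenly spaced target tokens for the three nodes could be derived before running "nodetool move", again assuming the default RandomPartitioner; host names are placeholders, and moves are normally applied one node at a time.)

    # Evenly spaced target tokens for a 3-node ring on RandomPartitioner.
    RING = 2 ** 127
    NUM_NODES = 3

    targets = [i * RING // NUM_NODES for i in range(NUM_NODES)]

    for host, token in zip(["node1", "node2", "node3"], targets):
        # e.g. prints: nodetool -h node1 move 0
        print("nodetool -h %s move %d" % (host, token))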


The problem is that node3 is currently under heavy load, and the entire cluster slows down if I start a node repair. I had to disablegossip and disablethrift to stop the repair.

Only Cassandra is running on that server and I have no idea what it is doing. The CPU load is about 20+ at the moment. compactionstats and netstats show it is not doing anything.

I have changed the client to no longer connect to node3, but it still seems to be under heavy load and I/O utilization is at 100%.


The log seems normal (although I'm not sure about the "Dropped read message" thing):

 INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving 2563726360 used; max is 4248829952
 WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms
 INFO 13:21:38,560 Pool Name                    Active   Pending
 INFO 13:21:38,560 ReadStage                         8      7555
 INFO 13:21:38,561 RequestResponseStage              0         0
 INFO 13:21:38,561 ReadRepairStage                   0         0



Is there any way to tell what node3 is doing? Or at least, is there any way to keep it from slowing down the whole cluster?
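(One rough way to see what node3 is busy with, sketched below: poll "nodetool tpstats" and flag thread pools with a large Pending backlog. The host name is a placeholder and the exact tpstats column layout may vary between versions.)

    import subprocess
    import time

    HOST = "node3"            # placeholder host name
    PENDING_THRESHOLD = 1000  # flag pools with a bigger backlog than this

    # Take a dozen samples, five seconds apart.
    for _ in range(12):
        out = subprocess.check_output(["nodetool", "-h", HOST, "tpstats"])
        for line in out.decode().splitlines():
            parts = line.split()
            # tpstats rows look roughly like: "<PoolName> <Active> <Pending> ..."
            if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
                if int(parts[2]) > PENDING_THRESHOLD:
                    print("%s backlog: %s pending, %s active" % (parts[0], parts[2], parts[1]))
        time.sleep(5)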



--
Frank Duan
aiMatch
frank@aimatch.com
c: 703.869.9951
www.aiMatch.com

