Subject: Re: how to solve one node is in heavy load in unbalanced cluster
From: Frank Duan <frank@aimatch.com>
Date: Thu, 28 Jul 2011 16:16:01 -0400
To: user@cassandra.apache.org

"Dropped read=C2=A0message" might be an indicator of capacit= y issue. We experienced the similar issue with 0.7.6.

We ended up adding two extra nodes and physically rebooted the offending node(s).

The entire cluster then calmed down.
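
If you want to confirm it is read capacity before adding hardware, the thread pool stats are the quickest check. A minimal sketch, assuming "node3" resolves and nodetool can reach the node over JMX:

    nodetool -h node3 tpstats

A ReadStage Pending count that keeps climbing alongside those dropped READ warnings usually means the node simply cannot keep up with the read volume.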

On Thu, Jul 28, 2011 at 2:24 PM, Yan Chunlu <springrider@gmail.com> wrote:
I have three nodes and RF=3. Here is the current ring:


Address  Status  State   Load      Owns    Token
                                           84944475733633104818662955375549269696
node1    Up      Normal  15.32 GB  81.09%  52773518586096316348543097376923124102
node2    Up      Normal  22.51 GB  10.48%  70597222385644499881390884416714081360
node3    Up      Normal  56.1 GB    8.43%  84944475733633104818662955375549269696


It is very unbalanced and I would like to rebalance it with "nodetool move" as soon as possible. Unfortunately I haven't run nodetool repair for a long time.

Aaron suggested it's better to run nodetool repair on every node and then rebalance.
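
For reference, this is the move plan I have in mind once the repairs are done. A sketch, assuming the default RandomPartitioner, where the balanced tokens for three nodes are i * 2**127 / 3 for i = 0, 1, 2:

    nodetool -h node1 move 0
    nodetool -h node1 cleanup
    nodetool -h node2 move 56713727820156410577229101238628035242
    nodetool -h node2 cleanup
    nodetool -h node3 move 113427455640312821154458202477256070484
    nodetool -h node3 cleanup

(cleanup after each move so the node drops the ranges it no longer owns)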


The problem is that node3 is under heavy load currently, and the entire cluster slows down if I start a repair. I had to disablegossip and disablethrift to stop the repair.
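
That is, what I ran against node3 (assuming the default JMX settings, since nodetool talks to the node over JMX):

    nodetool -h node3 disablegossip
    nodetool -h node3 disablethrift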

Only Cassandra is running on that server and I have no idea what it is doing. The CPU load is about 20+ currently. compactionstats and netstats show it was not doing anything.
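
For the record, both checks came back empty (same host as above):

    nodetool -h node3 compactionstats
    nodetool -h node3 netstats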

I have changed the client to not connect to node3, but it still seems to be under heavy load and io util is 100%.
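
The io figure is from iostat (sysstat package assumed installed), watching extended stats at five-second intervals:

    iostat -x 5

The %util column for the data disk sits at ~100%.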


The log seems normal (although I am not sure what the "Dropped READ message" thing is about):

 INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving 2563726360 used; max is 4248829952
 WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms
 INFO 13:21:38,560 Pool Name                    Active   Pending
 INFO 13:21:38,560 ReadStage                         8      7555
 INFO 13:21:38,561 RequestResponseStage              0         0
 INFO 13:21:38,561 ReadRepairStage                   0         0
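
One quick way to see how often the drops recur (log path assumed to be the default packaged location):

    grep -c "Dropped" /var/log/cassandra/system.log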



Is there any way to tell what node3 is doing? Or at least, is there any way to keep it from slowing down the whole cluster?



--
Frank Duan
aiMatch
frank@aimatch.com
c: 703.869.9951
www.aiMatch.com
