Subject: Re: how to solve one node is in heavy load in unbalanced cluster
From: Frank Duan <frank@aimatch.com>
Date: Thu, 28 Jul 2011 16:16:01 -0400
To: user@cassandra.apache.org

"Dropped read=C2=A0message" might be an indicator of capacit= y issue. We experienced the similar issue with 0.7.6.

We ended up adding two extra nodes and physically rebooted the offending node(s).

The entire cluster then calmed down.
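
If you want to confirm it is read capacity before adding hardware, the thread pool stats are the quickest check. A minimal sketch, assuming "node3" resolves and nodetool can reach the node over JMX:

    nodetool -h node3 tpstats

A ReadStage Pending count that keeps climbing alongside those dropped READ warnings usually means the node simply cannot keep up with the read volume.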

On Thu, Jul 28, 2011 at 2:24 PM, Yan Chunlu <springrider@gmail.com> wrote:
I have three nodes and RF=3. Here is the current ring:


Address  Status  State   Load      Owns    Token
                                           84944475733633104818662955375549269696
node1    Up      Normal  15.32 GB  81.09%  52773518586096316348543097376923124102
node2    Up      Normal  22.51 GB  10.48%  70597222385644499881390884416714081360
node3    Up      Normal  56.1 GB    8.43%  84944475733633104818662955375549269696


It is very unbalanced and I would like to rebalance it with "nodetool move" as soon as possible. Unfortunately I haven't run nodetool repair for a long time.

Aaron suggested it's better to run nodetool repair on every node and then rebalance.
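
For reference, this is the move plan I have in mind once the repairs are done. A sketch, assuming the default RandomPartitioner, where the balanced tokens for three nodes are i * 2**127 / 3 for i = 0, 1, 2:

    nodetool -h node1 move 0
    nodetool -h node1 cleanup
    nodetool -h node2 move 56713727820156410577229101238628035242
    nodetool -h node2 cleanup
    nodetool -h node3 move 113427455640312821154458202477256070484
    nodetool -h node3 cleanup

(cleanup after each move so the node drops the ranges it no longer owns)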


The problem is that node3 is under heavy load currently, and the entire cluster slows down if I start a repair. I had to disablegossip and disablethrift to stop the repair.
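
That is, what I ran against node3 (assuming the default JMX settings, since nodetool talks to the node over JMX):

    nodetool -h node3 disablegossip
    nodetool -h node3 disablethrift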

Only Cassandra is running on that server and I have no idea what it is doing. The CPU load is about 20+ currently. compactionstats and netstats show it was not doing anything.
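
For the record, both checks came back empty (same host as above):

    nodetool -h node3 compactionstats
    nodetool -h node3 netstats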

I have changed the client to not connect to node3, but it still seems to be under heavy load and io util is 100%.
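
The io figure is from iostat (sysstat package assumed installed), watching extended stats at five-second intervals:

    iostat -x 5

The %util column for the data disk sits at ~100%.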


The log seems normal (although I am not sure what the "Dropped READ message" thing is about):

 INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving 2563726360 used; max is 4248829952
 WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms
 INFO 13:21:38,560 Pool Name                    Active   Pending
 INFO 13:21:38,560 ReadStage                         8      7555
 INFO 13:21:38,561 RequestResponseStage              0         0
 INFO 13:21:38,561 ReadRepairStage                   0         0
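
One quick way to see how often the drops recur (log path assumed to be the default packaged location):

    grep -c "Dropped" /var/log/cassandra/system.log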



Is there any way to tell what node3 is doing? Or at least, is there any way to keep it from slowing down the whole cluster?



--
Frank Duan
aiMatch
frank@aimatch.com
c: 703.869.9951
www.aiMatch.com
