From: Yan Chunlu <springrider@gmail.com>
Date: Fri, 29 Jul 2011 12:01:17 +0800
Subject: Re: how to solve one node is in heavy load in unbalanced cluster
To: user@cassandra.apache.org

Adding new nodes seems to have added more pressure to the cluster? How about your data size?
>> >> >> problem is the node3 is in heavy-load currently, and the entire >> cluster slow down if I start doing node repair. I have to >> disablegossip and disablethrift to stop the repair. >> >> only cassandra running on that server and I have no idea what it was >> doing. the cpu load is about 20+ currently. compcationstats and >> netstats shows it was not doing anything. >> >> I have change client to not to connect to node3, but still, it seems >> in heavy load and io utils is 100%. >> >> >> the log seems normal(although not sure what about the "Dropped read >> message" thing): >> >> INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving >> 2563726360 used; max is 4248829952 >> WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms >> INFO 13:21:38,560 Pool Name Active Pending >> INFO 13:21:38,560 ReadStage 8 7555 >> INFO 13:21:38,561 RequestResponseStage 0 0 >> INFO 13:21:38,561 ReadRepairStage 0 0 >> >> >> >> is there anyway to tell what node3 was doing? or at least is there any >> way to make it not slowdown the whole cluster? >> > > > > -- > Frank Duan > aiMatch > frank@aimatch.com > c: 703.869.9951 > www.aiMatch.com > > --0015174bef2e88520204a92d56f6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable add new nodes seems added more pressure =A0to the cluster? =A0how about you= r data size?

On Fri, Jul 29, 2011 at 4:16 AM, Frank Duan <frank@aimatch.com> wrote:

"Dropped read message" might be an indicator of a capacity issue. We experienced a similar issue with 0.7.6.

We ended up adding two extra nodes and physically rebooting the offending node(s).

The entire cluster then calmed down.

On Thu, Jul 28, 2011 at 2:24 PM, Yan Chunlu <springrider@gmail.com> wrote:

I have three nodes and RF=3. Here is the current ring:


Address  Status  State   Load      Owns    Token
                                           84944475733633104818662955375549269696
node1    Up      Normal  15.32 GB  81.09%  52773518586096316348543097376923124102
node2    Up      Normal  22.51 GB  10.48%  70597222385644499881390884416714081360
node3    Up      Normal  56.1 GB    8.43%  84944475733633104818662955375549269696
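(A rough sketch of the arithmetic behind the "Owns" column, assuming the default RandomPartitioner with a token space of 0 to 2**127: each node owns the arc between the previous node's token and its own, wrapping around the ring.)

    # Recompute the "Owns" column from the tokens in the ring output above.
    # Assumes the default RandomPartitioner (token space 0..2**127).
    RING = 2 ** 127
    tokens = {
        "node1": 52773518586096316348543097376923124102,
        "node2": 70597222385644499881390884416714081360,
        "node3": 84944475733633104818662955375549269696,
    }

    ordered = sorted(tokens.items(), key=lambda kv: kv[1])
    for i, (name, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]   # wraps to the last node when i == 0
        owned = (token - prev_token) % RING
        print("%s owns %.2f%%" % (name, 100.0 * owned / RING))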


It is very unbalanced and I would like to re-balance it using "nodetool move" ASAP. Unfortunately I haven't run node repair for a long time.

Aaron suggested it's better to run node repair on every node before re-balancing.
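(A minimal sketch of how evenly spaced target tokens for the three nodes could be derived before running "nodetool move", again assuming the default RandomPartitioner; host names are placeholders, and moves are normally applied one node at a time.)

    # Evenly spaced target tokens for a 3-node ring on RandomPartitioner.
    RING = 2 ** 127
    NUM_NODES = 3

    targets = [i * RING // NUM_NODES for i in range(NUM_NODES)]

    for host, token in zip(["node1", "node2", "node3"], targets):
        # e.g. prints: nodetool -h node1 move 0
        print("nodetool -h %s move %d" % (host, token))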


The problem is that node3 is currently under heavy load, and the entire cluster slows down if I start a node repair. I had to disablegossip and disablethrift to stop the repair.

Only Cassandra is running on that server and I have no idea what it is doing. The CPU load is about 20+ at the moment. compactionstats and netstats show it is not doing anything.

I have changed the client to no longer connect to node3, but it still seems to be under heavy load and I/O utilization is at 100%.


The log seems normal (although I'm not sure about the "Dropped read message" thing):

 INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving 2563726360 used; max is 4248829952
 WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms
 INFO 13:21:38,560 Pool Name                    Active   Pending
 INFO 13:21:38,560 ReadStage                         8      7555
 INFO 13:21:38,561 RequestResponseStage              0         0
 INFO 13:21:38,561 ReadRepairStage                   0         0



Is there any way to tell what node3 is doing? Or at least, is there any way to keep it from slowing down the whole cluster?
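(One rough way to see what node3 is busy with, sketched below: poll "nodetool tpstats" and flag thread pools with a large Pending backlog. The host name is a placeholder and the exact tpstats column layout may vary between versions.)

    import subprocess
    import time

    HOST = "node3"            # placeholder host name
    PENDING_THRESHOLD = 1000  # flag pools with a bigger backlog than this

    # Take a dozen samples, five seconds apart.
    for _ in range(12):
        out = subprocess.check_output(["nodetool", "-h", HOST, "tpstats"])
        for line in out.decode().splitlines():
            parts = line.split()
            # tpstats rows look roughly like: "<PoolName> <Active> <Pending> ..."
            if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
                if int(parts[2]) > PENDING_THRESHOLD:
                    print("%s backlog: %s pending, %s active" % (parts[0], parts[2], parts[1]))
        time.sleep(5)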



--
Frank Duan
aiMatch
frank@aimatch.com
c: 703.869.9951
www.aiMatch.com

