From: Yan Chunlu <springrider@gmail.com>
Date: Mon, 1 Aug 2011 00:51:41 +0800
Subject: Re: how to solve one node is in heavy load in unbalanced cluster
To: user@cassandra.apache.org

any help? thanks!

On Fri, Jul 29, 2011 at 12:05 PM, Yan Chunlu <springrider@gmail.com> wrote:
And by the way, my RF=3 and the other two nodes have much more capacity; why are requests always routed to node3?

Could I do a rebalance now, before node repair?


On Fri, Jul 29, 2011 at 12:01 PM, Yan Chunlu <springrider@gmail.com> wrote:
Doesn't adding new nodes put even more pressure on the cluster? And what is your data size?


On Fri, Jul 29, 2011 at 4:16 AM, Frank Duan <frank@aimatch.com> wrote:
"Dropped read=A0message" migh= t be an indicator of capacity issue. We experienced the similar issue with = 0.7.6.

We ended up adding two extra nodes and physically rebooting the offending node(s).

The entire cluster then calmed down.
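
(For what it's worth, a gentler variant of that restart is to drain the node first so memtables are flushed before the process goes down; a rough sketch, where the host name and init command are placeholders for your setup:

# flush memtables and stop the node accepting new requests
nodetool -h node3 drain
# then restart the Cassandra process however your install manages it
sudo /etc/init.d/cassandra restart
)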

On Thu, Jul 28, 2011 at 2:24 PM, Yan Chunlu <springrider@gmail.com> wrote:
I have three nodes and RF=3. Here is the current ring:


Address   Status  State    Load       Owns     Token
                                               84944475733633104818662955375549269696
node1     Up      Normal   15.32 GB   81.09%   52773518586096316348543097376923124102
node2     Up      Normal   22.51 GB   10.48%   70597222385644499881390884416714081360
node3     Up      Normal   56.1 GB     8.43%   84944475733633104818662955375549269696
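
(For reference, if this ring uses the default RandomPartitioner, evenly spaced tokens for a three-node cluster are i * 2**127 / 3; a quick sketch to compute them:

# balanced tokens for 3 nodes on RandomPartitioner (Python 2 one-liner)
python -c 'for i in range(3): print i * 2**127 / 3'
# 0
# 56713727820156410577229101238628035242
# 113427455640312821154458202477256070485
)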


It is very unbalanced and I would like to re-balance it using "nodetool move" asap. Unfortunately I haven't run node repair for a long time.

Aaron suggested it's better to run node repair on every node and then re-balance.
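
A rough sketch of that sequence with nodetool, assuming the balanced tokens computed above and placeholder host names (repair is heavy, so run it one node at a time):

# 1. repair each node in turn, letting each finish before starting the next
nodetool -h node1 repair
nodetool -h node2 repair
nodetool -h node3 repair
# 2. move each node to its balanced token
nodetool -h node1 move 0
nodetool -h node2 move 56713727820156410577229101238628035242
nodetool -h node3 move 113427455640312821154458202477256070485
# 3. drop the data each node no longer owns
nodetool -h node1 cleanup
nodetool -h node2 cleanup
nodetool -h node3 cleanup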


The problem is that node3 is currently under heavy load, and the entire cluster slows down if I start a node repair. I had to disablegossip and disablethrift to stop the repair.
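
(Those switches and their counterparts, as a sketch with a placeholder host name, assuming your nodetool version has the enable commands:

# stop gossiping to the ring and stop serving Thrift clients
nodetool -h node3 disablegossip
nodetool -h node3 disablethrift
# re-enable once the node has settled down
nodetool -h node3 enablegossip
nodetool -h node3 enablethrift
)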

Only Cassandra is running on that server and I have no idea what it is doing. The CPU load is about 20+ currently. compactionstats and netstats show it is not doing anything.

I have changed the client to not connect to node3, but it still seems to be under heavy load and IO utilization is 100%.
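
A few commands that can help show where the time is going (sketch; iostat comes from the sysstat package):

# thread pool backlog and dropped message counts
nodetool -h node3 tpstats
# compactions or streaming sessions in flight
nodetool -h node3 compactionstats
nodetool -h node3 netstats
# per-disk utilization, sampled every 5 seconds
iostat -x 5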


The log seems normal (although I am not sure about the "Dropped read message" thing):

 INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving 2563726360 used; max is 4248829952
 WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms
 INFO 13:21:38,560 Pool Name                    Active   Pending
 INFO 13:21:38,560 ReadStage                         8      7555
 INFO 13:21:38,561 RequestResponseStage              0         0
 INFO 13:21:38,561 ReadRepairStage                   0         0
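
Given the ParNew line, heap pressure may be worth watching as well; a quick check (sketch, placeholder host name):

# heap used / max as the node sees it, plus load and uptime
nodetool -h node3 info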



Is there any way to tell what node3 is doing? Or at least, is there any way to keep it from slowing down the whole cluster?



--
Frank Duan
aiMatch
frank@aimatch.com
c: 703.869.9951
www.aiMatch.com