Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 19914 invoked from network); 20 Sep 2010 18:05:09 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 20 Sep 2010 18:05:09 -0000 Received: (qmail 98430 invoked by uid 500); 20 Sep 2010 18:05:07 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 98383 invoked by uid 500); 20 Sep 2010 18:05:07 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 98375 invoked by uid 99); 20 Sep 2010 18:05:07 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 20 Sep 2010 18:05:07 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of shimi.k@gmail.com designates 74.125.83.172 as permitted sender) Received: from [74.125.83.172] (HELO mail-pv0-f172.google.com) (74.125.83.172) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 20 Sep 2010 18:04:58 +0000 Received: by pvg7 with SMTP id 7so1872471pvg.31 for ; Mon, 20 Sep 2010 11:04:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:content-type; bh=Xtj1Bf++tLfF2D1vllYqFMa5lLFhAZsH7BQn+tHplwY=; b=cMaXNavMdH/bUJLIELFjFpMlQEc4q12yvxnDL+B7KawwZ0jWC/EWSL6+qhUdQQTR0P fdGOgD54zak1+osOpryv/4TaXo/Uvtr8+IynU6RA1bSgyqdMxtuWOIl2FfcK4ogHf+YS ER6N30uf/rNRtplVei99+XDBLzmDMB9zcazzw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=CnClBioio3rDkZYDZLPZr4C2UnXe8TfbGHjM0swNEm/eq8nYXwMlaycun6jCxFq0To 6T5DG+senNE6imHy/KNg9IYxy6npcLNiCz63kPYOIMxXj1oc4QUunMqfkhZ3dT7wo8eI Q4INIQqYtpzK/472M5l1+T03YgE+wQkVyD0TA= MIME-Version: 1.0 Received: by 10.114.80.10 with SMTP id d10mr10318558wab.180.1285005876424; Mon, 20 Sep 2010 11:04:36 -0700 (PDT) Received: by 10.220.80.129 with HTTP; Mon, 20 Sep 2010 11:04:36 -0700 (PDT) In-Reply-To: References: Date: Mon, 20 Sep 2010 20:04:36 +0200 Message-ID: Subject: Re: FatClient Gossip error and some other problems From: shimi To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=0016364581b0cc8da00490b4be13 X-Virus-Checked: Checked by ClamAV on apache.org --0016364581b0cc8da00490b4be13 Content-Type: text/plain; charset=ISO-8859-1 I was patient (although it is hard when you have millions of requests which are not served in time). I was waiting for a long time. There was nothing in the Logs and in JMX. Shimi On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek wrote: > On Mon, Sep 20, 2010 at 09:51, shimi wrote: > > I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter). > > I replaced all of the servers in the cluster (0.6.4) with new ones > (0.6.5). > > My old cluster was unbalanced since I was using Random Partitioner and I > > bootstrapped all the nodes without specifying their tokens. > > > > Since I wanted the the cluster to be balanced I first added all the new > > nodes one after the other (with the right tokens this time) and then I > run > > decommission on all the old ones, one after the other. > > One of the decommissioned nodes began throwing too many open files errors > > while It was decommissioning taking other nodes with him. After the > second > > try I decided to stop it and run removetoken on his token from one of the > > other nodes. After that everything went well except that in the end one > of > > the nodes looked unbalanced. > > > > I decided to run repair on the cluster. What I got is totally unbalanced > > nodes with way to much data then what is suppose to be. each node had > x2-x4 > > more data. > > I run cleanup and all of them except the one which was unbalanced to > begin > > with got back to the size they were suppose to be. > > Now whenever I try to run cleanup on this node I get: > > > > INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java > > (line 339) AntiCompacting ... > > INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) > GC > > for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 > used; > > max is 6552551424 > > INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) > > Pool Name Active Pending > > INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) > > STREAM-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) > > RESPONSE-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) > > ROW-READ-STAGE 8 717 > > INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) > > LB-OPERATIONS 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) > > MISCELLANEOUS-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) > > GMFD 0 2 > > INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) > > CONSISTENCY-MANAGER 0 1 > > INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) > > LB-TARGET 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156) > > ROW-MUTATION-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) > > MESSAGE-STREAMING-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) > > LOAD-BALANCER-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156) > > FLUSH-SORTER-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) > > MEMTABLE-POST-FLUSHER 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) > > AE-SERVICE-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) > > FLUSH-WRITER-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) > > HINTED-HANDOFF-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) > > CompactionManager n/a 0 > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402 > > SSTableDeletingReference.java (line 104) Deleted ... > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727 > > SSTableDeletingReference.java (line 104) Deleted ... > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730 > > SSTableDeletingReference.java (line 104) Deleted ... > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735 > > SSTableDeletingReference.java (line 104) Deleted ... > > > > and after that I saw an increase in the node response time and the number > > ROW-READ-STAGE pending tasks. Since there was no indication that > something > > is wrong or that the node is doing anyuthing (logs ,nodetool and JMX), > the > > only thing that I could have done is to restart the server. > > > > I don't know if this is related but every hour I see this error (I think > it > > is the IP of the machine that I couldn't decommission properly): > > > > INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) > FatClient > > /X.X.X.X has been silent for 3600000ms, removing from gossip > > ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip > error > > java.util.ConcurrentModificationException > > at java.util.Hashtable$Enumerator.next(Hashtable.java:1031) > > at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383) > > at > > org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93) > > at java.util.TimerThread.mainLoop(Timer.java:512) > > at java.util.TimerThread.run(Timer.java:462) > > INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node > > /X.X.X.X is now part of the cluster > > > > Does anyone have any idea how can I cleanup the problematic node? > > You may just need to be patient. Have you tried monitoring the > CompactionManager in jmx to see if it is doing things? > > > Does anyone have any idea how can I get rid of the Gossip error? > > This is CASSANDRA-1494. You can ignore it. > > Gary. > --0016364581b0cc8da00490b4be13 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
I was patient (although it is hard when you have millions = of requests which are not served in time). I was waiting for a long time. T= here was nothing in the Logs and in JMX.

Shimi

On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek <gdusbabek@gmail.com> wrote:<= br>
On Mon, Sep 20, 2010 at 09:51, shimi <= shimi.k@gmail.com> wrote:
> I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).=
> I replaced all of the servers in the cluster (0.6.4) with new ones (0.= 6.5).
> My old cluster was unbalanced since I was using Random Partitioner and= I
> bootstrapped all the nodes without specifying their tokens.
>
> Since I wanted the the cluster to be balanced I first added all the ne= w
> nodes one after the other (with the right tokens this time) and then I= run
> decommission on all the old ones, one after the other.
> One of the decommissioned nodes began throwing too many open files err= ors
> while It was decommissioning taking other nodes with him. After the se= cond
> try I decided to stop it and run removetoken on his token from one of = the
> other nodes. After that everything went well except that in the end on= e of
> the nodes looked unbalanced.
>
> I decided to run repair on the cluster. What I got is totally unbalanc= ed
> nodes with way to much data then what is suppose to be. each node had = x2-x4
> more data.
> I run cleanup and all of them except the one which was unbalanced to b= egin
> with got back to the size they were suppose to be.
> Now whenever I try to run cleanup on this node I get:
>
> =A0INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.= java
> (line 339) AntiCompacting ...
> =A0INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line= 129) GC
> for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520= used;
> max is 6552551424
> =A0INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line= 150)
> Pool Name=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 Act= ive=A0=A0 Pending
> =A0INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line= 156)
> STREAM-STAGE=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line= 156)
> RESPONSE-STAGE=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line= 156)
> ROW-READ-STAGE=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 8=A0=A0=A0=A0=A0=A0 717
> =A0INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line= 156)
> LB-OPERATIONS=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line= 156)
> MISCELLANEOUS-POOL=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0= =A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line= 156)
> GMFD=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 2
> =A0INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line= 156)
> CONSISTENCY-MANAGER=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0= =A0=A0=A0=A0=A0=A0 1
> =A0INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line= 156)
> LB-TARGET=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line= 156)
> ROW-MUTATION-STAGE=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0= =A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line= 156)
> MESSAGE-STREAMING-POOL=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0= =A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line= 156)
> LOAD-BALANCER-STAGE=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0= =A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line= 156)
> FLUSH-SORTER-POOL=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0= =A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line= 156)
> MEMTABLE-POST-FLUSHER=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0= =A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line= 156)
> AE-SERVICE-STAGE=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0= =A0=A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line= 156)
> FLUSH-WRITER-POOL=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0= =A0=A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line= 156)
> HINTED-HANDOFF-POOL=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0= =A0=A0=A0=A0=A0=A0 0
> =A0INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line= 161)
> CompactionManager=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 n/a=A0=A0= =A0=A0=A0=A0=A0=A0 0
> =A0INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
> SSTableDeletingReference.java (line 104) Deleted ...
> =A0INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
> SSTableDeletingReference.java (line 104) Deleted ...
> =A0INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
> SSTableDeletingReference.java (line 104) Deleted ...
> =A0INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
> SSTableDeletingReference.java (line 104) Deleted ...
>
> and after that I saw an increase in the node response time and the num= ber
> ROW-READ-STAGE pending tasks. Since there was no indication that somet= hing
> is wrong or that the node is doing anyuthing (logs ,nodetool and JMX),= the
> only thing that I could have done is to restart the server.
>
> I don't know if this is related but every hour I see this error (I= think it
> is the IP of the machine that I couldn't decommission properly): >
> =A0INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) Fat= Client
> /X.X.X.X has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip= error
> java.util.ConcurrentModificationException
> =A0=A0=A0 at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)<= br> > =A0=A0=A0 at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.= java:383)
> =A0=A0=A0 at
> org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93= )
> =A0=A0=A0 at java.util.TimerThread.mainLoop(Timer.java:512)
> =A0=A0=A0 at java.util.TimerThread.run(Timer.java:462)
> =A0INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node=
> /X.X.X.X is now part of the cluster
>
> Does anyone have any idea how can I cleanup the problematic node?

You may just need to be patient. =A0Have you tried monitoring t= he
CompactionManager in jmx to see if it is doing things?

> Does anyone have any idea how can I get rid of the Gossip error?

This is CASSANDRA-1494. You can ignore it.

Gary.

--0016364581b0cc8da00490b4be13--