From: Gary Dusbabek
Reply-To: gdusbabek@gmail.com
To: user@cassandra.apache.org
Date: Mon, 20 Sep 2010 11:12:47 -0500
Subject: Re: FatClient Gossip error and some other problems

On Mon, Sep 20, 2010 at 09:51, shimi wrote:
> I have a cluster with 6 nodes in 2 datacenters (3 in each datacenter).
> I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5).
> My old cluster was unbalanced since I was using Random Partitioner and I
> bootstrapped all the nodes without specifying their tokens.
>
> Since I wanted the cluster to be balanced, I first added all the new
> nodes one after the other (with the right tokens this time) and then ran
> decommission on all the old ones, one after the other.
> One of the decommissioned nodes began throwing "too many open files" errors
> while it was decommissioning, taking other nodes down with it. After the
> second try I decided to stop it and run removetoken on its token from one of
> the other nodes. After that everything went well, except that in the end one
> of the nodes looked unbalanced.
>
> I decided to run repair on the cluster. What I got was totally unbalanced
> nodes with far more data than they were supposed to have: each node had
> 2x-4x more data.
> I ran cleanup, and all of them except the one that was unbalanced to begin
> with got back to the size they were supposed to be.
> Now whenever I try to run cleanup on this node I get:
>
>  INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java
> (line 339) AntiCompacting ...
>  INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC
> for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used;
> max is 6552551424
>  INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150)
> Pool Name                   Active   Pending
>  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> STREAM-STAGE                     0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> RESPONSE-STAGE                   0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156)
> ROW-READ-STAGE                   8       717
>  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> LB-OPERATIONS                    0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> MISCELLANEOUS-POOL               0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> GMFD                             0         2
>  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> CONSISTENCY-MANAGER              0         1
>  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> LB-TARGET                        0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156)
> ROW-MUTATION-STAGE               0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> MESSAGE-STREAMING-POOL           0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> LOAD-BALANCER-STAGE              0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156)
> FLUSH-SORTER-POOL                0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> MEMTABLE-POST-FLUSHER            0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> AE-SERVICE-STAGE                 0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> FLUSH-WRITER-POOL                0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> HINTED-HANDOFF-POOL              0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161)
> CompactionManager              n/a         0
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
> SSTableDeletingReference.java (line 104) Deleted ...
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
> SSTableDeletingReference.java (line 104) Deleted ...
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
> SSTableDeletingReference.java (line 104) Deleted ...
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
> SSTableDeletingReference.java (line 104) Deleted ...
>
> and after that I saw an increase in the node's response time and in the
> number of ROW-READ-STAGE pending tasks. Since there was no indication that
> something was wrong or that the node was doing anything (logs, nodetool and
> JMX), the only thing I could do was restart the server.
>
> I don't know if this is related, but every hour I see this error (I think it
> is the IP of the machine that I couldn't decommission properly):
>
>  INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient
> /X.X.X.X has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
>     at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
>     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
>     at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
>  INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node
> /X.X.X.X is now part of the cluster
>
> Does anyone have any idea how I can clean up the problematic node?

You may just need to be patient. Have you tried monitoring the
CompactionManager in JMX to see if it is doing things?

> Does anyone have any idea how I can get rid of the Gossip error?

This is CASSANDRA-1494. You can ignore it.

Gary.
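
P.S. While waiting on the cleanup, it can help to watch the pending counts
in the GCInspector pool dump quoted above. The following is only a rough
sketch, not anything shipped with Cassandra: it assumes the pool rows appear
unwrapped in the server log in the "name, Active, Pending" layout seen in this
thread. The function names, the SAMPLE lines, and the threshold of 100 are all
hypothetical choices made for illustration.

```python
import re

# One pool row as it appears after the "(line NNN)" marker: name, Active, Pending.
# This layout is an assumption taken from the log excerpt quoted in this thread.
POOL_ROW = re.compile(
    r"(?P<name>[A-Za-z][\w-]*)\s+(?P<active>\d+|n/a)\s+(?P<pending>\d+)"
)


def parse_pool_rows(log_lines):
    """Return {pool_name: (active, pending)} for GCInspector pool rows."""
    pools = {}
    for line in log_lines:
        if "GCInspector.java" not in line:
            continue
        # Mail archives often turn padding into non-breaking spaces.
        text = line.replace("\u00a0", " ")
        # Keep only what follows the "(line NNN)" marker.
        tail = text.rpartition(")")[2].strip()
        match = POOL_ROW.fullmatch(tail)
        if match is None:
            continue  # header row, GC summary, or anything else
        active = None if match["active"] == "n/a" else int(match["active"])
        pools[match["name"]] = (active, int(match["pending"]))
    return pools


def backlogged(pools, threshold=100):
    """Names of pools whose pending count exceeds the (hypothetical) threshold."""
    return sorted(
        name for name, (_, pending) in pools.items() if pending > threshold
    )


# Hypothetical unwrapped log lines built from the values quoted in this thread.
SAMPLE = [
    " INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) Pool Name                   Active   Pending",
    " INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) ROW-READ-STAGE                   8       717",
    " INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) GMFD                             0         2",
    " INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) CompactionManager              n/a         0",
]

if __name__ == "__main__":
    print(backlogged(parse_pool_rows(SAMPLE)))
```

With the sample lines, only ROW-READ-STAGE (717 pending) is over the
threshold. If that number keeps growing across successive dumps while
CompactionManager stays at 0 pending, that would suggest reads are piling up
for some reason other than queued compactions.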