cassandra-user mailing list archives

From shimi <shim...@gmail.com>
Subject Re: FatClient Gossip error and some other problems
Date Mon, 20 Sep 2010 18:04:36 GMT
I was patient (although that is hard when millions of requests are not being
served in time). I waited a long time, and there was nothing in the logs or
in JMX.

Shimi

On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek <gdusbabek@gmail.com> wrote:

> On Mon, Sep 20, 2010 at 09:51, shimi <shimi.k@gmail.com> wrote:
> > I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
> > I replaced all of the servers in the cluster (0.6.4) with new ones
> > (0.6.5).
> > My old cluster was unbalanced since I was using Random Partitioner and I
> > bootstrapped all the nodes without specifying their tokens.
> >
> > Since I wanted the cluster to be balanced, I first added all the new
> > nodes one after the other (with the right tokens this time) and then I
> > ran decommission on all the old ones, one after the other.
> > One of the decommissioned nodes began throwing "too many open files"
> > errors while it was decommissioning, taking other nodes down with it.
> > After the second try I decided to stop it and run removetoken on its
> > token from one of the other nodes. After that everything went well,
> > except that in the end one of the nodes looked unbalanced.
> >
> > I decided to run repair on the cluster. What I got was totally unbalanced
> > nodes with far more data than they are supposed to have: each node had
> > 2x-4x more data.
> > I ran cleanup, and all of them except the one that was unbalanced to
> > begin with got back to the size they were supposed to be.
> > Now whenever I try to run cleanup on this node I get:
> >
> >  INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java
> > (line 339) AntiCompacting ...
> >  INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129)
> > GC for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520
> > used; max is 6552551424
> >  INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150)
> > Pool Name                    Active   Pending
> >  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> > STREAM-STAGE                      0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> > RESPONSE-STAGE                    0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156)
> > ROW-READ-STAGE                    8       717
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> > LB-OPERATIONS                     0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> > MISCELLANEOUS-POOL                0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> > GMFD                              0         2
> >  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> > CONSISTENCY-MANAGER               0         1
> >  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> > LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156)
> > ROW-MUTATION-STAGE                0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> > LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156)
> > FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> > MEMTABLE-POST-FLUSHER             0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> > AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> > FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> > HINTED-HANDOFF-POOL               0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161)
> > CompactionManager               n/a         0
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
> > SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
> > SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
> > SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
> > SSTableDeletingReference.java (line 104) Deleted ...
> >
> > and after that I saw an increase in the node's response time and in the
> > number of ROW-READ-STAGE pending tasks. Since there was no indication
> > that something was wrong or that the node was doing anything (in the
> > logs, nodetool, or JMX), the only thing I could do was restart the
> > server.
> >
> > I don't know if this is related, but every hour I see this error (I think
> > it is the IP of the machine that I couldn't decommission properly):
> >
> >  INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402)
> > FatClient /X.X.X.X has been silent for 3600000ms, removing from gossip
> > ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip
> > error
> > java.util.ConcurrentModificationException
> >     at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
> >     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
> >     at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> >  INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node
> > /X.X.X.X is now part of the cluster
> >
> > Does anyone have any idea how I can clean up the problematic node?
>
> You may just need to be patient.  Have you tried monitoring the
> CompactionManager in jmx to see if it is doing things?
>
> > Does anyone have any idea how I can get rid of the Gossip error?
>
> This is CASSANDRA-1494. You can ignore it.
>
> Gary.
>
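
The quoted message mentions bootstrapping the new nodes "with the right tokens" so that the ring is balanced under Random Partitioner. The thread does not say how those tokens were chosen. The following is only a minimal sketch, assuming the common convention of spacing tokens evenly over a ring of size 2**127; the function name `balanced_tokens` is hypothetical, and the sketch ignores any per-datacenter placement concerns.

```python
def balanced_tokens(node_count, ring_size=2**127):
    """Return evenly spaced tokens for node_count nodes.

    Assumption: the partitioner's token space is 0 .. ring_size,
    with ring_size = 2**127 as commonly used for Random Partitioner.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    # Integer arithmetic keeps the tokens exact for very large values.
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Six nodes, as in the cluster described in the thread.
    for index, token in enumerate(balanced_tokens(6)):
        print(f"node {index}: {token}")
```

Each printed value would be assigned as the initial token of one node before it is bootstrapped.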

Mime
View raw message
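
The GCInspector output quoted in the thread lists each thread pool with its active and pending task counts, and the only sign of trouble is the ROW-READ-STAGE backlog. As a rough illustration of how such a dump could be scanned, here is a minimal sketch that picks out pools with pending tasks. The function name `pools_with_backlog` and the regular expression are assumptions made for this example, based only on the row layout visible in the quoted log; they are not part of Cassandra or its tools.

```python
import re

# A pool row ends with: NAME  ACTIVE  PENDING, where ACTIVE may be "n/a".
# The pattern is anchored at the end of the line so that log prefixes and
# mail quote markers ("> > ") in front of the row are ignored.
POOL_ROW = re.compile(r"(?<![\w-])([A-Za-z][A-Za-z-]*)\s+(\d+|n/a)\s+(\d+)\s*$")


def pools_with_backlog(lines, min_pending=1):
    """Map pool name -> (active, pending) for pools with a backlog."""
    backlog = {}
    for line in lines:
        match = POOL_ROW.search(line)
        if match is None:
            continue  # not a pool row (header, GC summary, other log line)
        name, active, pending = match.groups()
        if int(pending) >= min_pending:
            active_count = None if active == "n/a" else int(active)
            backlog[name] = (active_count, int(pending))
    return backlog


if __name__ == "__main__":
    sample = [
        "> > Pool Name                    Active   Pending",
        "> > STREAM-STAGE                      0         0",
        "> > ROW-READ-STAGE                    8       717",
        "> > GMFD                              0         2",
        "> > CompactionManager               n/a         0",
    ]
    print(pools_with_backlog(sample))
```

Raising `min_pending` filters out pools with only a handful of queued tasks, such as GMFD in the quoted dump.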