incubator-cassandra-user mailing list archives

From Faraaz Sareshwala <fsareshw...@quantcast.com>
Subject Re: Large number of pending gossip stage tasks in nodetool tpstats
Date Thu, 08 Aug 2013 02:30:53 GMT
Thanks Aaron. The node that was behaving this way was a production node, so I had to take
some drastic measures to get it back into a good state. It's no longer behaving this way
after wiping the system tables and having Cassandra resync the schema from the other nodes.
In hindsight, maybe I could have gotten away with a nodetool resetlocalschema. Since the
node has been restored to a working state, sadly I can no longer run commands on it to
investigate.

When the node was in this hosed state, I did check nodetool gossipinfo. The bad node had the
correct schema hash; the same as the rest of the nodes in the cluster. However, it thought
every other node in the cluster had another schema hash, most likely the older one everyone
migrated from.

This issue occurred again today on three machines, so I suspect it will keep happening.
Typically I see it when our entire datacenter updates its configuration and the nodes
restart over the course of an hour. All nodes point to the same list of seeds, but the
restart order across that hour is random. I'm not sure if this information helps at all.

Are there any specific things I should look for when it does occur again?

Thank you,
Faraaz

On Aug 7, 2013, at 7:23 PM, "Aaron Morton" <aaron@thelastpickle.com> wrote:

>> When looking at nodetool
>> gossipinfo, I notice that this node has updated to the latest schema hash, but
>> that it thinks other nodes in the cluster are on the older version.
> What does describe cluster in cassandra-cli say? It will let you know if there are
> multiple schema versions in the cluster.
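For reference, when the schema is split, the output of describe cluster looks roughly like the following (the snitch, partitioner, version UUIDs, and addresses here are made up for illustration):

```
[default@unknown] describe cluster;
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
   Schema versions:
        75eece10-bf48-11e2-86e0-d1b34bbbc2c2: [10.0.0.1, 10.0.0.2]
        5a54ebd0-bd90-11e2-82b6-cd1f2614e2a4: [10.0.0.3]
```

A healthy cluster reports a single schema version with every node listed under it.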
> 
> Can you include the output from nodetool gossipinfo ? 
> 
> You may also get some value from increasing the log level for
> org.apache.cassandra.gms.Gossiper to DEBUG so you can see what's going on. It's
> unusual for only the gossip pool to back up. If there were issues with GC taking
> CPU we would expect to see it across the board.
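Assuming the stock log4j setup shipped with Cassandra 1.2, the level for that one logger can be raised by adding a line like this to conf/log4j-server.properties (adjust to your own logging config if you've customized it):

```properties
# Enable DEBUG logging for the gossiper only, leaving the root logger at INFO.
log4j.logger.org.apache.cassandra.gms.Gossiper=DEBUG
```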
> 
> Cheers
> 
> 
> 
> -----------------
> Aaron Morton
> Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 7/08/2013, at 7:52 AM, Faraaz Sareshwala <fsareshwala@quantcast.com> wrote:
> 
>> I'm running cassandra-1.2.8 in a cluster with 45 nodes across three racks. All
>> nodes are well behaved except one. Whenever I start this node, it starts
>> churning CPU. Running nodetool tpstats, I notice that the number of pending
>> gossip stage tasks is constantly increasing [1]. When looking at nodetool
>> gossipinfo, I notice that this node has updated to the latest schema hash, but
>> that it thinks other nodes in the cluster are on the older version. I've tried
>> to drain, decommission, wipe node data, bootstrap, and repair the node. However,
>> the node just started doing the same thing again.
>> 
>> Has anyone run into this issue before? Can anyone provide any insight into why
>> this node is the only one in the cluster having problems? Are there any easy
>> fixes?
>> 
>> Thank you,
>> Faraaz
>> 
>> [1] $ /cassandra/bin/nodetool tpstats
>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>> ReadStage                         0         0              8         0                 0
>> RequestResponseStage              0         0          49198         0                 0
>> MutationStage                     0         0         224286         0                 0
>> ReadRepairStage                   0         0              0         0                 0
>> ReplicateOnWriteStage             0         0              0         0                 0
>> GossipStage                       1      2213             18         0                 0
>> AntiEntropyStage                  0         0              0         0                 0
>> MigrationStage                    0         0             72         0                 0
>> MemtablePostFlusher               0         0            102         0                 0
>> FlushWriter                       0         0             99         0                 0
>> MiscStage                         0         0              0         0                 0
>> commitlog_archiver                0         0              0         0                 0
>> InternalResponseStage             0         0             19         0                 0
>> HintedHandoff                     0         0              2         0                 0
>> 
>> Message type           Dropped
>> RANGE_SLICE                  0
>> READ_REPAIR                  0
>> BINARY                       0
>> READ                         0
>> MUTATION                     0
>> _TRACE                       0
>> REQUEST_RESPONSE             0
> 
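The kind of backlog shown in the tpstats output above is easy to spot mechanically. Here is a small hypothetical helper (not part of Cassandra's tooling) that parses pool lines from `nodetool tpstats` output and flags pools with a large pending count, such as GossipStage in this thread:

```python
# Hypothetical helper: detect backlogged thread pools in `nodetool tpstats`
# output. Pool lines have six whitespace-separated fields:
#   Name  Active  Pending  Completed  Blocked  All-time-blocked

def parse_tpstats(text):
    """Return {pool_name: pending_count} for each pool line found."""
    pending = {}
    for line in text.splitlines():
        parts = line.split()
        # A pool line has exactly 6 fields and a numeric Active column.
        if len(parts) == 6 and parts[1].isdigit():
            pending[parts[0]] = int(parts[2])
    return pending

def backlogged(pending, threshold=100):
    """Return only the pools whose pending count meets the threshold."""
    return {name: n for name, n in pending.items() if n >= threshold}

sample = """\
GossipStage                       1      2213             18         0                 0
ReadStage                         0         0              8         0                 0
"""
print(backlogged(parse_tpstats(sample)))  # {'GossipStage': 2213}
```

Running this periodically against `nodetool tpstats` output would surface a gossip backlog like the one above before it is noticed as CPU churn.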
