incubator-cassandra-user mailing list archives

From Faraaz Sareshwala <fsareshw...@quantcast.com>
Subject Re: Large number of pending gossip stage tasks in nodetool tpstats
Date Thu, 08 Aug 2013 02:32:52 GMT
And by that last statement, I mean are there any further things I should look for given the
information in my response? I'll definitely look at implementing your suggestions and see
what I can find.

On Aug 7, 2013, at 7:31 PM, "Faraaz Sareshwala" <fsareshwala@quantcast.com> wrote:

> Thanks Aaron. The node that was behaving this way was a production node, so I had to take
> some drastic measures to get it back to doing the right thing. It's no longer behaving this
> way after wiping the system tables and having Cassandra resync the schema from other nodes.
> In hindsight, maybe I could have gotten away with a nodetool resetlocalschema. Since the node
> has been restored to a working state, I sadly can't run commands on it to investigate any
> longer.
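> 
> For anyone who runs into this later, the two routes look roughly like this. The data
> directory path is just where our cassandra.yaml puts things, so treat it as an example:
> 
>   # heavier route: stop the node, remove only the on-disk schema tables from the system
>   # keyspace, then restart so the node pulls the schema back from its peers
>   rm -rf /var/lib/cassandra/data/system/schema_keyspaces \
>          /var/lib/cassandra/data/system/schema_columnfamilies \
>          /var/lib/cassandra/data/system/schema_columns
> 
>   # lighter route, no restart needed: drop the local schema and re-request it over gossip
>   nodetool -h localhost resetlocalschema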
> 
> When the node was in this hosed state, I did check nodetool gossipinfo. The bad node
> had the correct schema hash, the same as the rest of the nodes in the cluster. However, it
> thought every other node in the cluster had another schema hash, most likely the older one
> everyone migrated from.
> 
> This issue occurred again today on three machines, so I feel it may occur again. Typically
> I see it when our entire datacenter updates its configuration and restarts over the course
> of an hour. All nodes point to the same list of seeds, but the restart order is random
> within that hour. I'm not sure if this information helps at all.
> 
> Are there any specific things I should look for when it does occur again?
> 
> Thank you,
> Faraaz
> 
> On Aug 7, 2013, at 7:23 PM, "Aaron Morton" <aaron@thelastpickle.com> wrote:
> 
>>> When looking at nodetool
>>> gossipinfo, I notice that this node has updated to the latest schema hash, but
>>> that it thinks other nodes in the cluster are on the older version.
>> What does describe cluster in cassandra-cli say? It will let you know if there are
>> multiple schema versions in the cluster.
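>> 
>> Roughly (the version UUIDs and addresses here are made up for illustration):
>> 
>>   $ cassandra-cli -h localhost
>>   [default@unknown] describe cluster;
>>   Cluster Information:
>>      Snitch: ...
>>      Partitioner: ...
>>      Schema versions:
>>           a1b2c3d4-...: [10.0.0.1, 10.0.0.2, 10.0.0.3]
>>           e5f6a7b8-...: [10.0.0.4]
>> 
>> More than one entry under "Schema versions" means the cluster disagrees on the schema, and
>> the node lists tell you which nodes are on which version.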
>> 
>> Can you include the output from nodetool gossipinfo?
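>> 
>> On a 45 node cluster it's easiest to pull out just the schema lines, something like:
>> 
>>   nodetool -h localhost gossipinfo | grep -E '^/|SCHEMA'
>> 
>> That leaves one endpoint line per node followed by the SCHEMA value this node has recorded
>> for it, so an endpoint that disagrees with the rest stands out quickly.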
>> 
>> You may also get some value from increasing the log level for org.apache.cassandra.gms.Gossiper
>> to DEBUG so you can see what's going on. It's unusual for only the gossip pool to back up.
>> If there were issues with GC taking CPU, we would expect to see it across the board.
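>> 
>> On 1.2 that should just be a one-line addition to conf/log4j-server.properties:
>> 
>>   log4j.logger.org.apache.cassandra.gms.Gossiper=DEBUG
>> 
>> Gossip at DEBUG is chatty, so turn it back off once you have captured what you need.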
>> 
>> Cheers
>> 
>> 
>> 
>> -----------------
>> Aaron Morton
>> Cassandra Consultant
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 7/08/2013, at 7:52 AM, Faraaz Sareshwala <fsareshwala@quantcast.com> wrote:
>> 
>>> I'm running cassandra-1.2.8 in a cluster with 45 nodes across three racks. All
>>> nodes are well behaved except one. Whenever I start this node, it starts
>>> churning CPU. Running nodetool tpstats, I notice that the number of pending
>>> gossip stage tasks is constantly increasing [1]. When looking at nodetool
>>> gossipinfo, I notice that this node has updated to the latest schema hash, but
>>> that it thinks other nodes in the cluster are on the older version. I've tried
>>> to drain, decommission, wipe node data, bootstrap, and repair the node. However,
>>> the node just started doing the same thing again.
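>>> 
>>> For reference, the commands behind those steps were roughly the following (data directories
>>> are just where our cassandra.yaml puts them):
>>> 
>>>   nodetool -h localhost drain          # flush memtables and stop accepting writes
>>>   nodetool -h localhost decommission   # hand this node's ranges off to the rest of the ring
>>>   # with the process stopped: wipe local state, then start it so the node re-bootstraps
>>>   rm -rf /var/lib/cassandra/data /var/lib/cassandra/commitlog /var/lib/cassandra/saved_caches
>>>   nodetool -h localhost repair         # once the node is back up and joined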
>>> 
>>> Has anyone run into this issue before? Can anyone provide any insight into why
>>> this node is the only one in the cluster having problems? Are there any easy
>>> fixes?
>>> 
>>> Thank you,
>>> Faraaz
>>> 
>>> [1] $ /cassandra/bin/nodetool tpstats
>>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>>> ReadStage                         0         0              8         0                 0
>>> RequestResponseStage              0         0          49198         0                 0
>>> MutationStage                     0         0         224286         0                 0
>>> ReadRepairStage                   0         0              0         0                 0
>>> ReplicateOnWriteStage             0         0              0         0                 0
>>> GossipStage                       1      2213             18         0                 0
>>> AntiEntropyStage                  0         0              0         0                 0
>>> MigrationStage                    0         0             72         0                 0
>>> MemtablePostFlusher               0         0            102         0                 0
>>> FlushWriter                       0         0             99         0                 0
>>> MiscStage                         0         0              0         0                 0
>>> commitlog_archiver                0         0              0         0                 0
>>> InternalResponseStage             0         0             19         0                 0
>>> HintedHandoff                     0         0              2         0                 0
>>> 
>>> Message type           Dropped
>>> RANGE_SLICE                  0
>>> READ_REPAIR                  0
>>> BINARY                       0
>>> READ                         0
>>> MUTATION                     0
>>> _TRACE                       0
>>> REQUEST_RESPONSE             0
>> 
