incubator-cassandra-user mailing list archives

From Andras Szerdahelyi <andras.szerdahe...@ignitionone.com>
Subject Re: 33million hinted handoffs from nowhere
Date Mon, 25 Mar 2013 12:41:25 GMT
Thanks again!

nodetool gossipinfo correctly lists only the existing nodes.

- the last change to ring topology was months ago
- I started the problem node with -Dcassandra.load_ring_state=false and
observed no unusual behaviour (this is with hinted handoff OFF; with
hinted handoff ON I see the same behaviour as before)
- tried to assassinate these endpoints but got UnknownHostExceptions for
all 3
- tried to remove the tokens but got java.lang.UnsupportedOperationException:
Token not found, for all of these

I have an update, however. I changed which node I use to coordinate
mutations and it's happening elsewhere too, with the same tokens.
I'm clueless as to what could have caused my ring to end up in such an
inconsistent state. How can there be pending hints for endpoints
gossipinfo does not know about?
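
One way to make the mismatch concrete is to compare what the
HintedHandoffManager MBean reports against the tokens actually in the ring.
The following is a minimal Python sketch, assuming the three values returned
by listEndpointsPendingHints() are ring tokens. The ring tokens and the
pending-hint values are copied from the message quoted below; the function
name orphaned_hint_targets is made up for illustration and is not part of
Cassandra.

```python
# Tokens of the nine nodes shown by nodetool ring (from the quoted message).
RING_TOKENS = {
    0,
    100,
    200,
    300,
    400,
    56713727820156410577229101238628035542,
    85070591730234615865843651857942052864,
    85070591730234615865843651857942052964,
    113427455640312821154458202477256070785,
}

# Values listed by the HintedHandoffManager MBean (from the quoted message).
PENDING_HINT_TOKENS = [
    166860289390734216023086131251507064403,
    143927757573010354572009627285182898319,
    24295500190543334543807902779534181373,
]


def orphaned_hint_targets(pending, ring_tokens):
    """Return the pending-hint tokens that no current ring member holds."""
    return [token for token in pending if token not in ring_tokens]


orphans = orphaned_hint_targets(PENDING_HINT_TOKENS, RING_TOKENS)
for token in orphans:
    print("hints pending for a token that is not in the ring:", token)
```

All three pending-hint tokens come back as orphans, which is consistent with
the "Token not found" errors above.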

Regards,
Andras


On 21/03/13 17:56, "aaron morton" <aaron@thelastpickle.com> wrote:

>Take a look at nodetool gossipinfo; it will tell you which nodes the node
>thinks are around.
>
>If you can see something in gossip that should not be there you have a
>few choices.
>
>* if it's less than 3 days since a change to ring topology wait and see
>if C* sorts it out.
>* try restarting nodes with -Dcassandra.load_ring_state=false as a JVM opt
>in cassandra-env.sh. This may not work because when the node restarts
>others will tell it the bad info
>* try the unsafeAssassinateEndpoint() call on the Gossiper MBean via JMX
>https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/GossiperMBean.java#L28
>
>Cheers
> 
>-----------------
>Aaron Morton
>Freelance Cassandra Consultant
>New Zealand
>
>@aaronmorton
>http://www.thelastpickle.com
>
>On 20/03/2013, at 11:10 PM, Andras Szerdahelyi
><andras.szerdahelyi@ignitionone.com> wrote:
>
>> Thanks, Aaron.
>> 
>> I re-enabled hinted handoff and noted the following:
>> 	- no host is marked down in nodetool ring
>> 	- no host is logged as down or dead in the logs
>> 	- no "started hinted handoff for.." is logged
>> 	- the hinted handoff manager MBean lists pending hints to ..
>>(drumroll) 3 non-existent nodes?
>> 
>> Here's my ring
>> 
>> Note: Ownership information does not include topology, please specify a
>>keyspace. 
>> Address      DC                     Rack   Status  State   Load       Owns    Token
>>                                                                               113427455640312821154458202477256070785
>> XX.XX.1.113  ione-us-atl            rack1  Up      Normal  382.08 GB  33.33%  0
>> XX.XX.31.10  ione-us-lvg            rack1  Up      Normal  266.04 GB  0.00%   100
>> XX.XX.0.71   ione-be-bru            rack1  Up      Normal  85.86 GB   0.00%   200
>> XX.XX.2.86   ione-analytics-us-atl  rack1  Up      Normal  153.6 GB   0.00%   300
>> XX.XX.1.45   ione-us-atl-ssd        rack1  Up      Normal  296.72 GB  0.00%   400
>> XX.XX.2.85   ione-analytics-us-atl  rack1  Up      Normal  100.3 GB   33.33%  56713727820156410577229101238628035542
>> XX.XX.1.204  ione-us-atl            rack1  Up      Normal  341.55 GB  16.67%  85070591730234615865843651857942052864
>> XX.XX.11     ione-us-lvg            rack1  Up      Normal  320.22 GB  0.00%   85070591730234615865843651857942052964
>> XX.XX.2.87   ione-analytics-us-atl  rack1  Up      Normal  166.48 GB  16.67%  113427455640312821154458202477256070785
>> 
>> And these are nodes pending hints according to the Mbean
>> 
>> 166860289390734216023086131251507064403
>> 143927757573010354572009627285182898319
>> 24295500190543334543807902779534181373
>> 
>> Err.. Unbalanced ring? OpsCenter says otherwise ("OpsCenter has
>>detected that the token ranges are evenly distributed across the nodes
>>in each data center. Load rebalancing is not necessary at this time.")
>> 
>> I appreciate your help so far! In the meantime hinted handoff is OFF
>>because my mutation TP can't keep up with this traffic, not to mention
>>compaction.
>> 
>> Thanks,
>> Andras
>> 
>> ps: all nodes are cassandra-1.1.6-dse-p1
>> 
>> From: aaron morton <aaron@thelastpickle.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Monday 18 March 2013 17:51
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: 33million hinted handoffs from nowhere
>> 
>> You can check which nodes hints are being held for using the JMX API.
>>Look for the org.apache.cassandra.db:type=HintedHandoffManager MBean and
>>call the listEndpointsPendingHints() function.
>> 
>> There are two points where hints may be stored: if the node is down
>>when the request started, or if the node timed out and did not return
>>before rpc_timeout. To check for the first, look for log lines about a
>>node being "dead" on the coordinator. To check for the second, look for
>>dropped messages on the other nodes. These will be logged, or you can use
>>nodetool tpstats to look for them.
>> 
>> Cheers
>>   
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 15/03/2013, at 2:30 AM, Andras Szerdahelyi
>><andras.szerdahelyi@ignitionone.com> wrote:
>> 
>>> ( The previous letter was sent prematurely, sorry. )
>>> 
>>> This node is the only node being written to, but the CFs being written
>>>replicate to almost all of the other nodes.
>>> My understanding is that hinted handoff is mutations kept around on
>>>the coordinator node, to be replayed when the target node re-appears on
>>>the ring. All my nodes are up and, again, no hinted handoff is logged on
>>>the node itself.
>>> 
>>> Thanks!
>>> Andras
>>> 
>>> From: Andras Szerdahelyi <andras.szerdahelyi@ignitionone.com>
>>> Date: Thursday 14 March 2013 14:25
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: 33million hinted handoffs from nowhere
>>> 
>>> Hi list,
>>> 
>>> I am experiencing seemingly uncontrollable and unexplained growth of
>>>my HintedHandoff CF on a single node. Unexplained because there are no
>>>hinted handoffs being logged on the node, uncontrollable because I see
>>>33 million inserts in cfstats and the size of the sstables is over 10
>>>gigs, all within an hour of uptime.
>>> 
>>> 
>>> I have done the following to try and reproduce this:
>>> 
>>> - shut down my cluster
>>> - on all nodes: remove sstables from the HintsColumnFamily data dir
>>> - on all nodes: remove commit logs
>>> - start all nodes but the one that's showing this problem
>>> - nothing is writing to any of the nodes. There is no hinted handoff
>>>going on anywhere
>>> - bring back the node in question last
>>> - few seconds after boot:
>>> 
>>>                 Column Family: HintsColumnFamily
>>>                 SSTable count: 1
>>>                 Space used (live): 44946532
>>>                 Space used (total): 44946532
>>>                 Number of Keys (estimate): 256
>>>                 Memtable Columns Count: 17840
>>>                 Memtable Data Size: 17569909
>>>                 Memtable Switch Count: 2
>>>                 Read Count: 0
>>>                 Read Latency: NaN ms.
>>>                 Write Count: 184836
>>>                 Write Latency: 0.668 ms.
>>>                 Pending Tasks: 0
>>>                 Bloom Filter False Postives: 0
>>>                 Bloom Filter False Ratio: 0.00000
>>>                 Bloom Filter Space Used: 16
>>>                 Compacted row minimum size: 20924301
>>>                 Compacted row maximum size: 25109160
>>>                 Compacted row mean size: 25109160
>>> 
>>> 
>>> 
>>> 
>> 
>
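
On the "Unbalanced ring?" question in the quoted message above: the Owns
column can be reproduced by hand, and doing the same calculation per data
center suggests why OpsCenter reports the ring as balanced. Below is a rough
Python sketch, assuming RandomPartitioner (token space 0 to 2**127) and
assuming the fused "ione-analytics-us-atlrack1" and "ione-us-atl-ssdrack1"
columns mean DC ione-analytics-us-atl / ione-us-atl-ssd with rack1. The
helper name ownership is illustrative, not a Cassandra API, and replication
is ignored, as in the nodetool note.

```python
RING_SIZE = 2 ** 127  # size of the RandomPartitioner token space


def ownership(tokens):
    """Return {token: fraction of the ring owned}, ignoring replication."""
    ordered = sorted(set(tokens))
    owned = {}
    for i, token in enumerate(ordered):
        previous = ordered[i - 1]  # index -1 wraps around to the last token
        # A lone token owns the whole ring, hence the "or RING_SIZE".
        width = (token - previous) % RING_SIZE or RING_SIZE
        owned[token] = width / RING_SIZE
    return owned


# (data center, token) pairs taken from the nodetool ring output.
RING = [
    ("ione-us-atl", 0),
    ("ione-us-lvg", 100),
    ("ione-be-bru", 200),
    ("ione-analytics-us-atl", 300),
    ("ione-us-atl-ssd", 400),
    ("ione-analytics-us-atl", 56713727820156410577229101238628035542),
    ("ione-us-atl", 85070591730234615865843651857942052864),
    ("ione-us-lvg", 85070591730234615865843651857942052964),
    ("ione-analytics-us-atl", 113427455640312821154458202477256070785),
]

# Ownership over the whole ring, as in the topology-unaware Owns column.
whole_ring = ownership(token for _, token in RING)

# Ownership computed separately within each data center.
per_dc = {
    dc: ownership(t for d, t in RING if d == dc)
    for dc in {dc for dc, _ in RING}
}

for dc in sorted(per_dc):
    for token, fraction in sorted(per_dc[dc].items()):
        print(f"{dc:<22} {fraction:7.2%} within DC, "
              f"{whole_ring[token]:7.2%} of whole ring, token {token}")
```

Taken over the whole ring, the nodes at tokens 100 to 400 own almost nothing,
which matches the 0.00% entries; taken per data center, each DC's tokens
split its own ring evenly.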

