I normally link to the DataStax article to avoid having to actually write those words :)
http://www.datastax.com/docs/0.8/troubleshooting/index#view-of-ring-differs-between-some-nodes
A
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 23/08/2011, at 7:45 PM, Jonathan Colby wrote:
> I ran into this. I also tried load_ring_state=false, which did not help either. The way I got through this was to stop the entire cluster and start the nodes one by one.
>
> I realize this is not a practical solution for everyone, but if you can afford to stop
the cluster for a few minutes, it's worth a try.
>
>
> On Aug 23, 2011, at 9:26 AM, aaron morton wrote:
>
>> I'm running low on ideas for this one. Anyone else ?
>>
>> If the phantom node is not listed in the ring, other nodes should not be storing
hints for it. You can see what nodes they are storing hints for via JConsole.
>>
>> You can try a rolling restart passing the JVM opt -Dcassandra.load_ring_state=false. However, if the phantom node is being passed around in the gossip state it will probably just come back again.
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
>>
>>> Could this ghost node be causing my hints column family to grow to this size? I also crash after about 24 hours because commit log growth takes up all the drive space. A manual nodetool flush keeps it under control though.
>>>
>>>
>>> Column Family: HintsColumnFamily
>>> SSTable count: 6
>>> Space used (live): 666480352
>>> Space used (total): 666480352
>>> Number of Keys (estimate): 768
>>> Memtable Columns Count: 1043
>>> Memtable Data Size: 461773
>>> Memtable Switch Count: 3
>>> Read Count: 38
>>> Read Latency: 131.289 ms.
>>> Write Count: 582108
>>> Write Latency: 0.019 ms.
>>> Pending Tasks: 0
>>> Key cache capacity: 7
>>> Key cache size: 6
>>> Key cache hit rate: 0.8333333333333334
>>> Row cache: disabled
>>> Compacted row minimum size: 2816160
>>> Compacted row maximum size: 386857368
>>> Compacted row mean size: 120432714
>>>
>>> Is there a way for me to manually remove this dead node?
>>>
>>> -----Original Message-----
>>> From: Bryce Godfrey [mailto:Bryce.Godfrey@azaleos.com]
>>> Sent: Sunday, August 21, 2011 9:09 PM
>>> To: user@cassandra.apache.org
>>> Subject: RE: Completely removing a node from the cluster
>>>
>>> It's been at least 4 days now.
>>>
>>> -----Original Message-----
>>> From: aaron morton [mailto:aaron@thelastpickle.com]
>>> Sent: Sunday, August 21, 2011 3:16 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Completely removing a node from the cluster
>>>
>>> I see the mistake I made about ring: it gets the endpoint list from the same place but uses the tokens to drive the whole process.
>>>
>>> I'm guessing here, don't have time to check all the code. But there is a 3 day
timeout in the gossip system. Not sure if it applies in this case.
>>>
>>> Anyone know ?
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>>>
>>>> Both .2 and .3 show the same thing in the MBean: Unreachable is an empty collection, and the Live node list still has all 3 nodes:
>>>> 192.168.20.2
>>>> 192.168.20.3
>>>> 192.168.20.1
>>>>
>>>> The removetoken was done a few days ago, and I believe the remove was done
from .2
>>>>
>>>> Here is what the ring output looks like; I'm not sure why I get that token on the empty first line either:
>>>> Address       DC          Rack   Status  State   Load      Owns    Token
>>>>                                                                   85070591730234615865843651857942052864
>>>> 192.168.20.2  datacenter1 rack1  Up      Normal  79.53 GB  50.00%  0
>>>> 192.168.20.3  datacenter1 rack1  Up      Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
>>>>
>>>> Yes, both nodes show the same thing when doing a describe cluster, that .1
is unreachable.
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: aaron morton [mailto:aaron@thelastpickle.com]
>>>> Sent: Sunday, August 21, 2011 4:23 AM
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Completely removing a node from the cluster
>>>>
>>>> Unreachable nodes either did not respond to the message or were known to be down and were not sent a message.
>>>> The way the node lists are obtained for the ring command and describe cluster is the same. So it's a bit odd.
>>>>
>>>> Can you connect to JMX and have a look at the o.a.c.db.StorageService MBean? What do the LiveNodes and UnreachableNodes attributes say?
>>>>
>>>> Also how long ago did you remove the token and on which machine? Do both
20.2 and 20.3 think 20.1 is still around ?
>>>>
>>>> Cheers
>>>>
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>>>
>>>>> I'm on 0.8.4
>>>>>
>>>>> I have removed a dead node from the cluster using the nodetool removetoken command, and moved one of the remaining nodes to rebalance the tokens. Everything looks fine when I run nodetool ring now, as it only lists the remaining 2 nodes and they both look fine, each owning 50% of the tokens.
>>>>>
>>>>> However, I can still see it being considered part of the cluster from cassandra-cli (192.168.20.1 being the removed node), and I'm worried that the cluster is still queuing up hints for the node, or that it may cause other issues:
>>>>>
>>>>> Cluster Information:
>>>>> Snitch: org.apache.cassandra.locator.SimpleSnitch
>>>>> Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>>>> Schema versions:
>>>>> dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>>> UNREACHABLE: [192.168.20.1]
>>>>>
>>>>>
>>>>> Do I need to do something else to completely remove this node?
>>>>>
>>>>> Thanks,
>>>>> Bryce
>>>>
>>>
>>
>
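
The symptom in this thread (nodetool ring lists two nodes, while describe cluster in cassandra-cli still reports 192.168.20.1 as UNREACHABLE) can be checked mechanically. The following is only a rough sketch, assuming you have captured both outputs as text; the function names and the parsing rules are made up for illustration and are not part of Cassandra or nodetool. The sample text is copied from the messages above.

```python
import re

# Matches a dotted-quad IPv4 address such as 192.168.20.1.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def ring_addresses(ring_output):
    """Return the addresses that start a row of captured `nodetool ring` output."""
    addresses = set()
    for line in ring_output.splitlines():
        fields = line.split()
        # Header and token-only rows do not start with an IP address.
        if fields and IP_RE.fullmatch(fields[0]):
            addresses.add(fields[0])
    return addresses


def unreachable_addresses(describe_output):
    """Return the addresses on the UNREACHABLE line of `describe cluster` output."""
    for line in describe_output.splitlines():
        if line.strip().startswith("UNREACHABLE:"):
            return set(IP_RE.findall(line))
    return set()


def phantom_nodes(ring_output, describe_output):
    """Addresses reported as unreachable that no longer appear in the ring."""
    return unreachable_addresses(describe_output) - ring_addresses(ring_output)


# Sample text taken from this thread.
RING = """\
Address       DC          Rack   Status  State   Load      Owns    Token
                                                                  85070591730234615865843651857942052864
192.168.20.2  datacenter1 rack1  Up      Normal  79.53 GB  50.00%  0
192.168.20.3  datacenter1 rack1  Up      Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
"""

DESCRIBE = """\
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
        dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
        UNREACHABLE: [192.168.20.1]
"""

print(sorted(phantom_nodes(RING, DESCRIBE)))  # ['192.168.20.1']
```

An empty result would mean every unreachable address still owns a token in the ring, which is an ordinary down node rather than the leftover gossip state discussed here.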