incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Completely removing a node from the cluster
Date Tue, 23 Aug 2011 08:45:23 GMT
I normally link to the DataStax article to avoid having to actually write those words :)

http://www.datastax.com/docs/0.8/troubleshooting/index#view-of-ring-differs-between-some-nodes
A
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 7:45 PM, Jonathan Colby wrote:

> I ran into this. I also tried load_ring_state=false, which also did not help. The way I got through this was to stop the entire cluster and start the nodes one by one.
> 
> I realize this is not a practical solution for everyone, but if you can afford to stop the cluster for a few minutes, it's worth a try.
> 
> 
> On Aug 23, 2011, at 9:26 AM, aaron morton wrote:
> 
>> I'm running low on ideas for this one. Anyone else ? 
>> 
>> If the phantom node is not listed in the ring, other nodes should not be storing hints for it. You can see what nodes they are storing hints for via JConsole. 
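For anyone who would rather script what JConsole shows here than click through it, below is a minimal sketch that connects over JMX and dumps what the hinted handoff MBean exposes. The node address, JMX port and the MBean name are assumptions for this cluster; browse the org.apache.cassandra.db domain in JConsole to confirm them for your version.

    import javax.management.MBeanAttributeInfo;
    import javax.management.MBeanInfo;
    import javax.management.MBeanOperationInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class HintMBeanProbe {
        public static void main(String[] args) throws Exception {
            // Hypothetical node address; JMX listens on 7199 or 8080 depending on cassandra-env.sh.
            String url = "service:jmx:rmi:///jndi/rmi://192.168.20.2:7199/jmxrmi";
            JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Assumed MBean name for the hinted handoff manager.
                ObjectName hho = new ObjectName("org.apache.cassandra.db:type=HintedHandoffManager");
                MBeanInfo info = mbs.getMBeanInfo(hho);
                for (MBeanAttributeInfo attr : info.getAttributes()) {
                    if (attr.isReadable()) {
                        System.out.println("attribute " + attr.getName() + " = " + mbs.getAttribute(hho, attr.getName()));
                    }
                }
                for (MBeanOperationInfo op : info.getOperations()) {
                    System.out.println("operation " + op.getName());
                }
            } finally {
                connector.close();
            }
        }
    }

If this MBean exposes a per-endpoint hint listing in your version, that is the attribute or operation to check for the phantom node.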
>> 
>> You can try a rolling restart, passing the JVM opt -Dcassandra.load_ring_state=false. However, if the phantom node is being passed around in the gossip state it will probably just come back again. 
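For reference, a flag like this is just a JVM system property. A minimal sketch of the pattern, assuming Cassandra reads it via System.getProperty with a default of true (the exact call site inside Cassandra may differ):

    public class RingStateFlag {
        public static void main(String[] args) {
            // Start with: java -Dcassandra.load_ring_state=false RingStateFlag
            // When the property is absent the flag defaults to true, i.e. the saved
            // ring state is loaded on startup as usual.
            boolean loadRingState = Boolean.parseBoolean(
                    System.getProperty("cassandra.load_ring_state", "true"));
            System.out.println("load saved ring state on startup: " + loadRingState);
        }
    }

In practice the option is typically appended to JVM_OPTS in cassandra-env.sh before the rolling restart and removed again afterwards.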
>> 
>> Cheers
>> 
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
>> 
>>> Could this ghost node be causing my hints column family to grow to this size? The node also crashes after about 24 hours due to commit log growth taking up all the drive space. A manual nodetool flush keeps it under control though.
>>> 
>>> 
>>>              Column Family: HintsColumnFamily
>>>              SSTable count: 6
>>>              Space used (live): 666480352
>>>              Space used (total): 666480352
>>>              Number of Keys (estimate): 768
>>>              Memtable Columns Count: 1043
>>>              Memtable Data Size: 461773
>>>              Memtable Switch Count: 3
>>>              Read Count: 38
>>>              Read Latency: 131.289 ms.
>>>              Write Count: 582108
>>>              Write Latency: 0.019 ms.
>>>              Pending Tasks: 0
>>>              Key cache capacity: 7
>>>              Key cache size: 6
>>>              Key cache hit rate: 0.8333333333333334
>>>              Row cache: disabled
>>>              Compacted row minimum size: 2816160
>>>              Compacted row maximum size: 386857368
>>>              Compacted row mean size: 120432714
>>> 
>>> Is there a way for me to manually remove this dead node?
>>> 
>>> -----Original Message-----
>>> From: Bryce Godfrey [mailto:Bryce.Godfrey@azaleos.com] 
>>> Sent: Sunday, August 21, 2011 9:09 PM
>>> To: user@cassandra.apache.org
>>> Subject: RE: Completely removing a node from the cluster
>>> 
>>> It's been at least 4 days now.
>>> 
>>> -----Original Message-----
>>> From: aaron morton [mailto:aaron@thelastpickle.com] 
>>> Sent: Sunday, August 21, 2011 3:16 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Completely removing a node from the cluster
>>> 
>>> I see the mistake I made about ring: it gets the endpoint list from the same place but uses the tokens to drive the whole process. 
>>> 
>>> I'm guessing here; I don't have time to check all the code. But there is a 3 day timeout in the gossip system. Not sure if it applies in this case. 
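For what it's worth, the 3 day figure mentioned above lines up with the Gossiper's expiry window; a back-of-the-envelope check in Java, with the constant value (259200 seconds, named aVeryLongTime in Gossiper.java as far as I recall) taken as an assumption:

    public class GossipExpiryWindow {
        public static void main(String[] args) {
            // Assumed value of Gossiper's "aVeryLongTime": 259200 seconds, in milliseconds.
            long aVeryLongTimeMs = 259200L * 1000L;
            System.out.println(aVeryLongTimeMs / (1000L * 60 * 60) + " hours, i.e. "
                    + aVeryLongTimeMs / (1000L * 60 * 60 * 24) + " days");
        }
    }

This prints 72 hours, i.e. 3 days.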
>>> 
>>> Anyone know ?
>>> 
>>> Cheers
>>> 
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> 
>>> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>>> 
>>>> Both .2 and .3 report the same thing from the MBean: UnreachableNodes is an empty collection, and LiveNodes still lists all 3 nodes:
>>>> 192.168.20.2
>>>> 192.168.20.3
>>>> 192.168.20.1
>>>> 
>>>> The removetoken was done a few days ago, and I believe the remove was done from .2.
>>>> 
>>>> Here is what the ring output looks like; not sure why I get that token on the empty first line either:
>>>> Address         DC          Rack        Status State   Load            Owns    Token
>>>>                                                                                85070591730234615865843651857942052864
>>>> 192.168.20.2    datacenter1 rack1       Up     Normal  79.53 GB        50.00%  0
>>>> 192.168.20.3    datacenter1 rack1       Up     Normal  42.63 GB        50.00%  85070591730234615865843651857942052864
>>>> 
>>>> Yes, both nodes show the same thing when doing a describe cluster, that .1 is unreachable.
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: aaron morton [mailto:aaron@thelastpickle.com] 
>>>> Sent: Sunday, August 21, 2011 4:23 AM
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Completely removing a node from the cluster
>>>> 
>>>> Unreachable nodes either did not respond to the message or were known to be down and were not sent a message. 
>>>> The way the node lists are obtained for the ring command and describe cluster is the same, so it's a bit odd. 
>>>> 
>>>> Can you connect to JMX and have a look at the o.a.c.db.StorageService MBean? What do the LiveNodes and UnreachableNodes attributes say? 
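If scripting this is easier than clicking through JConsole, here is a minimal sketch that reads the same two attributes over JMX; the host and JMX port are placeholders for this cluster:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RingViewCheck {
        public static void main(String[] args) throws Exception {
            // Point this at each node in turn (20.2 and 20.3) to compare their views of the ring.
            String url = "service:jmx:rmi:///jndi/rmi://192.168.20.2:7199/jmxrmi";
            JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
                System.out.println("LiveNodes:        " + mbs.getAttribute(ss, "LiveNodes"));
                System.out.println("UnreachableNodes: " + mbs.getAttribute(ss, "UnreachableNodes"));
            } finally {
                connector.close();
            }
        }
    }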
>>>> 
>>>> Also, how long ago did you remove the token, and on which machine? Do both 20.2 and 20.3 think 20.1 is still around? 
>>>> 
>>>> Cheers
>>>> 
>>>> 
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>> 
>>>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>>> 
>>>>> I'm on 0.8.4
>>>>> 
>>>>> I have removed a dead node from the cluster using the nodetool removetoken command, and moved one of the remaining nodes to rebalance the tokens. Everything looks fine when I run nodetool ring now: it only lists the remaining 2 nodes and they both look fine, each owning 50% of the tokens.
>>>>> 
>>>>> However, I can still see it being considered part of the cluster from the cassandra-cli (192.168.20.1 being the removed node), and I'm worried that the cluster is still queuing up hints for the node, or about any other issues it may cause:
>>>>> 
>>>>> Cluster Information:
>>>>> Snitch: org.apache.cassandra.locator.SimpleSnitch
>>>>> Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>>>> Schema versions:
>>>>>    dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>>>    UNREACHABLE: [192.168.20.1]
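As far as I know, the CLI's describe cluster output above is built from the Thrift describe_schema_versions call, with unreachable endpoints grouped under the literal key UNREACHABLE. A minimal sketch of polling it directly, assuming that method is available in this 0.8 client (host and rpc_port are placeholders):

    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class SchemaAgreementCheck {
        public static void main(String[] args) throws Exception {
            // Thrift rpc_port defaults to 9160; the host is just one live node.
            TTransport transport = new TFramedTransport(new TSocket("192.168.20.2", 9160));
            transport.open();
            try {
                Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
                // Keys are schema version UUIDs (or the string "UNREACHABLE"),
                // values are the endpoints reporting that version.
                Map<String, List<String>> versions = client.describe_schema_versions();
                for (Map.Entry<String, List<String>> e : versions.entrySet()) {
                    System.out.println(e.getKey() + " -> " + e.getValue());
                }
            } finally {
                transport.close();
            }
        }
    }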
>>>>> 
>>>>> 
>>>>> Do I need to do something else to completely remove this node?
>>>>> 
>>>>> Thanks,
>>>>> Bryce
>>>> 
>>> 
>> 
> 

