Subject: Re: Completely removing a node from the cluster
From: aaron morton <aaron@thelastpickle.com>
Date: Tue, 23 Aug 2011 19:26:58 +1200
To: user@cassandra.apache.org

I'm running low on ideas for this one. Anyone else?

If the phantom node is not listed in the ring, other nodes should not be
storing hints for it. You can see which nodes they are storing hints for
via JConsole.

You can try a rolling restart, passing the JVM option
-Dcassandra.load_ring_state=false to each node. However, if the phantom
node is being passed around in the gossip state it will probably just
come back again.
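
To check whether the phantom has come back after the restart, something
like the rough Python sketch below could compare the addresses that
nodetool ring prints with the endpoints that describe cluster lists in
cassandra-cli. The function names are made up for illustration, and the
parsing assumes the 0.8-style output pasted further down in this thread;
none of it is part of Cassandra.

```python
import re

# Matches a dotted-quad IPv4 address.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def ring_addresses(ring_output):
    """Addresses that start a row of `nodetool ring` output (hypothetical helper)."""
    addresses = set()
    for line in ring_output.splitlines():
        fields = line.split()
        # Header and token-only lines do not start with an IP address.
        if fields and IPV4.fullmatch(fields[0]):
            addresses.add(fields[0])
    return addresses


def cluster_endpoints(describe_output):
    """Every endpoint under 'Schema versions', UNREACHABLE ones included."""
    endpoints = set()
    for line in describe_output.splitlines():
        # Endpoint lists look like "<schema version or UNREACHABLE>: [ip, ip]".
        match = re.search(r":\s*\[([^\]]*)\]", line)
        if match:
            endpoints.update(IPV4.findall(match.group(1)))
    return endpoints


def phantom_nodes(ring_output, describe_output):
    """Endpoints the cluster description knows about but the ring does not."""
    return cluster_endpoints(describe_output) - ring_addresses(ring_output)


# Sample input copied from the output quoted in this thread.
RING = """\
Address         DC          Rack        Status State   Load      Owns    Token
                                                                         85070591730234615865843651857942052864
192.168.20.2    datacenter1 rack1       Up     Normal  79.53 GB  50.00%  0
192.168.20.3    datacenter1 rack1       Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
"""

CLUSTER = """\
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
        dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
        UNREACHABLE: [192.168.20.1]
"""

if __name__ == "__main__":
    print(sorted(phantom_nodes(RING, CLUSTER)))  # ['192.168.20.1']
```

An empty result after the restart would suggest the removed node is no
longer being gossiped; anything else means it has come back.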


Cheers


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:

> Could this ghost node be causing my hints column family to grow to
> this size? I also crash after about 24 hours because commit log growth
> takes up all the drive space. A manual nodetool flush keeps it under
> control though.
>
>
>                Column Family: HintsColumnFamily
>                SSTable count: 6
>                Space used (live): 666480352
>                Space used (total): 666480352
>                Number of Keys (estimate): 768
>                Memtable Columns Count: 1043
>                Memtable Data Size: 461773
>                Memtable Switch Count: 3
>                Read Count: 38
>                Read Latency: 131.289 ms.
>                Write Count: 582108
>                Write Latency: 0.019 ms.
>                Pending Tasks: 0
>                Key cache capacity: 7
>                Key cache size: 6
>                Key cache hit rate: 0.8333333333333334
>                Row cache: disabled
>                Compacted row minimum size: 2816160
>                Compacted row maximum size: 386857368
>                Compacted row mean size: 120432714
>
> Is there a way for me to manually remove this dead node?
>
> -----Original Message-----
> From: Bryce Godfrey [mailto:Bryce.Godfrey@azaleos.com]
> Sent: Sunday, August 21, 2011 9:09 PM
> To: user@cassandra.apache.org
> Subject: RE: Completely removing a node from the cluster
>
> It's been at least 4 days now.
>
> -----Original Message-----
> From: aaron morton [mailto:aaron@thelastpickle.com]
> Sent: Sunday, August 21, 2011 3:16 PM
> To: user@cassandra.apache.org
> Subject: Re: Completely removing a node from the cluster
>
> I see the mistake I made about ring: it gets the endpoint list from the
> same place but uses the tokens to drive the whole process.
>
> I'm guessing here, don't have time to check all the code. But there is
> a 3 day timeout in the gossip system. Not sure if it applies in this
> case.
>
> Anyone know?
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>
>> Both .2 and .3 show the same thing from the MBean: Unreachable is an
>> empty collection, and the Live node list still has all 3 nodes:
>> 192.168.20.2
>> 192.168.20.3
>> 192.168.20.1
>>
>> The removetoken was done a few days ago, and I believe the remove was
>> done from .2
>>
>> Here is what the ring output looks like, not sure why I get that token
>> on the empty first line either:
>> Address         DC          Rack        Status State   Load            Owns    Token
>>                                                                                85070591730234615865843651857942052864
>> 192.168.20.2    datacenter1 rack1       Up     Normal  79.53 GB        50.00%  0
>> 192.168.20.3    datacenter1 rack1       Up     Normal  42.63 GB        50.00%  85070591730234615865843651857942052864
>>
>> Yes, both nodes show the same thing when doing a describe cluster:
>> that .1 is unreachable.
>>
>>
>> -----Original Message-----
>> From: aaron morton [mailto:aaron@thelastpickle.com]
>> Sent: Sunday, August 21, 2011 4:23 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Completely removing a node from the cluster
>>
>> Unreachable nodes either did not respond to the message or were known
>> to be down and were not sent a message.
>> The way the node lists are obtained for the ring command and describe
>> cluster is the same. So it's a bit odd.
>>
>> Can you connect to JMX and have a look at the o.a.c.db.StorageService
>> MBean? What do the LiveNodes and UnreachableNodes attributes say?
>>
>> Also how long ago did you remove the token and on which machine? Do
>> both 20.2 and 20.3 think 20.1 is still around?
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>
>>> I'm on 0.8.4
>>>
>>> I have removed a dead node from the cluster using the nodetool
>>> removetoken command, and moved one of the remaining nodes to rebalance
>>> the tokens. Everything looks fine when I run nodetool ring now, as it
>>> only lists the remaining 2 nodes and they both look fine, owning 50% of
>>> the tokens.
>>>
>>> However, I can still see it being considered part of the cluster
>>> from cassandra-cli (192.168.20.1 being the removed node), and I'm
>>> worried that the cluster is still queuing up hints for the node, or
>>> about any other issues it may cause:
>>>
>>> Cluster Information:
>>> Snitch: org.apache.cassandra.locator.SimpleSnitch
>>> Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>> Schema versions:
>>>      dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>      UNREACHABLE: [192.168.20.1]
>>>
>>>
>>> Do I need to do something else to completely remove this node?
>>>
>>> Thanks,
>>> Bryce
>>
>