Good to hear it's resolved, but a full cluster restart is less than ideal.
The closest thing I can think of is https://issues.apache.org/jira/browse/CASSANDRA-2371,
which is resolved in 0.7.5.
Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 28 May 2011, at 11:13, Jonathan Colby wrote:
> Glad to report I fixed this problem.
> 1. I added the load_ring_state=false flag
> 2. I was able to arrange a time where I could take down the whole
> cluster and bring it back up.
>
> After that the "phantom" node disappeared.
>
> On Fri, May 27, 2011 at 12:48 AM, <jonathan.colby@gmail.com> wrote:
>> Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion on
>> removing it as an endpoint with JMX.
>>
>> On , aaron morton <aaron@thelastpickle.com> wrote:
>>> Off the top of my head, the simple way to stop invalid endpoint state being
>>> passed around is a full cluster stop. Obviously that's not an option. The
>>> problem is that if one node has the IP it will share it around with the others.
>>>
>>> Out of interest take a look at the o.a.c.db.FailureDetector MBean
>>> getAllEndpointStates() function. That returns the end point state held by
>>> the Gossiper. I think you should see the Phantom IP listed in there.
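
A minimal sketch of the check described above, reading the endpoint state over JMX and looking for the phantom IP. Several things here are assumptions rather than confirmed details: the MBean object name is passed in as an argument because the exact name depends on the Cassandra version (browse the MBean tree with jconsole to find it), and the parser assumes the returned string lists each endpoint on its own unindented line in the form "/<ip>" or "<host>/<ip>", with indented state lines underneath. The class and helper names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Illustrative helper, not part of Cassandra.
class EndpointStateCheck {

    // Extracts the IPs from the endpoint state text.
    // Assumed layout: unindented "<host>/<ip>" lines, indented state lines below each.
    static List<String> endpointsIn(String allEndpointStates) {
        List<String> endpoints = new ArrayList<>();
        for (String line : allEndpointStates.split("\n")) {
            if (line.isEmpty() || Character.isWhitespace(line.charAt(0))) {
                continue; // blank line or indented application state line
            }
            int slash = line.indexOf('/');
            if (slash >= 0) {
                endpoints.add(line.substring(slash + 1).trim());
            }
        }
        return endpoints;
    }

    // Reads the AllEndpointStates attribute (the JMX view of getAllEndpointStates()).
    static String fetchAllEndpointStates(String host, int jmxPort, String objectName)
            throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            return (String) mbs.getAttribute(new ObjectName(objectName), "AllEndpointStates");
        }
    }

    // args: <host> <jmx port> <FailureDetector MBean object name> <suspect IP>
    public static void main(String[] args) throws Exception {
        String states = fetchAllEndpointStates(args[0], Integer.parseInt(args[1]), args[2]);
        boolean present = endpointsIn(states).contains(args[3]);
        System.out.println(args[3]
                + (present ? " is still in the gossip state" : " was not found in the gossip state"));
    }
}
```

Running it against each node in turn would show which nodes still hold the phantom IP, which matters because any one of them can hand it back to the others.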
>>>
>>> If it's only on some nodes, *perhaps* restarting the node with the JVM
>>> option -Dcassandra.load_ring_state=false *may* help. That will stop the node
>>> from loading its saved ring state and force it to get it via gossip. Again,
>>> if there are other nodes with the phantom IP it may just get it again.
>>>
>>> I'll do some digging and try to get back to you. This pops up from time to
>>> time and, thinking out loud, I wonder if it would be possible to add a new
>>> application state that purges an IP from the ring, e.g.
>>> VersionedValue.STATUS_PURGED, that works with a TTL so it goes through X
>>> number of gossip rounds and then disappears.
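
To make the STATUS_PURGED idea concrete, here is a toy simulation of a purge marker with a TTL counted in gossip rounds. It is purely hypothetical: none of these class or method names exist in Cassandra, the model is a single node with string statuses, and it only illustrates the intended behaviour (stale state from a peer is ignored while the marker is alive, and the marker disappears after X rounds).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch only: none of these names are Cassandra classes or APIs.
class PurgeTtlSketch {
    // endpoint -> last known status, as learned via gossip
    private final Map<String, String> endpointState = new HashMap<>();
    // endpoint -> gossip rounds left before the purge marker is dropped
    private final Map<String, Integer> purgeRoundsLeft = new HashMap<>();

    // Accept state from a peer unless a purge marker is active for that endpoint.
    void learn(String endpoint, String status) {
        if (!purgeRoundsLeft.containsKey(endpoint)) {
            endpointState.put(endpoint, status);
        }
    }

    // Forget the endpoint and keep a marker alive for ttlRounds gossip rounds.
    void purge(String endpoint, int ttlRounds) {
        endpointState.remove(endpoint);
        purgeRoundsLeft.put(endpoint, ttlRounds);
    }

    // One gossip round: age every marker and drop those that have expired.
    void gossipRound() {
        Iterator<Map.Entry<String, Integer>> it = purgeRoundsLeft.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> marker = it.next();
            int left = marker.getValue() - 1;
            if (left <= 0) {
                it.remove();
            } else {
                marker.setValue(left);
            }
        }
    }

    boolean knows(String endpoint) {
        return endpointState.containsKey(endpoint);
    }

    boolean isPurging(String endpoint) {
        return purgeRoundsLeft.containsKey(endpoint);
    }

    public static void main(String[] args) {
        PurgeTtlSketch node = new PurgeTtlSketch();
        node.learn("10.46.108.102", "NORMAL");
        node.purge("10.46.108.102", 3);
        node.learn("10.46.108.102", "NORMAL"); // stale gossip from a peer is ignored
        System.out.println("known while purging: " + node.knows("10.46.108.102"));
        for (int round = 0; round < 3; round++) {
            node.gossipRound();
        }
        System.out.println("marker still active: " + node.isPurging("10.46.108.102"));
    }
}
```

The sketch also shows the weak spot of the idea: once the marker expires, a peer that still holds the stale state can reintroduce it, so X would have to be large enough for the purge to reach every node first.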
>>>
>>> Hope that helps.
>>>
>>> On 26 May 2011, at 19:58, Jonathan Colby wrote:
>>>
>>>> @Aaron -
>>>>
>>>> Unfortunately I'm still seeing messages like: " is down", removing from
>>>> gossip, although not with the same frequency.
>>>>
>>>> And repair/move jobs don't seem to try to stream data to the removed
>>>> node anymore.
>>>>
>>>> Does anyone know how to totally purge any stored gossip/endpoint data on
>>>> nodes that were removed from the cluster? Or what might be happening here
>>>> otherwise?
>>>>
>>>> On May 26, 2011, at 9:10 AM, aaron morton wrote:
>>>>
>>>>> Cool. I was going to suggest that, but as you already had the move
>>>>> running I thought it might be a little drastic.
>>>>>
>>>>> Did it show any progress? If the IP address is not responding there
>>>>> should have been some sort of error.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 26 May 2011, at 15:28, jonathan.colby@gmail.com wrote:
>>>>>
>>>>>> Seems like it had something to do with stale endpoint information. I
>>>>>> did a rolling restart of the whole cluster and that seemed to trigger the
>>>>>> nodes to remove the node that was decommissioned.
>>>>>>
>>>>>> On , aaron morton <aaron@thelastpickle.com> wrote:
>>>>>>
>>>>>>> Is it showing progress? It may just be a problem with the
>>>>>>> information printed out.
>>>>>>>
>>>>>>> Can you check from the other nodes in the cluster to see if they are
>>>>>>> receiving the stream?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 26 May 2011, at 00:42, Jonathan Colby wrote:
>>>>>>>
>>>>>>>> I recently removed a node (with decommission) from our cluster.
>>>>>>>>
>>>>>>>> I added a couple of new nodes and am now trying to rebalance the
>>>>>>>> cluster using nodetool move.
>>>>>>>>
>>>>>>>> However, netstats shows that the node being "moved" is trying to
>>>>>>>> stream data to the node that I already decommissioned yesterday.
>>>>>>>>
>>>>>>>> The removed node was powered off and taken out of DNS; its IP is not
>>>>>>>> even pingable. It was never a seed either.
>>>>>>>>
>>>>>>>> This is Cassandra 0.7.5 on 64-bit Linux. How do I tell the cluster
>>>>>>>> that this node is gone? Gossip should have detected this. The ring
>>>>>>>> command shows the correct cluster IPs.
>>>>>>>>
>>>>>>>> Here is a portion of netstats. 10.46.108.102 is the node which was
>>>>>>>> removed.
>>>>>>>>
>>>>>>>> Mode: Leaving: streaming data to other nodes
>>>>>>>> Streaming to: /10.46.108.102
>>>>>>>> /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97
>>>>>>>> ...................
>>>>>>>> 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)
>>>>>>>> progress=280574376402/12434049900 - 2256%
>>>>>>>> .....
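
A quick sanity check on the progress line above, as a small sketch. It assumes the two numbers are bytes transferred and total bytes and that the percentage is their integer ratio times 100; the thread does not confirm how netstats derives the figure. Under that assumption the counter has run past its own total by a factor of about 22, which fits the suspicion that the printed information itself is unreliable.

```java
// Illustrative arithmetic only; the field meanings are an assumption.
class StreamProgress {

    // Integer percentage, assuming progress = transferred * 100 / total.
    static long percent(long transferred, long total) {
        return transferred * 100 / total;
    }

    public static void main(String[] args) {
        long transferred = 280574376402L; // first number on the progress line
        long total = 12434049900L;        // second number on the progress line
        System.out.println(percent(transferred, total) + "%");
        System.out.println("transferred exceeds total: " + (transferred > total));
    }
}
```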
>>>>>>>>
>>>>>>>> Note 10.46.108.102 is NOT part of the ring.
>>>>>>>>
>>>>>>>> Address         Status  State    Load       Owns    Token
>>>>>>>>                                                     148873535527910577765226390751398592512
>>>>>>>> 10.46.108.100   Up      Normal   71.73 GB   12.50%  0
>>>>>>>> 10.46.108.101   Up      Normal   109.69 GB  12.50%  21267647932558653966460912964485513216
>>>>>>>> 10.47.108.100   Up      Leaving  281.95 GB  37.50%  85070591730234615865843651857942052863
>>>>>>>> 10.47.108.102   Up      Normal   210.77 GB  0.00%   85070591730234615865843651857942052864
>>>>>>>> 10.47.108.101   Up      Normal   289.59 GB  16.67%  113427455640312821154458202477256070484
>>>>>>>> 10.46.108.103   Up      Normal   299.87 GB  8.33%   127605887595351923798765477786913079296
>>>>>>>> 10.47.108.103   Up      Normal   94.99 GB   12.50%  148873535527910577765226390751398592511
>>>>>>>> 10.46.108.104   Up      Normal   103.01 GB  0.00%   148873535527910577765226390751398592512
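
For reading the ring output above, here is a small sketch of how the Owns column can be reproduced from the tokens. It assumes RandomPartitioner, whose token space is 0 to 2^127, and that each node owns the range from the previous token (exclusive) up to its own token (inclusive), wrapping around at the top. The class name is made up for illustration. Under those assumptions the two 0.00% entries are simply nodes whose tokens sit one position after a neighbour's, which is what a move in progress looks like.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Illustrative helper, not part of Cassandra or nodetool.
class RingOwnership {
    // Size of the token space, assuming RandomPartitioner (2^127).
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    // Percentage of the ring between previousToken (exclusive) and token (inclusive).
    static BigDecimal ownsPercent(BigInteger previousToken, BigInteger token) {
        // mod() keeps the width non-negative when the range wraps past the top of the ring.
        BigInteger width = token.subtract(previousToken).mod(RING_SIZE);
        return new BigDecimal(width.multiply(BigInteger.valueOf(100)))
                .divide(new BigDecimal(RING_SIZE), 2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Tokens from the ring listing, in ring order.
        String[] tokens = {
            "0",
            "21267647932558653966460912964485513216",
            "85070591730234615865843651857942052863",
            "85070591730234615865843651857942052864",
            "113427455640312821154458202477256070484",
            "127605887595351923798765477786913079296",
            "148873535527910577765226390751398592511",
            "148873535527910577765226390751398592512",
        };
        for (int i = 0; i < tokens.length; i++) {
            BigInteger previous = new BigInteger(tokens[(i + tokens.length - 1) % tokens.length]);
            BigInteger current = new BigInteger(tokens[i]);
            System.out.println(tokens[i] + " owns " + ownsPercent(previous, current) + "%");
        }
    }
}
```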