cassandra-user mailing list archives

From Jonathan Colby <jonathan.co...@gmail.com>
Subject Re: Re: nodetool move trying to stream data to node no longer in cluster
Date Fri, 27 May 2011 23:13:48 GMT
Glad to report I fixed this problem.
1. I added the -Dcassandra.load_ring_state=false JVM flag
2. I was able to arrange a time where I could take down the whole
cluster and bring it back up.

After that the "phantom" node disappeared.
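
For anyone hitting the same symptom, here is a minimal sketch of a check for stream targets that are no longer in the ring. It assumes the `nodetool ring` and `nodetool netstats` output looks like the excerpts quoted further down this thread; the function names and the sample strings are hypothetical illustrations, not part of Cassandra.

```python
import re

# Hypothetical helpers; the patterns assume the 0.7-era output layout
# quoted in this thread and may need adjusting for other versions.
RING_ROW = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s+(Up|Down)\s+\S+")
STREAM_TARGET = re.compile(r"^Streaming to:\s*/(\d{1,3}(?:\.\d{1,3}){3})")


def ring_addresses(ring_output):
    """Collect the node addresses listed in `nodetool ring` output."""
    found = set()
    for line in ring_output.splitlines():
        match = RING_ROW.match(line.strip())
        if match:
            found.add(match.group(1))
    return found


def stream_targets(netstats_output):
    """Collect the addresses named on 'Streaming to:' lines of netstats output."""
    found = set()
    for line in netstats_output.splitlines():
        match = STREAM_TARGET.match(line.strip())
        if match:
            found.add(match.group(1))
    return found


def phantom_targets(ring_output, netstats_output):
    """Return stream targets that do not appear in the ring, sorted."""
    return sorted(stream_targets(netstats_output) - ring_addresses(ring_output))


# Sample text shortened from the output quoted in this thread.
SAMPLE_RING = """\
Address         Status State   Load            Owns    Token
10.46.108.100   Up     Normal  71.73 GB        12.50%  0
10.47.108.100   Up     Leaving 281.95 GB       37.50%  85070591730234615865843651857942052863
"""

SAMPLE_NETSTATS = """\
Mode: Leaving: streaming data to other nodes
Streaming to: /10.46.108.102
"""

if __name__ == "__main__":
    print(phantom_targets(SAMPLE_RING, SAMPLE_NETSTATS))
```

In practice the two strings would come from running nodetool against the node being moved; a non-empty result would suggest that node still holds stale endpoint state.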

On Fri, May 27, 2011 at 12:48 AM,  <jonathan.colby@gmail.com> wrote:
> Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion on
> removing it as an endpoint with JMX.
>
> On , aaron morton <aaron@thelastpickle.com> wrote:
>> Off the top of my head, the simple way to stop invalid endpoint state being
>> passed around is a full cluster stop. Obviously that's not an option. The
>> problem is that if one node has the IP it will share it around with the others.
>>
>>
>>
>> Out of interest take a look at the o.a.c.db.FailureDetector MBean
>> getAllEndpointStates() function. That returns the end point state held by
>> the Gossiper. I think you should see the Phantom IP listed in there.
>>
>>
>>
>> If it's only on some nodes *perhaps* restarting the node with the JVM
>> option -Dcassandra.load_ring_state=false *may* help. That will stop the node
>> from loading its saved ring state and force it to get it via gossip. Again,
>> if there are other nodes with the phantom IP it may just get it again.
>>
>>
>>
>> I'll do some digging and try to get back to you. This pops up from time to
>> time and thinking out loud I wonder if it would be possible to add a new
>> application state that purges an IP from the ring. e.g.
>> VersionedValue.STATUS_PURGED that works with a ttl so it goes through X
>> number of gossip rounds and then disappears.
>>
>>
>>
>> Hope that helps.
>>
>>
>>
>>
>>
>> -----------------
>>
>> Aaron Morton
>>
>> Freelance Cassandra Developer
>>
>> @aaronmorton
>>
>> http://www.thelastpickle.com
>>
>>
>>
>> On 26 May 2011, at 19:58, Jonathan Colby wrote:
>>
>>
>>
>> > @Aaron -
>>
>> >
>>
>> > Unfortunately I'm still seeing messages like: " is down", removing from
>> > gossip, although not with the same frequency.
>>
>> >
>>
>> > And repair/move jobs don't seem to try to stream data to the removed
>> > node anymore.
>>
>> >
>>
>> > Does anyone know how to totally purge any stored gossip/endpoint data on
>> > nodes that were removed from the cluster? Or what else might be happening
>> > here?
>>
>> >
>>
>> >
>>
>> > On May 26, 2011, at 9:10 AM, aaron morton wrote:
>>
>> >
>>
>> >> cool. I was going to suggest that but as you already had the move
>> >> running I thought it may be a little drastic.
>>
>> >>
>>
>> >> Did it show any progress ? If the IP address is not responding there
>> >> should have been some sort of error.
>>
>> >>
>>
>> >> Cheers
>>
>> >>
>>
>> >> -----------------
>>
>> >> Aaron Morton
>>
>> >> Freelance Cassandra Developer
>>
>> >> @aaronmorton
>>
>> >> http://www.thelastpickle.com
>>
>> >>
>>
>> >> On 26 May 2011, at 15:28, jonathan.colby@gmail.com wrote:
>>
>> >>
>>
>> >>> Seems like it had something to do with stale endpoint information. I
>> >>> did a rolling restart of the whole cluster and that seemed to trigger the
>> >>> nodes to remove the node that was decommissioned.
>>
>> >>>
>>
>> >>> On , aaron morton <aaron@thelastpickle.com> wrote:
>>
>> >>>> Is it showing progress ? It may just be a problem with the
>> >>>> information printed out.
>>
>> >>>>
>>
>> >>>>
>>
>> >>>>
>>
>> >>>> Can you check from the other nodes in the cluster to see if they are
>> >>>> receiving the stream?
>>
>> >>>>
>>
>> >>>>
>>
>> >>>>
>>
>> >>>> cheers
>>
>> >>>>
>>
>> >>>>
>>
>> >>>>
>>
>> >>>> -----------------
>>
>> >>>>
>>
>> >>>> Aaron Morton
>>
>> >>>>
>>
>> >>>> Freelance Cassandra Developer
>>
>> >>>>
>>
>> >>>> @aaronmorton
>>
>> >>>>
>>
>> >>>> http://www.thelastpickle.com
>>
>> >>>>
>>
>> >>>>
>>
>> >>>>
>>
>> >>>> On 26 May 2011, at 00:42, Jonathan Colby wrote:
>>
>> >>>>
>>
>> >>>>
>>
>> >>>>
>>
>> >>>>> I recently removed a node (with decommission) from our cluster.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> I added a couple new nodes and am now trying to rebalance the
>> >>>>> cluster using nodetool move.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> However, netstats shows that the node being "moved" is trying to
>> >>>>> stream data to the node that I already decommissioned yesterday.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> The removed node was powered off and taken out of DNS; its IP is not
>> >>>>> even pingable. It was never a seed either.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> This is Cassandra 0.7.5 on 64-bit Linux. How do I tell the cluster
>> >>>>> that this node is gone? Gossip should have detected this. The ring
>> >>>>> command shows the correct cluster IPs.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> Here is a portion of netstats. 10.46.108.102 is the node which was
>> >>>>> removed.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> Mode: Leaving: streaming data to other nodes
>>
>> >>>>
>>
>> >>>>> Streaming to: /10.46.108.102
>>
>> >>>>
>>
>> >>>>>
>> >>>>> /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97
>>
>> >>>>
>>
>> >>>>> ...................
>>
>> >>>>
>>
>> >>>>>
>> >>>>> 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)
>>
>> >>>>
>>
>> >>>>>       progress=280574376402/12434049900 - 2256%
>>
>> >>>>
>>
>> >>>>> .....
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> Note 10.46.108.102 is NOT part of the ring.
>>
>> >>>>
>>
>> >>>>>
>>
>> >>>>
>>
>> >>>>> Address         Status State   Load            Owns    Token
>> >>>>>                                                         148873535527910577765226390751398592512
>> >>>>> 10.46.108.100   Up     Normal  71.73 GB        12.50%  0
>> >>>>> 10.46.108.101   Up     Normal  109.69 GB       12.50%  21267647932558653966460912964485513216
>> >>>>> 10.47.108.100   Up     Leaving 281.95 GB       37.50%  85070591730234615865843651857942052863
>> >>>>> 10.47.108.102   Up     Normal  210.77 GB       0.00%   85070591730234615865843651857942052864
>> >>>>> 10.47.108.101   Up     Normal  289.59 GB       16.67%  113427455640312821154458202477256070484
>> >>>>> 10.46.108.103   Up     Normal  299.87 GB       8.33%   127605887595351923798765477786913079296
>> >>>>> 10.47.108.103   Up     Normal  94.99 GB        12.50%  148873535527910577765226390751398592511
>> >>>>> 10.46.108.104   Up     Normal  103.01 GB       0.00%   148873535527910577765226390751398592512
