Hello Alain,

I solved this by brute force, but I never understood exactly what happened behind the scenes. What I did was:

a) Removed the failed node from the ring with the unsafeAssassinate JMX operation.
b) This caused requests to that node to be routed to the following node, which didn't have the data. To fix that, I inserted a new dummy node with the same token as the failed node, but with "autobootstrap=false".
c) After the node joined the ring again, I did a clean shutdown with:
    nodetool -h localhost disablethrift
    nodetool -h localhost disablegossip && sleep 10
    nodetool -h localhost drain
d) Restarted the bootstrap process on the new node.
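
For reference, the clean shutdown in step c) can be wrapped in a small script. This is only a minimal sketch: it runs the same three nodetool commands in the same order and stops at the first one that fails. The function name clean_shutdown and its parameters are my own (hypothetical), and the reason given for the 10 second pause is my assumption, not something I verified.

```python
import subprocess
import time

# Same order as the manual sequence: stop client traffic, leave gossip, then drain.
SHUTDOWN_STEPS = ["disablethrift", "disablegossip", "drain"]


def clean_shutdown(host="localhost", run=subprocess.check_call,
                   sleep=time.sleep, gossip_wait=10):
    """Run the nodetool shutdown sequence against `host`.

    `run` and `sleep` are injectable (hypothetical parameters) so the
    sequence can be dry-run without a Cassandra node. With the default
    `run`, a failing nodetool command raises CalledProcessError and the
    remaining steps are skipped.
    """
    for step in SHUTDOWN_STEPS:
        run(["nodetool", "-h", host, step])
        if step == "disablegossip":
            # Pause as in the manual sequence (assumption: this lets the
            # other nodes notice that this node has left gossip).
            sleep(gossip_wait)
```

Calling clean_shutdown() on the node itself is meant to be equivalent to typing the three commands by hand. Passing run=print and sleep=lambda s: None only prints the commands, which is a cheap way to check the order first.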

Note that our cluster was not using VNodes. This workaround will probably not work with VNodes, since you cannot specify the 256 tokens from the old node.

This really seems like some kind of metadata inconsistency in gossip, so you should probably check whether your nodetool gossipinfo shows a node that is not supposed to be in the ring and, if so, unsafeAssassinate it. This post has more info about it: http://nartax.com/2012/09/assassinate-cassandra-node/

But make sure you know what you're doing, as this can be a dangerous operation.
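
To make the gossipinfo check above less error-prone on a 32-node cluster, something like the following could compare the endpoints gossip knows about against the list of nodes you expect. It is a rough sketch under assumptions: I assume each endpoint in the gossipinfo output starts on an unindented line ending in /<ip>, followed by indented KEY:VALUE lines, so please compare with your actual output first. The function names parse_gossipinfo and unexpected_endpoints are hypothetical helpers, not part of Cassandra.

```python
def parse_gossipinfo(text):
    """Map each endpoint IP to a dict of its gossip state fields.

    Assumed layout (check against your own output): an unindented line
    ending in "/<ip>" starts a new endpoint, and the indented lines that
    follow are "KEY:VALUE" pairs such as STATUS or LOAD.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # Endpoint header, e.g. "/10.0.0.1" or "hostname/10.0.0.1".
            current = line.strip().rsplit("/", 1)[-1]
            nodes[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain commas or colons.
            key, _, value = line.strip().partition(":")
            nodes[current][key] = value
    return nodes


def unexpected_endpoints(gossip_text, expected_ips):
    """Return, sorted, the endpoints gossip knows about that are not expected."""
    return sorted(set(parse_gossipinfo(gossip_text)) - set(expected_ips))
```

You would feed it the text printed by nodetool -h localhost gossipinfo plus the IPs you believe are in the ring. Anything it returns is only a candidate to look at by hand (its STATUS field is available from parse_gossipinfo), not something to assassinate automatically.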

Good luck!

Cheers,

Paulo

 


On Fri, Feb 14, 2014 at 11:17 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
Hi Paulo,

Did you find out how to fix this issue? I am experiencing the exact same issue after trying to help you on this exact subject a few days ago :).

Config: 32 C* 1.2.11 nodes, VNodes enabled, RF=3, 1 DC, on AWS EC2 m1.xlarge.

We added a few nodes (4) and it seems that this occurs on one node out of two...

INFO 12:52:16,889 Finished streaming session d5e4d014-9558-11e3-950d-cd6aba92807e from /xxx.xxx.xxx.xxx
java.lang.RuntimeException: Unable to fetch range [(20078703525355016727168231761171377180,20105424945623564908585534414693308183], (129753652951782325468767616123724624016,129754698153613057562227134647005586420], (4499106157406300244131405400767888838,4524540663392564361402125588359485564], (122461441134035840782923349842361962551,122462803389597917496737056756119104930], (107970238065835199457922160357012606207,107987706615224138615506976884972465320], (129754698153613057562227134647005586420,129760990520285412763184172827801136526], (38338043252657275110873170917842646549,38368318768493907804399955985800320618], (42022774431506526693485667522039962965,42053289032932587102300879230918436885], (66836265760288088017242608238099612345,66844191330959602627129212011239690831], (52540232739182066369547232798226785314,52559117354438503565212218200939569114], (145046787539667961591986998676504957238,145057153206926436867917708334845130444], (108279691586280658015556401795266720050,108305470056478513440634738885678702409], (40039571254531814244837067525035822613,40053379084508254942645157728035688263], (132027653159543236812527609067336099062,132029648290617316887203744857701890860], (52516518106546460227349801041398186304,52540232739182066369547232798226785314], (151797253868519929321029931533765036527,151828244658375264200603444399788004805], (145057153206926436867917708334845130444,145084033851007428646660791831082771964], (107963567982152736714636832273817259428,107970238065835199457922160357012606207]] for keyspace foo_bar from any hosts


at org.apache.cassandra.dht.RangeStreamer.fetch(RangeStreamer.java:260)
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:84)
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:973)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:740)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:381)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:212)

Cannot load daemon

Service exit with a return value of 3

Hope you'll be able to help me with this one :)



2014-02-07 19:24 GMT+01:00 Robert Coli <rcoli@eventbrite.com>:

On Fri, Feb 7, 2014 at 4:41 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
From the changelog:


1.2.15
 * Move handling of migration event source to solve bootstrap race (CASSANDRA-6648)
Maybe you should give this new version a try, if you suspect your issue to be related to CASSANDRA-6648.

6648 appears to have been introduced in 1.2.14, by :


So it should only affect 1.2.14.

=Rob





--
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
+55 83 9690-1314