incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Bailey <n...@riptano.com>
Subject Re: Errors when decommissioning - 0.7 RC1
Date Wed, 15 Dec 2010 14:46:04 GMT
Your logs seem to indicate you have some node flapping problems.

The logs complain about nodes *.18 and *.17 going down and immediately
coming back up, even before the decommission.  Is that common throughout
your logs?  If thats the case I would attribute seeing *.17 down after the
decommission to that.

I also have a question about the log from the additional node. The log
indicates that *.19 was not a part of the cluster until you ran the
removetoken command.  Can you verify that the ring output for that node was
what you saw *before* you ran the removetoken command?

On Wed, Dec 15, 2010 at 8:34 AM, Nick Bailey <nick@riptano.com> wrote:

> Just realized the ring output is included in the logs for both of those
> nodes.  Disregard my earlier request :).
>
>
> On Wed, Dec 15, 2010 at 8:27 AM, Nick Bailey <nick@riptano.com> wrote:
>
>> This is rc2 I am assuming?
>>
>> One thing about remove, the removetoken force command is meant to be run
>> on the node that originally started a remove and doesn't take a token
>> parameter.  Not relevant to you problem though.
>>
>> Is this a test cluster and have you tried to reproduce the error? I would
>> be interested to know what the ring command looks like on both *.19 and *.17
>> after the decommission is run.  I assume you were running the ring command
>> on another node?  I'll look into the logs more and see if anything jumps
>> out.
>>
>>
>> On Wed, Dec 15, 2010 at 6:37 AM, Dan Hendry <dan.hendry.junk@gmail.com>wrote:
>>
>>> I am seeing very strange things when trying to decommission a node in my
>>> cluster (detailed logs attached). Here is a nodetool ring report **after
>>> ** decommissioning of node 192.168.4.19  (as seen by any other, properly
>>> functioning node).
>>>
>>>
>>>
>>> 192.168.4.15    Up     Normal  49.9 GB         25.00%
>>> 42535295865117307932921825928971026431
>>>
>>> 192.168.4.20    Up     Normal  42.56 GB        8.33%
>>> 56713727820156410577229101238628035242
>>>
>>> 192.168.4.16    Up     Normal  29.17 GB        16.67%
>>> 85070591730234615865843651857942052863
>>>
>>> 192.168.4.19    Down   Leaving 54.11 GB        16.67%
>>> 113427455640312821154458202477256070484
>>>
>>> 192.168.4.17    Down   Normal  48.88 GB        8.33%
>>> 127605887595351923798765477786913079295
>>>
>>> 192.168.4.18    Up     Normal  59.44 GB        25.00%
>>> 170141183460469231731687303715884105726
>>>
>>> 192.168.4.12    Up     Normal  52.3 GB         0.00%
>>> 170141183460469231731687303715884105727
>>>
>>>
>>>
>>>
>>>
>>> What I am seeing is that after nodetool decommission completes on
>>> 192.168.4.19, the next node in the ring (192.168.4.17) ‘dies’ (see attached
>>> log, its nodetool ring report is quite different). By ‘dies’ I mean that
it
>>> stops communicating with other nodes (but the Cassandra process is still
>>> running and, among other things, compaction continues). After restarting
>>> Cassandra on 192.168.4.17, the ring state gets stuck and the decommissioned
>>> node (192.168.4.19) does not get removed (at least from the nodetool ring
>>> report):
>>>
>>>
>>>
>>> 192.168.4.15    Up     Normal  49.9 GB         25.00%
>>> 42535295865117307932921825928971026431
>>>
>>> 192.168.4.20    Up     Normal  42.56 GB        8.33%
>>> 56713727820156410577229101238628035242
>>>
>>> 192.168.4.16    Up     Normal  29.17 GB        16.67%
>>> 85070591730234615865843651857942052863
>>>
>>> 192.168.4.19    Down   Leaving 54.11 GB        16.67%
>>> 113427455640312821154458202477256070484
>>>
>>> 192.168.4.17    Up     Normal  69.12 GB        8.33%
>>> 127605887595351923798765477786913079295
>>>
>>> 192.168.4.18    Up     Normal  58.88 GB        25.00%
>>> 170141183460469231731687303715884105726
>>>
>>> 192.168.4.12    Up     Normal  52.3 GB         0.00%
>>> 170141183460469231731687303715884105727
>>>
>>>
>>>
>>>
>>>
>>> Furthermore, when I try running “nodetool removetoken
>>> 113427455640312821154458202477256070484”, I get:
>>>
>>>
>>>
>>> Exception in thread "main" java.lang.UnsupportedOperationException: Node
>>> /192.168.4.19 is already being removed.
>>>
>>>                 at
>>> org.apache.cassandra.service.StorageService.removeToken(StorageService.java:1731)
>>>
>>>                 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>>
>>>                 at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>
>>>                 at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>
>>>                 at java.lang.reflect.Method.invoke(Method.java:597)
>>>
>>>                 at
>>> com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>>>
>>>
>>>
>>>
>>>
>>> And when I try running “nodetool removetoken force
>>> 113427455640312821154458202477256070484”, I get:
>>>
>>>
>>>
>>> RemovalStatus: No token removals in process.
>>>
>>> Exception in thread "main" java.lang.NullPointerException
>>>
>>>                 at
>>> org.apache.cassandra.service.StorageService.forceRemoveCompletion(StorageService.java:1703)
>>>
>>>                 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>>
>>>                 at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>
>>>                 at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>
>>>                 at java.lang.reflect.Method.invoke(Method.java:597)
>>>
>>>                 at
>>> com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>>>
>>>
>>>
>>> ?!?!?!?
>>>
>>>
>>>
>>> I think have seen this type of behaviour once or twice before (I believe
>>> 0.7 b1 or later) but wrote it off as being caused by my misguided tinkering
>>> and/or other Cassandra bugs. This time around, I have done very little with
>>> JMX/CLI/nodetool and I can find no related Cassandra bugs.
>>>
>>>
>>>
>>> Help/suggestions?
>>>
>>>
>>>
>>> Dan Hendry
>>>
>>> (403) 660-2297
>>>
>>>
>>>
>>
>>
>

Mime
View raw message