cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kaide Mu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12008) Make decommission operations resumable
Date Sun, 24 Jul 2016 19:07:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391154#comment-15391154
] 

Kaide Mu commented on CASSANDRA-12008:
--------------------------------------

A working skeleton is now available, you check it via https://github.com/apache/cassandra/compare/trunk...kdmu:trunk-12008

Now we can relaunch decommission if it failed previously and {{AvailableRanges}} will be truncated
according to {{OperationMode}}, but it seems {{StreamStateStore}} is not recording properly
transferred ranges (nothing is recorded), I did a test with 3 nodes, decommissioning node2
and interrupting node1, i.e only node3 stream session will be completed, after partial decommission
I manually checked {{AvailableRanges}} with {{SELECT * FROM system.available_ranges;}} which
returned an empty result. I guess everything is set-up correctly, would you mind to taking
a look?

Thanks.

> Make decommission operations resumable
> --------------------------------------
>
>                 Key: CASSANDRA-12008
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12008
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Tom van der Woerdt
>            Assignee: Kaide Mu
>            Priority: Minor
>
> We're dealing with large data sets (multiple terabytes per node) and sometimes we need
to add or remove nodes. These operations are very dependent on the entire cluster being up,
so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong
and in a lot of cases something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem :
> {code}
> 03:18 PM   ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
>         at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
>         at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
>         at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
>         at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
>         at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
>         at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
>         at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
>         at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
>         at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
>         at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
>         at java.lang.Thread.run(Thread.java:745)
> 08:04 PM   ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal
or restart
> See 'nodetool help' or 'nodetool help <command>'.
> {code}
> Streaming failed, probably due to load :
> {code}
> ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream
#<streamid>] Streaming error occurred
> java.net.SocketTimeoutException: null
>         at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77]
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77]
>         at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
~[na:1.8.0_77]
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54)
~[apache-cassandra-3.0.6.jar:3.0.6]
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268)
~[apache-cassandra-3.0.6.jar:3.0.6]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission resume'?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message