cassandra-commits mailing list archives

From "Kaide Mu (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12008) Make decommission operations resumable
Date Wed, 03 Aug 2016 16:53:20 GMT


Kaide Mu commented on CASSANDRA-12008:

New dtests patch:
New implementation patch:

bq. Error while decommissioning node is never printed because the ExecutionException is being
wrapped in a RuntimeException on unbootstrap, so perhaps you can modify unbootstrap to throw
ExecutionException | InterruptedException and catch that on decommission to wrap in RuntimeException.
Done. ExecutionException | InterruptedException are now handled directly by unbootstrap.
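
The flow suggested in the review, where the streaming failure propagates out of unbootstrap and is wrapped in a RuntimeException only at the decommission level, can be sketched as a Python analogy (all names here are illustrative, not the actual Cassandra code):

```python
import concurrent.futures


class StreamError(Exception):
    """Stand-in for a streaming failure (hypothetical)."""


def stream_ranges():
    # Simulated streaming task that fails.
    raise StreamError("Stream failed")


def unbootstrap(executor):
    # Wait on the streaming future and let the failure propagate to
    # the caller instead of wrapping it here (in Python,
    # Future.result() re-raises the task's exception directly).
    future = executor.submit(stream_ranges)
    future.result()


def decommission():
    # decommission is the single place that wraps the failure, so
    # the "Error while decommissioning node" path is reachable.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
        try:
            unbootstrap(ex)
        except StreamError as exc:
            raise RuntimeError("Error while decommissioning node") from exc
```

Keeping the wrapping in one place means the error message is emitted exactly once, at the operation boundary the user invoked.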

bq. When verifying if the retrieved data is correct on resumable_decommission_test, you need
to stop either node1 or node3 when querying the other otherwise the data may be in only one
of these nodes (while it must be in both nodes, since RF=2 and N=2).
Instead of stopping and starting nodes, I changed the stress read to use CL=TWO; this way I
guess we can ensure that both node1 and node3 are replying to the request. Also, if we do stop
and restart a node, it seems the restarted node will raise an error, because it watches node2's
log to confirm the restarting node is alive, which is not possible since node2 is down.
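
The CL=TWO argument can be checked with a little arithmetic: with RF=2 and only two nodes left after the decommission, a read at CL=TWO needs a response from both replicas, so a passing stress read implies the data reached both nodes. A minimal sketch of that reasoning (illustrative helper, not dtest code):

```python
def responses_required(cl, rf):
    """Replica responses a read needs at the given consistency level.

    Simplified model covering only ONE/TWO/QUORUM/ALL.
    """
    levels = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }
    return min(levels[cl], rf)


# With RF=2 and only node1/node3 remaining, CL=TWO forces both
# replicas to answer, so a successful read verifies both copies.
```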

bq. Instead of counting for decommission_error you can add a self.fail("second rebuild should
fail") after node2.nodetool('decommission') and on the except part perhaps check that the
following message is being printed on logs: Error while decommissioning node
I guess I'll instead use assertRaises, which seems more suitable to ensure NodetoolError
is raised. WDYT [~pauloricardomg] [~yukim]?
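
The assertRaises variant might look like the sketch below. NodetoolError and run_decommission are stand-ins here (the real ones come from the ccm/dtest libraries), so this only illustrates the assertion pattern, not the actual test:

```python
import unittest


class NodetoolError(Exception):
    """Stand-in for ccmlib's NodetoolError (hypothetical)."""


def run_decommission():
    # Hypothetical helper simulating a second decommission attempt
    # that must fail once the node is already LEAVING.
    raise NodetoolError("Error while decommissioning node")


class ResumableDecommissionTest(unittest.TestCase):
    def test_second_decommission_fails(self):
        # assertRaises makes the expectation explicit: the test
        # fails if no NodetoolError is raised at all.
        with self.assertRaises(NodetoolError) as ctx:
            run_decommission()
        self.assertIn("Error while decommissioning node", str(ctx.exception))
```

Compared with a try/except plus a manual self.fail(), the context-manager form both enforces that the exception occurs and exposes it for message checks.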

> Make decommission operations resumable
> --------------------------------------
>                 Key: CASSANDRA-12008
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Tom van der Woerdt
>            Assignee: Kaide Mu
>            Priority: Minor
> We're dealing with large data sets (multiple terabytes per node) and sometimes we need
to add or remove nodes. These operations are very dependent on the entire cluster being up,
so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong
and in a lot of cases something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem:
> {code}
> 03:18 PM   ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
>         at
>         at$
>         at$DirectExecutor.execute(
>         at
>         at
>         at
>         at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(
>         at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(
>         at org.apache.cassandra.streaming.StreamSession.closeSession(
>         at org.apache.cassandra.streaming.StreamSession.complete(
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(
>         at org.apache.cassandra.streaming.ConnectionHandler$
>         at
> 08:04 PM   ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart
> See 'nodetool help' or 'nodetool help <command>'.
> {code}
> Streaming failed, probably due to load:
> {code}
> ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 - [Stream #<streamid>] Streaming error occurred
> null
>         at$ ~[na:1.8.0_77]
>         at ~[na:1.8.0_77]
>         at java.nio.channels.Channels$
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(
>         at org.apache.cassandra.streaming.ConnectionHandler$
>         at [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission resume'?

This message was sent by Atlassian JIRA
