cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-8523) Writes should be sent to a replacement node while it is streaming in data
Date Tue, 21 Jun 2016 00:30:58 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337288#comment-15337288 ]

Paulo Motta edited comment on CASSANDRA-8523 at 6/21/16 12:30 AM:
------------------------------------------------------------------

Due to the limitations of forwarding writes to replacement nodes with the same IP, I propose
initially adding this support only to replacement nodes with a different IP, since it's much
simpler and we can do it in a backward-compatible way, so it can probably go into 2.2+.

After CASSANDRA-11559, we can extend this support to nodes with the same IP quite easily by
setting an inactive flag on nodes being replaced and ignoring these nodes on reads.

The central idea is:
{quote}
* Add a new non-dead gossip state for replace: BOOT_REPLACE
* When receiving BOOT_REPLACE, other nodes add the replacing node as a bootstrapping endpoint
* Pending ranges are calculated, and writes are sent to the replacing node during replace
* When the replacing node changes state to NORMAL, the old node is removed and the new node
becomes a natural endpoint in TokenMetadata
{quote}
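
To make the steps above concrete, here is a minimal sketch of the intended routing behaviour. It is a toy model, not the code in the patch: the class {{ReplaceRoutingSketch}}, its methods and the example addresses are all hypothetical, and the real logic lives in Cassandra's gossip handling and TokenMetadata.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified model of one token range. This is NOT Cassandra's
// TokenMetadata API; names and structure are assumptions for illustration.
class ReplaceRoutingSketch {
    enum State { NORMAL, BOOT_REPLACE }

    // Replicas that own the range: they receive writes and serve reads.
    private final Set<String> naturalEndpoints = new LinkedHashSet<>();
    // Replicas still streaming data: they receive writes but serve no reads.
    private final Set<String> pendingEndpoints = new LinkedHashSet<>();

    ReplaceRoutingSketch(Set<String> initialReplicas) {
        naturalEndpoints.addAll(initialReplicas);
    }

    // Called when a gossip state change is observed for the replacing node.
    void onStateChange(String replacing, String replaced, State state) {
        if (state == State.BOOT_REPLACE) {
            // Treat the replacing node like a bootstrapping endpoint.
            pendingEndpoints.add(replacing);
        } else if (state == State.NORMAL && pendingEndpoints.remove(replacing)) {
            // Replace finished: drop the old node, promote the new one.
            naturalEndpoints.remove(replaced);
            naturalEndpoints.add(replacing);
        }
    }

    // Writes go to natural and pending endpoints alike.
    Set<String> writeTargets() {
        Set<String> targets = new LinkedHashSet<>(naturalEndpoints);
        targets.addAll(pendingEndpoints);
        return targets;
    }

    // Reads only go to natural endpoints.
    Set<String> readTargets() {
        return new LinkedHashSet<>(naturalEndpoints);
    }
}
```

In this model, a node announcing BOOT_REPLACE shows up in {{writeTargets()}} immediately but only appears in {{readTargets()}} once it reaches NORMAL, at which point the replaced node disappears from both.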

Since it's no longer necessary to forward hints to the replacement node when {{replace_address
!= broadcast_address}}, the replacement node does not need to inherit the same ID as the original
node.

The replacing process remains unchanged when the replacement node has the same IP as the original
node. If that's the case, I added a warning message so users know they need to run repair if
the original node has been down for longer than {{max_hint_window_in_ms}}:
{noformat}
Writes will not be redirected to this node while it is performing replace because it has the
same address as the node to be replaced ({}). 
If that node has been down for longer than max_hint_window_in_ms, repair must be run after
the replacement process in order to make this node consistent.
{noformat}
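
The decision behind this warning can be sketched as follows. This is only an illustration of the rule described above, not the code in the patch: the class and method names are hypothetical, and addresses are plain strings to keep the example self-contained.

```java
// Hypothetical helper illustrating the rule above; not part of Cassandra.
class ReplaceAddressCheckSketch {
    // Writes are only redirected when the replacement node has a different
    // address from the node being replaced.
    static boolean writesRedirectedDuringReplace(String replaceAddress, String broadcastAddress) {
        return !replaceAddress.equals(broadcastAddress);
    }

    // With the same address there is no redirection, so repair is needed if
    // the replaced node was down for longer than the hint window.
    static boolean repairNeededAfterReplace(String replaceAddress, String broadcastAddress,
                                            long downtimeMs, long maxHintWindowInMs) {
        return !writesRedirectedDuringReplace(replaceAddress, broadcastAddress)
               && downtimeMs > maxHintWindowInMs;
    }
}
```

The sketch only covers the same-address path that triggers the warning. Whether a repair is advisable after a different-address replace is outside its scope.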

I adapted the current dtests to test replace_address for both the old and the new path and, when
{{replace_address != broadcast_address}}, to make sure writes are redirected to the replacement
node.

Initial patch and tests below (will provide 2.2+ patches after initial review):
||2.2||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-8523]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:8523]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-8523-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-8523-dtest/lastCompletedBuild/testReport/]|


was (Author: pauloricardomg):
Due to the limitations of forwarding writes to replacement nodes with the same IP, I propose
initially adding this support only to replacement nodes with a different IP, since it's much
simpler and we can do it in a backward-compatible way so it can probably go on 2.2+.

After CASSANDRA-11559, we can extend this support to nodes with the same IP quite easily by
setting an inactive flag on nodes being replaced and ignore these nodes on read.

The central idea is:
{quote}
* Add a new non-dead gossip state for replace BOOT_REPLACE
* When receiving BOOT_REPLACE, other node adds the replacing node as bootstrapping endpoint
* Pending ranges are calculated, and writes are sent to the replacing node during replace
* When replacing node changes state to NORMAL, the old node is removed and the new node becomes
a natural endpoint on TokenMetadata
* The final step is to change the original node state to REMOVED_TOKEN so other nodes evict
the original node from gossip
{quote}

Since it's no longer necessary to forward hints to the replacement node when {{replace_address
!= broadcast_address}}, the replacement node does not need to inherit the same ID of the original
node.

The replacing process remains unchanged when the replacement node has the same IP as the original
node. If that's the case, I added a warn message so users know they need to run repair if
the node is down for longer than {{max_hint_window_in_ms}}:
{noformat}
Writes will not be redirected to this node while it is performing replace because it has the
same address as the node to be replaced ({}). 
If that node has been down for longer than max_hint_window_in_ms, repair must be run after
the replacement process in order to make this node consistent.
{noformat}

I adapted current dtests to test replace_address for both the old and the new path, and when
{{replace_address != broadcast_address}} make sure writes are being redirected to the replacement
node.

Initial patch and tests below (will provide 2.2+ patches after initial review):
||2.2||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-8523]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:8523]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-8523-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-8523-dtest/lastCompletedBuild/testReport/]|

> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8523
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8523
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Paulo Motta
>             Fix For: 2.1.x
>
>
> In our operations, we make heavy use of replace_address (or replace_address_first_boot)
> in order to replace broken nodes. We now realize that writes are not sent to the replacement
> nodes while they are in hibernate state and streaming in data. This runs counter to what our
> expectations were, especially since we know that writes ARE sent to nodes when they are bootstrapped
> into the ring.
> It seems like Cassandra should arrange to send writes to a node that is in the process
> of replacing another node, just like it does for nodes that are bootstrapping. I hesitate
> to phrase this as "we should send writes to a node in hibernate" because the concept of hibernate
> may be useful in other contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period makes subsequent
> repairs more expensive, proportional to the number of writes that we miss (and depending on
> the amount of data that needs to be streamed during replacement and the time it may take to
> rebuild secondary indexes, we could miss many hours' worth of writes). It also leaves
> us more exposed to consistency violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
