cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-9667) strongly consistent membership and ownership
Date Sat, 27 Jun 2015 14:14:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604161#comment-14604161
] 

Jason Brown edited comment on CASSANDRA-9667 at 6/27/15 2:13 PM:
-----------------------------------------------------------------

As a back of the envelope example, the workflow for adding a new node could look something
like this:

auto-join mode (for a single node)
# new node comes up, contacts seed node for membership/ownership info
** node needs to contact seed as new node does node have existing membership info (the seed
does)
 # seed will immediately return current membership and ownership info, but no tokens for the
new node (see next steps)
 # seed will generate tokens (as per new TokenAllocation class)
 # run LWT 'claiming operation' (see elsewhere in this gist)
 # "broadcast" out the update to existing cluster nodes
 ** existing gossip is probably sufficient for this
 # seed sends tokens to new node, at which point the new node can start bootstrap streaming
data from peers
 ** an alternative could be to let new node learn about it's tokens via gossip

manual join (can be done for one or multiple nodes)
 # new node(s) come up, contact seed(s) to let them know we are joining but want to be part
of a transaction
 # seed will immediately return current membership and ownership info, but no tokens for the
new node (see next steps)
 # operator executes "nodetool joinall" on any existing node. optionally, operator can pass
in explicit IP addresses to be added(?)
    1. seed will generate tokens (as per new TokenAllocation class)
    2. run LWT 'claiming operation' (see elsewhere in this gist)
    3. "broadcast" out the update to existing cluster nodes
 # seed sends tokens to new node, at which point the new node can start bootstrap streaming
data from peers 



was (Author: jasobrown):
As a back of the envelope example, the workflow for adding a new node could look something
like this:

auto-join mode (for a single node)
# new node comes up, contacts seed node for membership/ownership info
** node needs to contact seed as new node does node have existing membership info (the seed
does)
 # seed will immediately return current membership and ownership info, but no tokens for the
new node (see next steps)
 # seed will generate tokens (as per new TokenAllocation class)
 # run LWT 'claiming operation' (see elsewhere in this gist)
 # "broadcast" out the update to existing cluster nodes
    * existing gossip is probably sufficient for this
 # seed sends tokens to new node, at which point the new node can start bootstrap streaming
data from peers
    * an alternative could be to let new node learn about it's tokens via gossip

manual join (can be done for one or multiple nodes)
 # new node(s) come up, contact seed(s) to let them know we are joining but want to be part
of a transaction
 # seed will immediately return current membership and ownership info, but no tokens for the
new node (see next steps)
 # operator executes "nodetool joinall" on any existing node. optionally, operator can pass
in explicit IP addresses to be added(?)
    1. seed will generate tokens (as per new TokenAllocation class)
    2. run LWT 'claiming operation' (see elsewhere in this gist)
    3. "broadcast" out the update to existing cluster nodes
 # seed sends tokens to new node, at which point the new node can start bootstrap streaming
data from peers 


> strongly consistent membership and ownership
> --------------------------------------------
>
>                 Key: CASSANDRA-9667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>              Labels: LWT, membership, ownership
>             Fix For: 3.x
>
>
> Currently, there is advice to users to "wait two minutes between adding new nodes" in
order for new node tokens, et al, to propagate. Further, as there's no coordination amongst
joining node wrt token selection, new nodes can end up selecting ranges that overlap with
other joining nodes. This causes a lot of duplicate streaming from the existing source nodes
as they shovel out the bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent membership
and ownership changes in cassandra such that changes are performed in a linearizable and safe
manner. The basic idea is to use LWT operations over a global system table, and leverage the
linearizability of LWT for ensuring the safety of cluster membership/ownership state changes.
This work is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and range move (there
may be others I'm not thinking of) will need to be modified to participate in this scheme,
as well as changes to nodetool to enable them.
> Note: we distinguish between membership and ownership in the following ways: for membership
we mean "a host in this cluster and it's state". For ownership, we mean "what tokens (or ranges)
does each node own"; these nodes must already be a member to be assigned tokens.
> A rough draft sketch of how the 'add new node' workflow might look like is: new nodes
would no longer create tokens themselves, but instead contact a member of a Paxos cohort (via
a seed). The cohort member will generate the tokens and execute a LWT transaction, ensuring
a linearizable change to the membership/ownership state. The updated state will then be disseminated
via the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode and manual-mode.
Auto-mode is for adding a single new node per LWT operation, and would require no operator
intervention (much like today). In manual-mode, however, multiple new nodes could (somehow)
signal their their intent to join to the cluster, but will wait until an operator executes
a nodetool command that will trigger the token generation and LWT operation for all pending
new nodes. This will allow us better range partitioning and will make the bootstrap streaming
more efficient as we won't have overlapping range requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message