Date: Mon, 11 Jul 2016 20:42:11 +0000 (UTC)
From: "Joel Knighton (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Subject: [jira] [Updated] (CASSANDRA-9667) strongly consistent membership and ownership
archived-at: Mon, 11 Jul 2016 20:42:13 -0000

     [ https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-9667:
-------------------------------------
    Reviewer: Jason Brown

> strongly consistent membership and ownership
> --------------------------------------------
>
>                 Key: CASSANDRA-9667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jason Brown
>            Assignee: Joel Knighton
>              Labels: LWT, membership, ownership
>             Fix For: 3.x
>
>
> Currently, users are advised to "wait two minutes between adding new nodes" so that new node tokens, et al., can propagate. Further, as there's no coordination among joining nodes with respect to token selection, new nodes can end up selecting ranges that overlap with those of other joining nodes. This causes a lot of duplicate streaming from the existing source nodes as they shovel out the bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent membership and ownership changes in Cassandra, such that changes are performed in a linearizable and safe manner. The basic idea is to use LWT operations over a global system table and leverage the linearizability of LWT to ensure the safety of cluster membership/ownership state changes. This work is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and range move (there may be others I'm not thinking of) will need to be modified to participate in this scheme, and nodetool will need changes to enable them.
> Note: we distinguish between membership and ownership in the following ways: by membership we mean "a host in this cluster and its state". By ownership we mean "what tokens (or ranges) does each node own"; a node must already be a member to be assigned tokens.
> A rough sketch of how the 'add new node' workflow might look: new nodes would no longer create tokens themselves, but would instead contact a member of a Paxos cohort (via a seed). The cohort member would generate the tokens and execute an LWT, ensuring a linearizable change to the membership/ownership state. The updated state would then be disseminated via the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode and manual-mode. Auto-mode is for adding a single new node per LWT operation, and would require no operator intervention (much like today). In manual-mode, however, multiple new nodes could (somehow) signal their intent to join the cluster, but would wait until an operator executes a nodetool command that triggers the token generation and LWT operation for all pending new nodes. This would allow better range partitioning and make the bootstrap streaming more efficient, as we wouldn't have overlapping range requests.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
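
The 'add new node' workflow in the description can be illustrated with a small, purely in-memory sketch. It is not Cassandra code and uses no Cassandra API: `StateTable` is a hypothetical stand-in for the global system table, `compare_and_set` mimics the conditional update an LWT provides, and `pick_token` (midpoint of the widest gap on a small integer ring) is an assumed token policy chosen only to make the example concrete. The point is the read / propose / conditional-write / retry loop that keeps concurrent joins from claiming the same token.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Immutable snapshot of membership/ownership (hypothetical shape)."""
    version: int = 0
    owners: tuple = ()  # (token, host) pairs, kept sorted by token


class StateTable:
    """In-memory stand-in for the global system table (an assumption).

    compare_and_set mimics an LWT-style conditional update: the write is
    applied only if the stored version still equals the expected version.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._state = ClusterState()

    def read(self):
        with self._lock:
            return self._state

    def compare_and_set(self, expected_version, new_state):
        with self._lock:
            if self._state.version != expected_version:
                return False  # someone else changed the state first
            self._state = new_state
            return True


def pick_token(owners, ring_size):
    """Assumed policy: take the midpoint of the widest gap between tokens."""
    if not owners:
        return 0
    tokens = [t for t, _ in owners]
    # Start with the wrap-around gap from the last token back to the first.
    best_start = tokens[-1]
    best_gap = (tokens[0] - tokens[-1]) % ring_size or ring_size
    for a, b in zip(tokens, tokens[1:]):
        if b - a > best_gap:
            best_start, best_gap = a, b - a
    if best_gap < 2:
        raise RuntimeError("ring is full")
    return (best_start + best_gap // 2) % ring_size


def join(table, host, ring_size=1024):
    """Cohort-member side of a join: read, choose a token, CAS, retry."""
    while True:
        current = table.read()
        token = pick_token(current.owners, ring_size)
        proposed = ClusterState(
            version=current.version + 1,
            owners=tuple(sorted(current.owners + ((token, host),))),
        )
        if table.compare_and_set(current.version, proposed):
            return token  # in the proposal, gossip would now spread the state
```

Because every join is decided against the latest committed version, a losing proposer simply re-reads and picks again, so no two hosts can commit the same token regardless of how many join concurrently. The manual-mode variant from the description would batch several pending hosts into one proposed state before a single conditional write.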