Date: Tue, 6 Oct 2015 16:08:27 +0000 (UTC)
From: "Joshua McKenzie (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-10413) Replaying materialized view updates from commitlog after node decommission crashes Cassandra

     [ https://issues.apache.org/jira/browse/CASSANDRA-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshua McKenzie updated CASSANDRA-10413:
----------------------------------------
    Reviewer: Joel Knighton

> Replaying materialized view updates from commitlog after node decommission crashes Cassandra
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10413
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10413
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: T Jake Luciani
>            Priority: Critical
>             Fix For: 3.0.0 rc2
>
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test, runnable as
> {code}
> lein with-profile +trunk test :only cassandra.mv-test/mv-crash-subset-decommission
> {code}
> This test crashes/restarts nodes while decommissioning nodes. These actions are not coordinated.
>
> In [10164|https://issues.apache.org/jira/browse/CASSANDRA-10164], we introduced a change to re-apply materialized view updates on commitlog replay.
>
> Some nodes, upon restart, will crash in commitlog replay. They throw the "Trying to get the view natural endpoint on a non-data replica" runtime exception in getViewNaturalEndpoint. I added logging to getViewNaturalEndpoint to show the results of replicationStrategy.getNaturalEndpoints for the baseToken and viewToken.
>
> It can be seen that these problems occur when the baseEndpoints and viewEndpoints are identical but do not contain the broadcast address of the local node.
>
> For example, a node at 10.0.0.5 crashes on replay of a write whose base token and view token replicas are both [10.0.0.2, 10.0.0.4, 10.0.0.6]. It seems we try to guard against this by considering pendingEndpoints for the viewToken, but this does not appear to be sufficient.
>
> I've attached the system.logs for a test run with added logging. In the attached logs, n1 is at 10.0.0.2, n2 is at 10.0.0.3, and so on. 10.0.0.6/n5 is the decommissioned node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
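
The failing condition described in the report can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not Cassandra's actual getViewNaturalEndpoint: the class ViewEndpointSketch and the method pairedViewReplica are hypothetical names, addresses are plain strings, and data-center filtering and pending endpoints are left out. It also assumes that a base replica is paired with the view replica at the same list index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ViewEndpointSketch {

    /**
     * Hypothetical, simplified pairing of a base replica with a view replica.
     * Returns empty when the local node is not a replica for the base token,
     * which is the situation the report describes during commitlog replay.
     */
    static Optional<String> pairedViewReplica(String local,
                                              List<String> baseReplicas,
                                              List<String> viewReplicas) {
        // A node that is itself a view replica is paired with itself.
        if (viewReplicas.contains(local)) {
            return Optional.of(local);
        }
        // Otherwise pair by position in the base replica list (assumption).
        int baseIdx = baseReplicas.indexOf(local);
        if (baseIdx < 0 || baseIdx >= viewReplicas.size()) {
            // Local node owns neither token: no pairing can be derived.
            return Optional.empty();
        }
        return Optional.of(viewReplicas.get(baseIdx));
    }

    public static void main(String[] args) {
        // Replica set taken from the example in the report.
        List<String> replicas = Arrays.asList("10.0.0.2", "10.0.0.4", "10.0.0.6");

        // 10.0.0.5 replays a write it no longer owns: no pairing exists.
        System.out.println(pairedViewReplica("10.0.0.5", replicas, replicas)); // Optional.empty

        // 10.0.0.4 is both a base and a view replica: it pairs with itself.
        System.out.println(pairedViewReplica("10.0.0.4", replicas, replicas)); // Optional[10.0.0.4]
    }
}
```

With the replica set from the report, 10.0.0.5 is absent from both lists, so the model finds no pairing. According to the report, this is the point at which the real code throws the "non-data replica" runtime exception. The sketch returns an empty Optional only to make the condition visible; it does not suggest how the issue was or should be fixed.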