Date: Wed, 19 Aug 2015 18:20:46 +0000 (UTC)
From: "Joel Knighton (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

    [ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703510#comment-14703510 ]

Joel Knighton commented on CASSANDRA-10068:
-------------------------------------------

I haven't had any luck repro-ing this with a dtest - the timing issues are too difficult.

I've narrowed down the cause slightly (maybe?) through watching Jepsen tests that reproduce the issue. The null gossip entries are present in nodes that crash at a particular time (seems to be quite late) in the decommission of the node.
When started (after the decommission has finished without an error present), they have the null entry. A restart removes this null entry.

Hope this helps.

> Batchlog replay fails with exception after a node is decommissioned
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-10068
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Marcus Eriksson
>             Fix For: 3.0 beta 2
>
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions nodes throughout the test.
> At the conclusion of the test, a batchlog replay is initiated through nodetool and hits the following assertion due to a missing host ID: https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197
> A nodetool status on the node with failed batchlog replay shows the following entry for the decommissioned node:
> DN 10.0.0.5 ? 256 ? null rack1
> On the unaffected nodes, there is no entry for the decommissioned node as expected.
> There are occasional hits of the same assertions for logs in other nodes; it looks like the issue might occasionally resolve itself, but one node seems to have the errant null entry indefinitely.
> In logs for the nodes, this possibly unrelated exception also appears:
> java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
>         at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91) ~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]
> I have a running cluster with the issue on my machine; it is also repeatable.
> Nothing stands out in the logs of the decommissioned node (n4) for me. The logs of each node in the cluster are attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)