cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
Date Fri, 14 Aug 2015 11:16:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696884#comment-14696884 ]

Marcus Eriksson commented on CASSANDRA-10068:
---------------------------------------------

bq. java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
This happens because, while a node is decommissioning, it is still present in TokenMetadata,
so the nodes receiving the rows do not think they should own them. This patch fixes that:
https://github.com/krummas/cassandra/commits/marcuse/10068
DTest here: https://github.com/krummas/cassandra-dtest/commits/marcuse/10068
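
To illustrate the ownership mismatch described above, here is a minimal sketch using a toy token ring. It is not Cassandra's TokenMetadata or the code in the patch: the class RingOwnershipSketch, its methods, the node names and the tokens are all hypothetical, and it assumes one token per node and a single replica per key. It only shows that, as long as the leaving node is counted in the ring, the node that receives its rows is not yet computed as their owner.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only (hypothetical names); not Cassandra's TokenMetadata or replication strategy.
final class RingOwnershipSketch {

    // Returns the node owning keyToken: the node at the first ring token >= keyToken,
    // wrapping around to the lowest token when no such token exists.
    static String ownerOf(TreeMap<Long, String> ring, long keyToken) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(keyToken);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Returns a copy of the ring with every token belonging to the given node removed.
    static TreeMap<Long, String> without(TreeMap<Long, String> ring, String node) {
        TreeMap<Long, String> copy = new TreeMap<>(ring);
        copy.values().removeIf(node::equals);
        return copy;
    }

    // Five nodes with one made-up token each; n4 plays the decommissioning node.
    static TreeMap<Long, String> exampleRing() {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(10L, "n1");
        ring.put(20L, "n2");
        ring.put(30L, "n3");
        ring.put(40L, "n4");
        ring.put(50L, "n5");
        return ring;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = exampleRing();
        long keyToken = 35L;

        // Leaving node still counted in the ring: it is still computed as the owner.
        System.out.println("with leaving node:    " + ownerOf(ring, keyToken));

        // Leaving node excluded: the node receiving the rows is computed as the owner.
        System.out.println("without leaving node: " + ownerOf(without(ring, "n4"), keyToken));
    }
}
```

In this toy ring, n5 receives the rows for token 35 from n4 during the decommission, but an ownership check that still includes n4 tells n5 it is not a replica for them.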

[~jkni] I doubt this is related to the other errors you are seeing, so I will keep looking
into those, but could you rerun the test just to make sure it is not related?

> Batchlog replay fails with exception after a node is decommissioned
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-10068
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Marcus Eriksson
>             Fix For: 3.0.0 rc1
>
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions nodes throughout the test.
> At the conclusion of the test, a batchlog replay is initiated through nodetool and hits the following assertion due to a missing host ID: https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197
> A nodetool status on the node with failed batchlog replay shows the following entry for the decommissioned node:
> DN  10.0.0.5  ?          256          ?       null                                  rack1
> On the unaffected nodes, there is no entry for the decommissioned node as expected.
> The same assertion is occasionally hit in the logs of other nodes; it looks like the issue might occasionally resolve itself, but one node seems to keep the errant null entry indefinitely.
> In logs for the nodes, this possibly unrelated exception also appears:
> java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
> 	at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91) ~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]
> I have a running cluster with the issue on my machine; it is also repeatable.
> Nothing stands out in the logs of the decommissioned node (n4) for me. The logs of each node in the cluster are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
