cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
Date Mon, 28 Sep 2015 17:46:04 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Knighton updated CASSANDRA-10068:
--------------------------------------
    Description: 
This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions
nodes throughout the test.

At the conclusion of the test, a batchlog replay is initiated through nodetool and hits the
following assertion due to a missing host ID: https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197

A nodetool status on the node with failed batchlog replay shows the following entry for the
decommissioned node:
DN  10.0.0.5  ?          256          ?       null                                  rack1

On the unaffected nodes, there is no entry for the decommissioned node as expected.
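
As an illustration only (not part of the original report), the check described above could be scripted roughly as follows. This is a minimal sketch that scans captured nodetool status text for rows whose Host ID column reads "null". The function name, the sample text, and the assumption that Host ID is the second-to-last column with the rack last are all hypothetical, based solely on the single row quoted above.

```python
import re

# Matches a status row: a two-letter state code (U/D followed by N/L/J/M)
# and then the node address. Header and separator lines do not match.
ROW = re.compile(r"^([UD][NLJM])\s+(\S+)\s+")


def rows_with_null_host_id(status_output):
    """Return (state, address) pairs for rows whose Host ID column is 'null'."""
    flagged = []
    for line in status_output.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, separator, or blank line
        fields = line.split()
        # Assumption: Host ID is the second-to-last whitespace-separated
        # field and the rack is the last, as in the row quoted in the report.
        if len(fields) >= 2 and fields[-2] == "null":
            flagged.append((match.group(1), match.group(2)))
    return flagged


# Sample input: the first row is made up for contrast; the second row is
# the entry quoted in the report.
sample = """\
UN  10.0.0.1  ?          256          ?       hypothetical-host-id                  rack1
DN  10.0.0.5  ?          256          ?       null                                  rack1
"""

print(rows_with_null_host_id(sample))  # prints [('DN', '10.0.0.5')]
```

Running a check like this against the output captured from each node would show which nodes still carry the errant entry.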

There are occasional hits of the same assertions for logs in other nodes; it looks like the
issue might occasionally resolve itself, but one node seems to have the errant null entry
indefinitely.

In logs for the nodes, this possibly unrelated exception also appears:
java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
	at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91)
~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]

I have a running cluster with the issue on my machine; it is also
repeatable.

Nothing stands out in the logs of the decommissioned node (n4) for me. The logs of each node
in the cluster are attached.



  was:
This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions
nodes throughout the test.

At the conclusion of the test, a batchlog replay is initiated through nodetool and hits the
following assertion due to a missing host ID: https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197

A nodetool status on the node with failed batchlog replay shows the following entry for the
decommissioned node:
DN  10.0.0.5  ?          256          ?       null                                  rack1

On the unaffected nodes, there is no entry for the decommissioned node as expected.

There are occasional hits of the same assertions for logs in other nodes; it looks like the
issue might occasionally resolve itself, but one node seems to have the errant null entry
indefinitely.

In logs for the nodes, this possibly unrelated exception also appears:
java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
	at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91)
~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]

I have a running cluster with the issue on my machine; it is also repeatable.

Nothing stands out in the logs of the decommissioned node (n4) for me. The logs of each node
in the cluster are attached.




> Batchlog replay fails with exception after a node is decommissioned
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-10068
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Branimir Lambov
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that crashes and
decommissions nodes throughout the test.
> At the conclusion of the test, a batchlog replay is initiated through nodetool and hits
the following assertion due to a missing host ID: https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197
> A nodetool status on the node with failed batchlog replay shows the following entry for
the decommissioned node:
> DN  10.0.0.5  ?          256          ?       null                                  rack1
> On the unaffected nodes, there is no entry for the decommissioned node as expected.
> There are occasional hits of the same assertions for logs in other nodes; it looks like
the issue might occasionally resolve itself, but one node seems to have the errant null entry
indefinitely.
> In logs for the nodes, this possibly unrelated exception also appears:
> java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
> 	at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91)
~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]
> I have a running cluster with the issue on my machine; it is
also repeatable.
> Nothing stands out in the logs of the decommissioned node (n4) for me. The logs of each
node in the cluster are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
