cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10231) Null status entries on nodes that crash during decommission of a different node
Date Tue, 29 Sep 2015 10:00:06 GMT


Stefania commented on CASSANDRA-10231:

I noticed two major differences between the dtest logs and the attached logs:

* After n5 restarts, it logs no GOSSIP information in {{applyStateLocally}}, so the GOSSIP
information for the decommissioned node is missing entirely.
* The tokens for the decommissioned node are still present even though they were deleted
before the crash. This could be explained by the crash if the commit log was not replayed.

Ultimately, it shouldn't matter if the tokens are still there after restarting: if we had
received a GOSSIP message with status LEFT, we should have been able to clear them. We would
need a full TRACE log to work out why the GOSSIP entries are missing. Either the other nodes
are not gossiping about the decommissioned node (unlikely, since the expiry time is 3 days),
or for some reason node 5 ignores the GOSSIP entry for the decommissioned node.
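
To make the expectation above concrete, here is a minimal toy sketch of the behaviour I would expect. It is not Cassandra's actual gossip code: the class {{GossipLeftSketch}}, its methods, and the endpoint name "decommissioned-node" are all hypothetical and only model the idea that stale tokens reloaded after a crash should be dropped once a LEFT status arrives via gossip.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of one node's local view. Hypothetical; not Cassandra's real classes. */
public class GossipLeftSketch {
    public enum Status { NORMAL, LEAVING, LEFT }

    // endpoint -> tokens reloaded from local storage (may be stale after a crash)
    private final Map<String, Set<Long>> tokens = new HashMap<>();
    // endpoint -> last status learned via gossip (no entry means "null status")
    private final Map<String, Status> statuses = new HashMap<>();

    /** Simulates a restart that reloads tokens which were never durably deleted. */
    public void loadSavedTokens(String endpoint, Set<Long> saved) {
        tokens.put(endpoint, new HashSet<>(saved));
    }

    /** Hypothetical stand-in for applying a gossiped state locally. */
    public void applyState(String endpoint, Status status) {
        statuses.put(endpoint, status);
        if (status == Status.LEFT) {
            // A LEFT status should clear any tokens still held for that endpoint.
            tokens.remove(endpoint);
        }
    }

    public boolean ownsTokens(String endpoint) {
        return tokens.containsKey(endpoint);
    }

    public Status statusOf(String endpoint) {
        return statuses.get(endpoint);
    }

    public static void main(String[] args) {
        GossipLeftSketch n5 = new GossipLeftSketch();
        String gone = "decommissioned-node"; // assumed name, for illustration only

        // After the crash and restart: stale tokens, and no gossip state yet.
        n5.loadSavedTokens(gone, new HashSet<>(Arrays.asList(1L, 2L)));
        System.out.println("before gossip: tokens=" + n5.ownsTokens(gone)
                + " status=" + n5.statusOf(gone));

        // Once a LEFT status is received and applied, the tokens should go away.
        n5.applyState(gone, Status.LEFT);
        System.out.println("after LEFT:    tokens=" + n5.ownsTokens(gone)
                + " status=" + n5.statusOf(gone));
    }
}
```

In this model, the state seen in the attached logs corresponds to {{applyState}} never being called for the decommissioned node, which is why the question is whether the entry is not being sent or is being ignored.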

I tried running my dtest on the same commit but could not reproduce this. However, there
is one big difference: the dtest does not involve any hinting or streaming of data, so I
probably need to install Jepsen.

I would suggest fixing the MV issue that is preventing us from running on the latest 3.0,
or at a minimum running on a commit where hinting works fine and the batch log can be replayed.
Also, we will probably need to run with DEBUG=true, if not TRACE=true.

Is there anything I can do to help track down the MV issue?

> Null status entries on nodes that crash during decommission of a different node
> -------------------------------------------------------------------------------
>                 Key: CASSANDRA-10231
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Stefania
>             Fix For: 3.0.0 rc2
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
> This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions nodes throughout the test.
> In a 5 node cluster, if a node crashes at a certain point (unknown) during the decommission of a different node, it may start with a null entry for the decommissioned node like so:
> DN ? 256 ? null rack1
> This entry does not get updated/cleared by gossip. This entry is removed upon a restart of the affected node.
> This issue is further detailed in ticket 10068.

This message was sent by Atlassian JIRA
