cassandra-commits mailing list archives

From "Joel Knighton (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10413) Replaying materialized view updates from commitlog after node decommission crashes Cassandra
Date Tue, 29 Sep 2015 19:23:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935687#comment-14935687 ]

Joel Knighton edited comment on CASSANDRA-10413 at 9/29/15 7:22 PM:
--------------------------------------------------------------------

After rerunning these tests with logging of pending endpoints for the base and view tokens,
it appears that there are no pending endpoints for these tokens at the time of the crash.

It's worth noting that some nodes sometimes hit errors like:
{code}
ERROR [SharedPool-Worker-3] 2015-09-29 19:05:10,263 FailureDetector.java:216 - unknown endpoint /10.0.0.3
{code}

on commitlog replay, when 10.0.0.3 should be a healthy member of the cluster. This is likely
a separate issue, but it suggests that the cluster's gossip state may be unhealthy.


was (Author: jkni):
After rerunning these tests with logging of pending endpoints for the base and view tokens,
it appears that there are no pending endpoints for these tokens at the time of the crash.

It's worth noting that some nodes sometimes hit errors like:
{code}
ERROR [SharedPool-Worker-3] 2015-09-29 19:05:10,263 FailureDetector.java:216 - unknown endpoint /10.0.0.3
{code}

on commitlog replay, when 10.0.0.3 should be a healthy member of the cluster. This is a separate
issue, but it suggests that the cluster's gossip state may be unhealthy.

> Replaying materialized view updates from commitlog after node decommission crashes Cassandra
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10413
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10413
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: T Jake Luciani
>            Priority: Critical
>             Fix For: 3.0.0 rc2
>
>         Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test, runnable as
> {code}
> lein with-profile +trunk test :only cassandra.mv-test/mv-crash-subset-decommission
> {code}
> This test crashes/restarts nodes while decommissioning nodes. These actions are not coordinated.
> In [10164|https://issues.apache.org/jira/browse/CASSANDRA-10164], we introduced a change
> to re-apply materialized view updates on commitlog replay.
> Some nodes, upon restart, will crash in commitlog replay. They throw the "Trying to get
> the view natural endpoint on a non-data replica" runtime exception in getViewNaturalEndpoint.
> I added logging to getViewNaturalEndpoint to show the results of replicationStrategy.getNaturalEndpoints
> for the baseToken and viewToken.
> It can be seen that these problems occur when the baseEndpoints and viewEndpoints are
> identical but do not contain the broadcast address of the local node.
> For example, a node at 10.0.0.5 crashes on replay of a write whose base token and view
> token replicas are both [10.0.0.2, 10.0.0.4, 10.0.0.6]. It seems we try to guard against this
> by considering pendingEndpoints for the viewToken, but this does not appear to be sufficient.
> I've attached the system.logs for a test run with added logging. In the attached logs,
> n1 is at 10.0.0.2, n2 is at 10.0.0.3, and so on. 10.0.0.6/n5 is the decommissioned node.
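
The failure condition described above can be illustrated with a minimal sketch. This is not the Cassandra source: the class ViewEndpointSketch, the method viewNaturalEndpoint, and the use of plain strings for addresses are hypothetical, and the sketch assumes the lookup pairs the local node's position among the base-token replicas with the replica at the same position for the view token. Under that assumption, a node that is absent from the base replica list has no position to pair and the lookup throws, which matches the 10.0.0.5 example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and types are not taken from Cassandra.
final class ViewEndpointSketch {

    // Assumed pairing rule: find the local node among the base-token replicas
    // and return the view-token replica at the same index. Both lists are
    // assumed to have the same length.
    static String viewNaturalEndpoint(String local,
                                      List<String> baseEndpoints,
                                      List<String> viewEndpoints) {
        int baseIdx = baseEndpoints.indexOf(local);
        if (baseIdx < 0) {
            // The local node is not a replica for the base token, so there is
            // nothing to pair. The message is the one quoted in the report.
            throw new RuntimeException(
                "Trying to get the view natural endpoint on a non-data replica");
        }
        return viewEndpoints.get(baseIdx);
    }

    public static void main(String[] args) {
        // Replica set from the report; base and view replicas are identical.
        List<String> replicas = Arrays.asList("10.0.0.2", "10.0.0.4", "10.0.0.6");

        // A node that is a base replica resolves to a paired view replica.
        System.out.println(viewNaturalEndpoint("10.0.0.4", replicas, replicas));

        // 10.0.0.5 is not in the list, so the lookup throws.
        try {
            viewNaturalEndpoint("10.0.0.5", replicas, replicas);
        } catch (RuntimeException e) {
            System.out.println("replay would fail: " + e.getMessage());
        }
    }
}
```

In this sketch the exception depends only on whether the local address appears in the base replica list, which is why consulting pending endpoints for the view token alone would not prevent it.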



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
