cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10231) Null status entries on nodes that crash during decommission of a different node
Date Sat, 19 Sep 2015 22:16:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877334#comment-14877334 ]

Stefania commented on CASSANDRA-10231:
--------------------------------------

I've attached [a patch|https://github.com/stef1927/cassandra/commits/10231-3.0] for 3.0 that
fixes the dtest. [~jkni], would you mind trying the patch with your Jepsen test? It also
logs a message for every GOSSIP entry, so the log files may get a bit bigger, but we will have
helpful information should the patch not work.

The patch fixes the exception below, which causes any other GOSSIP properties that
{{onChange}} would apply after STATUS to be skipped:

{code}
ERROR [GossipStage:2] 2015-09-19 14:14:31,007 CassandraDaemon.java:195 - Exception in thread Thread[GossipStage:2,5,main]
java.lang.NullPointerException: null
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) ~[na:1.8.0_60]
        at org.apache.cassandra.hints.HintsCatalog.get(HintsCatalog.java:85) ~[main/:na]
        at org.apache.cassandra.hints.HintsService.excise(HintsService.java:263) ~[main/:na]
        at org.apache.cassandra.service.StorageService.excise(StorageService.java:2166) ~[main/:na]
        at org.apache.cassandra.service.StorageService.excise(StorageService.java:2178) ~[main/:na]
        at org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:2083) ~[main/:na]
        at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1672) ~[main/:na]
        at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1220) ~[main/:na]
        at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:1202) ~[main/:na]
        at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1159) ~[main/:na]
        at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[main/:na]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[main/:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]
{code}

According to the additional GOSSIP trace message that I've added, host id was one such property.
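
To illustrate the failure mode, here is a minimal standalone sketch. It is not Cassandra's code: {{GossipSketch}}, {{applyStates}}, {{catalog}} and the {{guard}} flag are hypothetical names, and the null check is only one possible guard, not necessarily what the patch does. It shows that {{ConcurrentHashMap.get}} rejects a null key, and that an exception thrown while handling STATUS prevents the properties after it (here HOST_ID) from being applied.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the failure; not Cassandra's actual classes.
class GossipSketch {
    // Stands in for a catalog keyed by host id, like the map at the top of the trace.
    static final Map<UUID, String> catalog = new ConcurrentHashMap<>();

    // Applies each state in insertion order and returns the keys that were applied.
    // With guard == false, a null host id reaches ConcurrentHashMap.get, which throws.
    static List<String> applyStates(Map<String, String> states, UUID hostId, boolean guard) {
        List<String> applied = new ArrayList<>();
        try {
            for (Map.Entry<String, String> e : states.entrySet()) {
                if (e.getKey().equals("STATUS") && e.getValue().startsWith("LEFT")) {
                    if (!guard || hostId != null) {
                        catalog.get(hostId); // NullPointerException when hostId is null
                    }
                }
                applied.add(e.getKey());
            }
        } catch (NullPointerException ex) {
            // In the trace the exception escapes the gossip thread; here we only stop the loop.
        }
        return applied;
    }

    public static void main(String[] args) {
        Map<String, String> states = new LinkedHashMap<>();
        states.put("STATUS", "LEFT");
        states.put("HOST_ID", "some-id");
        System.out.println(applyStates(states, null, false)); // prints []
        System.out.println(applyStates(states, null, true));  // prints [STATUS, HOST_ID]
    }
}
```

Without the guard, neither STATUS nor HOST_ID is recorded as applied; with it, both are.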

> Null status entries on nodes that crash during decommission of a different node
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10231
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10231
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Stefania
>             Fix For: 3.0.x
>
>
> This issue is reproducible through a Jepsen test of materialized views that crashes and decommissions nodes throughout the test.
> In a 5 node cluster, if a node crashes at a certain point (unknown) during the decommission of a different node, it may start with a null entry for the decommissioned node like so:
> DN 10.0.0.5 ? 256 ? null rack1
> This entry does not get updated/cleared by gossip. This entry is removed upon a restart of the affected node.
> This issue is further detailed in ticket [10068|https://issues.apache.org/jira/browse/CASSANDRA-10068].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
