cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7816) Duplicate DOWN/UP Events Pushed with Native Protocol
Date Thu, 19 Mar 2015 05:38:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368543#comment-14368543 ]

Stefania commented on CASSANDRA-7816:
-------------------------------------

Unfortunately I could not reproduce it. I tried with ccm clusters of 5, 10 and 15 nodes.
I was not sure whether you were performing any specific operations, so I tried restarting nodes
and adding new ones, but the status reported by nodetool, on all hosts, was always correct ("UN").

However, by code inspection, the problem described could happen if a node fails to reply to
an echo message after it has gossiped its status as alive. Maybe the socket listening thread
was slow to start due to machine overload, or maybe some other cluster property could explain
it. I did not spend much time trying to understand what I could not reproduce. Instead,
I went ahead and created a new delta patch that should fix it: https://github.com/stef1927/cassandra/tree/7816-2.

This patch reverts the part of the code that I think is causing the issue
in favor of a more conservative approach that simply stores the last state reported to the
client in {{Server.EventNotifier}} and does not interfere with the existing {{markAlive()}}
logic in {{Gossiper}}. It does introduce a new problem, in that the additional map may consume
extra memory, but we can address that during code review if this patch works.
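
To make the idea concrete, here is a minimal sketch of that kind of "remember the last state reported" filter. It is not the code from the patch: the class name {{LastStatusFilter}}, the {{Status}} enum and both method names are hypothetical, and it only assumes that the notifier can look up the last status it pushed for an endpoint before pushing a new one.

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; not the actual Server.EventNotifier code.
class LastStatusFilter {
    enum Status { UP, DOWN }

    // Last status pushed to clients, keyed by endpoint. This is the
    // additional map whose memory footprint is mentioned above.
    private final ConcurrentMap<InetAddress, Status> lastReported =
            new ConcurrentHashMap<>();

    // Records the new status and returns true only if it differs from the
    // status previously recorded for this endpoint (or if none was recorded).
    boolean shouldNotify(InetAddress endpoint, Status status) {
        Status previous = lastReported.put(endpoint, status);
        return previous != status;
    }

    // Drops the entry for an endpoint, for example when it leaves the
    // cluster, so the map does not keep growing. (Assumed cleanup hook.)
    void forget(InetAddress endpoint) {
        lastReported.remove(endpoint);
    }
}
```

With a filter like this, a second UP for the same endpoint is swallowed, while an UP that follows a DOWN is still pushed. Removing entries for departed nodes is one possible way to bound the extra memory.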

Would you mind applying this new patch and checking whether it solves the problem? If not, could you please
give me more information on your cluster or give me access to it? You could also try to reproduce
it in TRACE mode and send me the logs.

> Duplicate DOWN/UP Events Pushed with Native Protocol
> ----------------------------------------------------
>
>                 Key: CASSANDRA-7816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: API
>            Reporter: Michael Penick
>            Assignee: Stefania
>            Priority: Minor
>             Fix For: 2.1.4, 2.0.14
>
>         Attachments: 7816-v2.0.txt, tcpdump_repeating_status_change.txt, trunk-7816.txt
>
>
> Added "MOVED_NODE" as a possible type of topology change and also specified that it is
> possible to receive the same event multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
