geode-issues mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEODE-6244) Healthy member kicked out by Sick member when final-check fails
Date Mon, 04 Feb 2019 17:42:00 GMT

    [ https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760056#comment-16760056 ]

ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------

Commit f8c69d2b647edf7b3e9f93446a39e381fe3b70d9 in geode's branch refs/heads/develop from
Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=f8c69d2 ]

GEODE-6244 Healthy member kicked out by Sick member when final-check fails

The initial fix caused a problem that prevented election of a new
membership coordinator in a certain case.  The case was a view
with nodes [A, B, C, D, E] where C was the coordinator.  Node A had
crashed and the crash had been detected by B.  Node C then left the
cluster, sending a Leave message to B.  B's JoinLeave did not know about
the HealthMonitor's decision that A was crashed and did not become the
new coordinator.

This commit makes B's JoinLeave pay attention to the crashed-member set
in the HealthMonitor when deciding whether to become the membership
coordinator for the cluster.
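
The election case described above can be illustrated with a minimal sketch. It
is not the code from this commit: the class and method names are hypothetical,
and it assumes the next coordinator is simply the first member in view order
that is neither leaving nor considered crashed.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

class CoordinatorElectionSketch {
    /**
     * Returns the first member, in view order, that is neither leaving nor
     * considered crashed. Hypothetical helper, not Geode's JoinLeave code.
     */
    static Optional<String> nextCoordinator(List<String> view,
                                            Set<String> leaving,
                                            Set<String> crashed) {
        return view.stream()
                .filter(m -> !leaving.contains(m))
                .filter(m -> !crashed.contains(m))
                .findFirst();
    }

    /** True when self is the member that should take over as coordinator. */
    static boolean shouldBecomeCoordinator(String self, List<String> view,
                                           Set<String> leaving, Set<String> crashed) {
        return nextCoordinator(view, leaving, crashed).map(self::equals).orElse(false);
    }

    public static void main(String[] args) {
        List<String> view = List.of("A", "B", "C", "D", "E");
        Set<String> leaving = Set.of("C"); // C sent a Leave message to B

        // B ignores the crashed-member set, so it defers to A.
        System.out.println(shouldBecomeCoordinator("B", view, leaving, Set.of()));
        // B consults the crashed-member set (A crashed), so it takes over.
        System.out.println(shouldBecomeCoordinator("B", view, leaving, Set.of("A")));
    }
}
```

With an empty crashed-member set B waits for A, which never acts, so no
coordinator is elected. Passing the set held by the HealthMonitor lets B skip A.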


> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0
>            Reporter: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> I observed this in a user's logs & can't include artifacts:  Clients were herding
> to one server when another server was being slow to return results.  The clients
> caused the server to run out of file descriptors because the descriptor limit was
> set pretty low.  When that happened the server had trouble forming an outgoing
> tcp/ip connection to another server.  It tried using MembershipManager.verifyMember()
> which also failed to connect to the other server.  When that happened it sent a
> RemoveMessage to the locators and several of the other servers, including the one
> it couldn't connect to.  That server immediately shut itself down.
>
> MembershipManager.verifyMember() is documented to only initiate suspect processing
> on the target, not initiate immediate removal.  This is supposed to be done so that
> some other process (i.e., the membership coordinator) will do additional checking
> on the suspect in case the initiator is itself sick.  That was the case in this
> situation.
>
>   serverA unable to connect to serverB
>   serverA performs tcp/ip check in verifyMember
>   serverA's tcp/ip check fails (it's out of file descriptors, duh)
>   serverA sends RemoveMember message to locators and serverB
>   serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be
>
>   serverA unable to connect to serverB
>   serverA performs tcp/ip check in verifyMember
>   serverA's tcp/ip check fails (it's out of file descriptors, duh)
>   serverA sends SuspectMember message to locators & other servers
>   coordinator performs tcp/ip and heartbeat check on the suspect
>   coordinator determines suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing _true_ for
> the _initiateRemoval_ parameter.  It should be passing _false_.
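
The two sequences in the description differ only in the value of the
initiateRemoval flag. The following is a minimal sketch of that difference,
not Geode's implementation: the class, enum and method names are hypothetical,
and the coordinator's tcp/ip and heartbeat checks are collapsed into a single
boolean.

```java
class SuspectFlowSketch {
    /** Message the initiator sends after its own connectivity check. */
    enum Message { NONE, SUSPECT, REMOVE }

    /** Hypothetical stand-in for the decision made after the initiator's check. */
    static Message afterInitiatorCheck(boolean checkSucceeded, boolean initiateRemoval) {
        if (checkSucceeded) {
            return Message.NONE;
        }
        // A failed check may mean the initiator itself is sick.
        return initiateRemoval ? Message.REMOVE : Message.SUSPECT;
    }

    /** Whether the target stays in the cluster, given the coordinator's final check. */
    static boolean targetSurvives(Message sent, boolean coordinatorCheckSucceeded) {
        switch (sent) {
            case REMOVE:
                return false; // removed with no second opinion
            case SUSPECT:
                return coordinatorCheckSucceeded; // coordinator decides
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        boolean serverACheck = false;    // serverA is out of file descriptors
        boolean coordinatorCheck = true; // serverB is actually healthy

        Message observed = afterInitiatorCheck(serverACheck, true);
        Message intended = afterInitiatorCheck(serverACheck, false);

        System.out.println(observed + " -> survives=" + targetSurvives(observed, coordinatorCheck));
        System.out.println(intended + " -> survives=" + targetSurvives(intended, coordinatorCheck));
    }
}
```

With the flag set to true the healthy serverB is removed on the word of the
sick serverA. With false, the coordinator's own check keeps serverB in the
cluster.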



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
