geode-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (GEODE-77) Replace JGroups 2.2.9
Date Thu, 01 Oct 2015 17:28:26 GMT


ASF subversion and git services commented on GEODE-77:

Commit e3e794d7bd47f3573cb966ba08f16f0a5a385d99 in incubator-geode's branch refs/heads/feature/GEODE-77
from [~bschuchardt]
[;h=e3e794d ]

GEODE-77 fixes for network failure handling & miscellaneous unit test failures

Network failure handling was not properly shutting down TCPConduit, leaving threads hanging
trying to send messages.  The shutdown code was calling Services.emergencyClose too soon,
and the recursion back into GMSMembershipManager shutdown code caused some problems, too.

GMSHealthMonitor was continually switching between two members to watch even though it had
already sent suspect messages about them and had received no response.  I added a collection
of IDs that are in this state and modified setNextNeighbor to avoid reusing them.

GMSHealthMonitor was sending removeMember messages to the locators and a random member, but
for some reason this wasn't resolving a network partition fast enough.  I've disabled that
behavior for now, sending the messages to all members.  This needs to be revisited because
sending the message to all members is not scalable.

GMSHealthMonitor had some issues with initiating removals when it was in the process of shutting
down.  I added some isStopping checks to fix this.

MembershipJUnitTest and StatRecorderJUnitTest were failing in gradle runs but not under Eclipse
because my Eclipse launch configuration wasn't set to enable assertions.  After fixing that
I found a number of problems with these tests and fixed them.

Multicast tests are now implemented in GMSMembershipManager and JGroupsMessenger.  This leverages
the ping/pong messaging added for the quorum checker.

GMSJoinLeave was too slow in sending out new views when there were process failures.  I added
code to inform the reply processor if there are queued leave/remove requests so it wouldn't
wait for these, and also added similar checks in the removeHealthyMembers method (which performs
checks on members using the HealthMonitor).

When there is a network partition GMSJoinLeave will now send a NetworkPartitionMessage to
other members to prod them along in figuring out that they should shut down.

During a forced-disconnect there can be a lot of warning/fatail log messages.  If there are
alert listeners in the system this can create a lot of network traffic and extra work figuring
out whether the receiver is even there or not.  GMSMembershipManager now throws away outbound
alerts when a forced-disconnect is in process.

Some of the forced-disconnect shutdown processing has been moved out of the membershp manager's
DisconnectThread that was introduced with the quorum checker in order to set the shutdown
cause, etc, as quickly as possible.

I noticed a lot of TXState log messages at debug level with a Throwable stack trace.  There
was no comment saying why this was being done so I commented it out.

JGroups logging level is now set to FATAL by default.  The default log level was a problem
during network partitions because each message send was causing a dire warning to be logged.

I observed a number of threads being left behind when a locator failed to start during auto-reconnect
testing.  I added a unit test to LocatorJUnitTest for this and fixed the leaks.

> Replace JGroups 2.2.9
> ---------------------
>                 Key: GEODE-77
>                 URL:
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Blocker
>             Fix For: 1.0.0-incubating
>         Attachments: GEODE-MembershipManagerFunctionalSpecification-130715-1604-29054.pdf
> The JGroups 2.2.9 sources that are currently included in Geode must be replaced in order
for Geode to leave incubation.  A wiki document has been created to investigate alternatives.

This message was sent by Atlassian JIRA

View raw message