geode-issues mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (GEODE-1542) shared/unordered tcp/ip connection times out, initiating suspicion
Date Wed, 15 Jun 2016 00:08:49 GMT


ASF subversion and git services commented on GEODE-1542:

Commit 33ceb371554a13c7643ddaf9488ffa83963de1e7 in incubator-geode's branch refs/heads/feature/GEODE-1372
from [~bschuchardt]
[;h=33ceb37 ]

GEODE-1542 shared/unordered tcp/ip connection times out, initiating suspicion

This disables timing out of shared/unordered TcpConduit connections.  We don't
want them to time out because we are using them to initiate suspect processing
on other members.

The ticket also pointed out a problem with the "final check" mechanism in
the health monitor.  I tracked that down to improper use of SocketCreator
to create the server-socket in GMSHealthMonitor.  It was creating an SSL
socket if SSL was enabled, but the client side of the check uses non-SSL
sockets.  I changed the server to use non-SSL sockets as well, since no
useful information is sent over the final-check TCP/IP connections and they
need to be lightweight and fast.
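
As a rough illustration of the point about matching socket types, here is a
minimal, self-contained sketch in which both ends of a probe use plain
java.net sockets.  It is not the GMSHealthMonitor code: the class name, the
probeOnce helper, the loopback address and the choice of 0x7B as the reply
are all assumptions made for the example (the ticket only says that a valid
reply is 0 or 0x7B).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch, not Geode code: a plain (non-SSL) probe on loopback.
class PlainFinalCheckSketch {
  // Assumed reply value; the ticket only states that 0 and 0x7B are valid.
  static final int REPLY = 0x7B;

  /** Answers one probe on a plain server socket and returns the byte the client read. */
  static int probeOnce(int timeoutMillis) throws Exception {
    // Port 0 asks the OS for any free port; both ends are plain sockets.
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      Thread responder = new Thread(() -> {
        try (Socket peer = server.accept();
             OutputStream out = peer.getOutputStream()) {
          out.write(REPLY); // single status byte, no TLS handshake involved
          out.flush();
        } catch (IOException e) {
          // The client side will observe EOF or a read timeout instead.
        }
      });
      responder.start();

      try (Socket client = new Socket()) {
        client.connect(
            new InetSocketAddress(server.getInetAddress(), server.getLocalPort()),
            timeoutMillis);
        client.setSoTimeout(timeoutMillis); // bound the wait for the status byte
        int status = client.getInputStream().read();
        responder.join(timeoutMillis);
        return status;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("status byte: " + probeOnce(2000));
  }
}
```

If the server side were created as an SSL socket instead, the plain client
would read whatever the TLS layer sends first rather than the expected
status byte.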

While looking at logs I also found that the heartbeat request sent at the
beginning of a final-check had a request-ID even though it's not waiting
for a response.  That causes processing of the response to do more work
than necessary so I changed it to remove the request-ID from the message.

> shared/unordered tcp/ip connection times out, initiating suspicion
> ------------------------------------------------------------------
>                 Key: GEODE-1542
>                 URL:
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Bruce Schuchardt
>             Fix For: 1.0.0-incubating.M3
> I recently diagnosed a membership failure that was initiated when one member
> (N) timed out its shared/unordered tcp/ip connection to another member (M).
> Member M initiated suspect processing that led to kicking member N out of the
> system.  We need to either stop timing out shared/unordered connections or
> have an orderly shutdown mechanism so that we don't initiate suspect
> processing.
> The final-check that M performed showed something odd.  Member N never
> logged that it processed a final check from M.  Member M logged that it had
> connected to N and read a status byte from it.  The byte had the value 21,
> which is not a valid response to a final check (it should be 0 or 0x7B).
> {noformat}
> Received [21, ent(clientgemfire3_ent_19225:19225)<ec><v1>:1028]
> {noformat}
> I verified that M used the correct tcp/ip port for N, so this is very odd
> and needs to be investigated.
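
The status-byte rule quoted in the ticket (0 or 0x7B is valid, 21 is not) can
be sketched as a small check.  The class and method names below are
hypothetical and not taken from Geode.  The hint about TLS is an
interpretation, not something the ticket states: 21 is the TLS record content
type for an alert, which would be consistent with an SSL server socket
answering a non-SSL client.

```java
// Hypothetical helper, not Geode code: classifies a final-check status byte.
class FinalCheckStatus {
  // TLS record content type "alert" (a protocol constant, not a Geode value).
  static final int TLS_ALERT_CONTENT_TYPE = 21;

  /** Per the ticket, a valid final-check response is 0 or 0x7B. */
  static boolean isValid(int statusByte) {
    return statusByte == 0x00 || statusByte == 0x7B;
  }

  /** Returns a short description suitable for a log line. */
  static String describe(int statusByte) {
    if (isValid(statusByte)) {
      return "valid final-check response: " + statusByte;
    }
    if (statusByte == TLS_ALERT_CONTENT_TYPE) {
      // Assumption: the peer may be speaking TLS on a socket we treat as plain.
      return "invalid response 21: possibly a TLS alert record";
    }
    return "invalid final-check response: " + statusByte;
  }

  public static void main(String[] args) {
    System.out.println(describe(21));
    System.out.println(describe(0x7B));
  }
}
```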

This message was sent by Atlassian JIRA
