geode-issues mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEODE-7031) Attempts to send messages to alert listeners delays network partition detection
Date Fri, 02 Aug 2019 18:38:00 GMT

    [ https://issues.apache.org/jira/browse/GEODE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899131#comment-16899131 ]

ASF subversion and git services commented on GEODE-7031:
--------------------------------------------------------

Commit a10af1ba201161c8cf3f8003a12c187728e2874e in geode's branch refs/heads/develop from
Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=a10af1b ]

GEODE-7031 Attempts to send messages to alert listeners delays network partition detection

Decrease the socket-formation timeout for Alert listeners.  Generally
we'll already have a connection to an alert listener so the decreased
timeout won't be used.  In times where there are network problems,
though, we often have to create a new tcp/ip connection to send an alert
and we don't want these to stall for too long.
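
As a rough illustration of the idea in the commit message, the sketch below
picks a shorter connect timeout when the outgoing message is an alert. It is
not the code from the commit: the class name AlertConnectTimeout, the methods
connectTimeoutMillis and open, and the choice of one member-timeout for alerts
are all assumptions made for this example. Only the 6 * member-timeout figure
comes from the issue description.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Geode: shows one way to bound
// connection formation more tightly for alert messages.
final class AlertConnectTimeout {
    // From the issue description: each attempt takes 6 * member-timeout.
    static final int NORMAL_MULTIPLIER = 6;

    // Assumption for this sketch: alerts get a single member-timeout.
    static int connectTimeoutMillis(int memberTimeoutMillis, boolean alertMessage) {
        int normal = NORMAL_MULTIPLIER * memberTimeoutMillis;
        return alertMessage ? Math.min(normal, memberTimeoutMillis) : normal;
    }

    // Opens a TCP connection, giving up after timeoutMillis.
    // Socket.connect throws SocketTimeoutException when the timeout expires.
    static Socket open(InetSocketAddress peer, int timeoutMillis) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(peer, timeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the descriptor on failure
            throw e;
        }
    }
}
```

With a member-timeout of 5000 ms, connectTimeoutMillis(5000, false) yields the
30 second stall described in the issue, while connectTimeoutMillis(5000, true)
yields 5 seconds under the assumption above. An existing connection to the
alert listener would bypass open entirely, which matches the note that the
decreased timeout is usually not used.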


> Attempts to send messages to alert listeners delays network partition detection
> -------------------------------------------------------------------------------
>
>                 Key: GEODE-7031
>                 URL: https://issues.apache.org/jira/browse/GEODE-7031
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> In a number of recent regression test runs in AWS we have seen network partition detection
> tests fail to detect the partition in a reasonable amount of time.  Logs show membership
> services attempting to send alerts to other processes that are no longer reachable.  Each
> attempt takes 6 * the member-timeout setting - that's 30 seconds for each attempt.  It would
> be nice to have a different connection-formation timeout for something like this since alert
> notification is built into the logging system that membership services have to use.  Since
> the alert system is also dependent on membership services functioning properly this creates
> a circular dependency that has historically caused hangs and delays such as the one described
> here.
> {noformat}
> [debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5> tid=0xc3] Sending (Alert "Unable to send message to 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1 peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via tcp/ip
> [debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5> tid=0xc3] created PendingConnection org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630 created by Geode Failure Detection thread 5
> [info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because: java.net.SocketTimeoutException
> [debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5> tid=0xc3] Giving up connecting to alert listener 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
