ignite-issues mailing list archives

From "Dmitry Karachentsev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-8985) Node segmented itself after connRecoveryTimeout
Date Fri, 13 Jul 2018 08:49:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542723#comment-16542723
] 

Dmitry Karachentsev commented on IGNITE-8985:
---------------------------------------------

Here are a few things that caused this behavior.
1. One node was killed.
2. The node previous to it in the ring could not connect and tried to move on to the node next after the killed one.
3. Since the failure detection timeout is 60 secs, the connection check frequency is
60 / 3 = 20 secs. That means the previous node is treated as failed only if no
message arrives within 20 secs. On the other hand, the connection recovery timeout is 10 secs.
4. Another issue is that each node has two loopback addresses, and one of them, 172.17.0.1:47500,
is not recognized as localhost and was checked. In other words, the node checked a connection to
itself.
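The timing mismatch in step 3 can be sketched as a minimal check. This is a sketch only: the /3 divisor and the 10-second recovery timeout are taken from the comment above, and the class and method names are hypothetical, not Ignite APIs.

```java
public class ConnCheckMath {
    // Assumption from the comment above: the discovery connection-check
    // interval is failureDetectionTimeout / 3 (values in milliseconds).
    static long connCheckInterval(long failureDetectionTimeoutMs) {
        return failureDetectionTimeoutMs / 3;
    }

    public static void main(String[] args) {
        long failureDetectionTimeoutMs = 60_000; // 60 s, as in the report
        long connRecoveryTimeoutMs = 10_000;     // 10 s, from the warning log

        long checkIntervalMs = connCheckInterval(failureDetectionTimeoutMs);
        // Recovery gives up (10 s) before the previous node can be
        // declared failed (20 s), so the local node segments itself.
        System.out.println("check interval = " + checkIntervalMs + " ms");
        System.out.println("recovery shorter than check interval: "
            + (connRecoveryTimeoutMs < checkIntervalMs));
    }
}
```

With these numbers the check interval comes out to 20,000 ms, longer than the 10,000 ms recovery timeout, which is exactly the gap described above.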

To fix this, the loopback check from the IGNITE-8683 ticket should be applied, and IGNITE-8944
added to mark the node as failed faster.
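As a workaround on the user side, the warning message itself suggests disabling segmentation on connection recovery failure. A minimal configuration sketch, assuming the `IgniteConfiguration.setFailureDetectionTimeout` and `TcpDiscoverySpi.setConnectionRecoveryTimeout` setters referenced in the log (this is a config fragment, not a tested recommendation):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class DiscoveryConfig {
    static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureDetectionTimeout(60_000); // 60 s, as in the reported setup

        TcpDiscoverySpi spi = new TcpDiscoverySpi();
        // 0 disables segmenting the local node when connection recovery
        // fails (per the warning text); alternatively, raise it above the
        // 20 s connection check interval.
        spi.setConnectionRecoveryTimeout(0);
        cfg.setDiscoverySpi(spi);
        return cfg;
    }
}
```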

> Node segmented itself after connRecoveryTimeout
> -----------------------------------------------
>
>                 Key: IGNITE-8985
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8985
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Dmitry Karachentsev
>            Priority: Major
>         Attachments: Archive.zip
>
>
> I can see the following message in logs:
> [2018-07-10 16:27:13,111][WARN ][tcp-disco-msg-worker-#2] Unable to connect to next nodes
> in a ring, it seems local node is experiencing connectivity issues. Segmenting local node
> to avoid case when one node fails a big part of cluster. To disable that behavior set TcpDiscoverySpi.setConnectionRecoveryTimeout()
> to 0. [connRecoveryTimeout=10000, effectiveConnRecoveryTimeout=10000]
> [2018-07-10 16:27:13,112][WARN ][disco-event-worker-#61] Local node SEGMENTED: TcpDiscoveryNode
> [id=e1a19d8e-2253-458c-9757-e3372de3bef9, addrs=[127.0.0.1, 172.17.0.1, 172.25.1.17], sockAddrs=[/172.17.0.1:47500,
> lab17.gridgain.local/172.25.1.17:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2,
> lastExchangeTime=1531229233103, loc=true, ver=2.4.7#20180710-sha1:a48ae923, isClient=false]
> I have a failure detection timeout of 60_000 ms, and during the test GC pauses were <25 secs, so
> I don't expect the node to be segmented.
>  
> Logs are attached.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
