ignite-issues mailing list archives

From "Alexey Goncharuk (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-10933) Node may hang on join to topology and not move forward
Date Wed, 23 Jan 2019 15:29:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Goncharuk updated IGNITE-10933:
--------------------------------------
    Fix Version/s: 2.8

> Node may hang on join to topology and not move forward
> ------------------------------------------------------
>
>                 Key: IGNITE-10933
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10933
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>             Fix For: 2.8
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Several nodes join the topology simultaneously and hang for a long time.
> This can happen on the first start of all cluster nodes or when nodes join an already formed topology.
> The logs of the affected nodes contain messages like:
> {noformat}
> 2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
> 2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
> ...
> {noformat}
> and this repeats for a long time, with no other messages in between.
> UPDATE: this behavior is caused by a TcpDiscoveryClientReconnectMessage stored in the pending objects collection being transferred to the joining node, which invalidates the socket connection to the joining node and marks it as failed.
> It can be reproduced by the following scenario:
> 1. Create a topology by starting nodes in this order: srv1, srv2, client, srv3, srv4.
> 2. Delay the client reconnect.
> 3. Trigger a topology change by restarting srv2 (this triggers a reconnect to the next node), srv3, and srv4.
> 4. Resume the reconnect to a node with an empty EnsuredMessageHistory (this triggers a discovery message of type TcpDiscoveryClientReconnectMessage) and wait for completion.
> 5. Add a new node to the topology.
> The new node will either fail with an assertion or get stuck on join forever, depending on timing.
> The same scenario could probably be triggered by a temporary connection loss to the joining node.
> [~v.pyatkov], thanks for help with the investigation.
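
To check whether a node is stuck in the repeated-join loop described above, one option is to scan its log for the "will repeat join process" warning and measure the time between occurrences. The following is only a minimal sketch: the class name JoinRetryGaps and the method gaps are hypothetical helpers (not part of Ignite), and it assumes each warning sits on a single log line with the timestamp layout shown in the excerpt.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log triage; not an Ignite API.
final class JoinRetryGaps {
    // Timestamp layout assumed from the excerpt: "2019-01-11 18:37:39.296".
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    // Matches a WARN line that carries the repeated-join message.
    private static final Pattern WARNING = Pattern.compile(
        "^\\s*(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) \\[WARN ?\\].*"
        + "Node has not been connected to topology and will repeat join process");

    // Returns the time elapsed between each pair of consecutive join warnings.
    static List<Duration> gaps(List<String> lines) {
        List<Duration> result = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            Matcher m = WARNING.matcher(line);
            if (!m.find()) {
                continue; // not a join warning
            }
            LocalDateTime current = LocalDateTime.parse(m.group(1), TS);
            if (previous != null) {
                result.add(Duration.between(previous, current));
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        // The two warnings from the report, shortened to the matched prefix.
        List<String> log = List.of(
            "2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
                + "Node has not been connected to topology and will repeat join process.",
            "2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
                + "Node has not been connected to topology and will repeat join process.");
        for (Duration gap : gaps(log)) {
            System.out.println("Gap between join warnings: " + gap);
        }
    }
}
```

For the two warnings quoted in the report, the gap is about 5 minutes 30 seconds. A long run of such warnings with no other discovery messages in between matches the symptom described in this issue.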



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
