ignite-issues mailing list archives

From "Alexei Scherbakov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-10933) Node may hang on join to topology and not move forward
Date Fri, 18 Jan 2019 09:15:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-10933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexei Scherbakov updated IGNITE-10933:
---------------------------------------
    Description: 
Several nodes join the topology simultaneously and hang for a long time.

This can happen on the first start of all cluster nodes or when nodes join an already formed topology.

The logs of the affected nodes contain messages like these:
{noformat}
2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]

2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]

...

{noformat}
and so on for a long time, with no other messages in between.
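
To check whether a node's log shows the same pattern, the repeated warning can be located and timed with a small helper. The following is only a sketch: the class and method names (JoinRetryScan, retryTimes, stuckFor) are hypothetical, the timestamp layout is assumed from the two lines quoted above, and each warning is assumed to sit on a single, unwrapped log line. The message text is cut after "repeat join process." for brevity.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Ignite.
final class JoinRetryScan {
    // Text of the TcpDiscoverySpi warning quoted in this issue.
    static final String MARKER =
        "Node has not been connected to topology and will repeat join process";

    // Assumed timestamp layout, e.g. "2019-01-11 18:37:39.296" (23 characters).
    static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Returns the timestamps of all join-retry warnings in the given log lines.
    static List<LocalDateTime> retryTimes(List<String> lines) {
        List<LocalDateTime> out = new ArrayList<>();
        for (String line : lines) {
            String s = line.trim();
            if (s.length() >= 23 && s.contains(MARKER))
                out.add(LocalDateTime.parse(s.substring(0, 23), TS));
        }
        return out;
    }

    // Time between the first and the last warning; zero if fewer than two were found.
    static Duration stuckFor(List<LocalDateTime> times) {
        if (times.size() < 2)
            return Duration.ZERO;
        return Duration.between(times.get(0), times.get(times.size() - 1));
    }

    public static void main(String[] args) {
        // The two warnings quoted above, with the tail of the message omitted.
        List<String> log = Arrays.asList(
            "2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
                + "Node has not been connected to topology and will repeat join process.",
            " 2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
                + "Node has not been connected to topology and will repeat join process.");

        List<LocalDateTime> times = retryTimes(log);
        System.out.println("join retries: " + times.size());
        System.out.println("stuck for ms: " + stuckFor(times).toMillis());
    }
}
```

For the two quoted lines this reports two retries about five and a half minutes apart, which is far longer than the configured networkTimeout of 5000 ms.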

UPDATE: this behavior is caused by a TcpDiscoveryClientReconnectMessage, stored in the pending objects collection, being transferred to the joining node. This invalidates the socket connection to the joining node, and the node is marked as failed.

Reproduced by the following scenario:

1. Create the topology in this specific order: srv1, srv2, client, srv3, srv4.
2. Delay the client reconnect.
3. Trigger a topology change by restarting srv2 (this triggers a reconnect to the next node), srv3 and srv4.
4. Resume the reconnect to a node with an empty EnsuredMessageHistory (this triggers a discovery message of type TcpDiscoveryClientReconnectMessage) and wait for completion.
5. Add a new node to the topology.

The new node will either fail with an assertion or get stuck on join forever, depending on timing.

The same scenario could probably be triggered by a temporary loss of connection to the joining node.

[~v.pyatkov], thanks for help with the investigation.

 

 

  was:
Several nodes join the topology simultaneously and hang for a long time.

This can happen on the first start of all cluster nodes or when nodes join an already formed topology.

The logs of the affected nodes contain messages like these:
{noformat}
2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]

2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]

...

{noformat}
and so on for a long time, with no other messages in between.


> Node may hang on join to topology and not move forward
> ------------------------------------------------------
>
>                 Key: IGNITE-10933
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10933
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
