ignite-issues mailing list archives

From "Sergey Chugunov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-11348) Ping node procedure may fail when another node leaves the cluster
Date Wed, 20 Feb 2019 07:55:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772734#comment-16772734 ]

Sergey Chugunov commented on IGNITE-11348:

[~dpavlov], the whole sequence of events leading to the issue is as follows:
# _leaving node_ sitting on a *host0:port0* disco address leaves the cluster (address becomes
# _new node_ binds to the same *host0:port0* address and sends a join request;
# _old node_ receives the join request and starts pinging _new node_;
# NODE_LEFT event for _leaving node_ arrives at _old node_; as part of handling NODE_LEFT, the
socket for the ongoing ping is closed (incorrectly, as this ping has nothing to do with _leaving node_).

To avoid this situation, I add the nodeID to the ping future and check it before closing the socket on
NODE_LEFT. The ID makes it possible to distinguish a ping request to _new node_ even though _new node_
and _leaving node_ have the same disco address.
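
The check described above could be sketched roughly as follows. This is a minimal illustration, not the actual patch: the PingRegistry and PingFuture classes, their methods, and the use of a plain string as the address key are all hypothetical names chosen for the example, and a boolean flag stands in for closing a real socket.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the nodeID check; not the real Ignite discovery code. */
class PingRegistry {
    /** An in-flight ping that remembers the ID of the node it targets. */
    static final class PingFuture {
        final UUID targetNodeId;
        private volatile boolean socketClosed;

        PingFuture(UUID targetNodeId) {
            this.targetNodeId = targetNodeId;
        }

        /** Stands in for closing the real ping socket. */
        void closeSocket() {
            socketClosed = true;
        }

        boolean isSocketClosed() {
            return socketClosed;
        }
    }

    /** Ongoing pings keyed by discovery address, e.g. "host0:port0". */
    private final Map<String, PingFuture> pings = new ConcurrentHashMap<>();

    /** Registers a ping to the given address, tagged with the target node ID. */
    PingFuture startPing(String addr, UUID targetNodeId) {
        PingFuture fut = new PingFuture(targetNodeId);
        pings.put(addr, fut);
        return fut;
    }

    /**
     * Handles NODE_LEFT for the node that used the given address.
     * The ping socket is closed only if the ping targets the node that left.
     *
     * @return true if a ping socket was closed.
     */
    boolean onNodeLeft(String addr, UUID leftNodeId) {
        PingFuture fut = pings.get(addr);

        // Same address, different node ID: the ping belongs to a new node
        // that reused the address, so leave its socket alone.
        if (fut == null || !fut.targetNodeId.equals(leftNodeId))
            return false;

        fut.closeSocket();
        pings.remove(addr, fut);
        return true;
    }
}
```

With this shape, a NODE_LEFT for _leaving node_ on *host0:port0* leaves the ping to _new node_ untouched, because the two node IDs differ even though the address is the same.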

> Ping node procedure may fail when another node leaves the cluster
> -----------------------------------------------------------------
>                 Key: IGNITE-11348
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11348
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Chugunov
>            Assignee: Sergey Chugunov
>            Priority: Critical
>             Fix For: 2.8
> Additional pinging of a node on join, implemented in IGNITE-5569, may incorrectly fail, leading
> to the joining node being shut down.
> The reason is that if another node from the same host, bound to the same discovery
> port as the joining node, has left the cluster right before the joining node joins, the socket used for pinging
> gets closed.
> As a result, the pinging node considers the joining node "unreachable"
> and fails it with the JOIN_IMPOSSIBLE error code.
> Workaround: simply restart the node that failed on join.
