ignite-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-6700) Node considered as failed can cause failure of others nodes
Date Tue, 07 Nov 2017 05:57:01 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241533#comment-16241533 ]

ASF GitHub Bot commented on IGNITE-6700:
----------------------------------------

GitHub user akuramshingg opened a pull request:

    https://github.com/apache/ignite/pull/2984

    IGNITE-6700 Node considered as failed can cause failure of others nodes

    Independent asynchronous connection checkers for the previous and the next nodes.
    TcpDiscoveryHandshakeResponse carries failed node and generates TcpDiscoveryNodeFailedMessage on a sender.
    Remember the list of recently failed server nodes.
    Synchronized access to sendMessageAcrossRing().
    Local node freeze detection.
    
    New TcpDiscoverySplitTest based on IgniteCacheTopologySplitAbstractTest.
    CacheLateAffinityAssignmentTest and TcpDiscoverySelfTest update.
    
    GridDhtPartitionTopologyImpl update: IGNITE-6433 workaround.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-6700-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/2984.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2984
    
----
commit f7875d21ed3364da87cab02b249a14af227a941b
Author: Alexandr Kuramshin <ein.nsk.ru@gmail.com>
Date:   2017-11-07T05:55:24Z

    IGNITE-6700 Node considered as failed can cause failure of others nodes
    
    Independent asynchronous connection checkers for the previous and the next nodes.
    TcpDiscoveryHandshakeResponse carries failed node and generates TcpDiscoveryNodeFailedMessage on a sender.
    Remember the list of recently failed server nodes.
    Synchronized access to sendMessageAcrossRing().
    Local node freeze detection.
    
    New TcpDiscoverySplitTest based on IgniteCacheTopologySplitAbstractTest.
    CacheLateAffinityAssignmentTest and TcpDiscoverySelfTest update.
    
    GridDhtPartitionTopologyImpl update: IGNITE-6433 workaround.

----


> Node considered as failed can cause failure of others nodes
> -----------------------------------------------------------
>
>                 Key: IGNITE-6700
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6700
>             Project: Ignite
>          Issue Type: Bug
>      Security Level: Public(Viewable by anyone) 
>          Components: general
>            Reporter: Semen Boikov
>            Assignee: Alexandr Kuramshin
>            Priority: Critical
>
> A node considered as failed can cause the failure of other nodes in the cluster.
> There is an issue in TcpDiscoveryAbstractMessage.failedNodes processing: if a message is received from a node that is already considered as failed, its failedNodes should be ignored.
> Possible scenario:
> - there are 4 nodes (1 -> 2 -> 3 -> 4)
> - node 3 temporarily loses connection with the others
> - node 2 considers 3 as failed, and a node failed event is fired for 3
> - node 3 considers 4 as failed and adds 4 to nodeFailedList; it then restores its connection with 1, and currently 1 will process the nodeFailedList from 3 (even though 3 is already considered as failed)
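
The scenario above can be illustrated with a minimal Java sketch of the rule the issue asks for: failure reports carried by a message are ignored when the sender is itself already considered failed. All names here (FailedNodesFilterSketch, Msg, markFailed, acceptedFailures) are hypothetical and chosen for illustration; this is not Ignite's discovery API and not necessarily how the pull request implements the fix.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not Ignite's actual classes. */
public class FailedNodesFilterSketch {
    /** Hypothetical stand-in for a discovery message that carries a failed-node list. */
    public static final class Msg {
        public final int senderId;
        public final Set<Integer> failedNodes;

        public Msg(int senderId, Set<Integer> failedNodes) {
            this.senderId = senderId;
            this.failedNodes = failedNodes;
        }
    }

    /** Nodes that the local node already considers failed. */
    private final Set<Integer> locallyFailed = new HashSet<>();

    public void markFailed(int nodeId) {
        locallyFailed.add(nodeId);
    }

    /**
     * Returns the failure reports from the message that should be acted on.
     * Reports from a sender that is already considered failed are dropped.
     */
    public Set<Integer> acceptedFailures(Msg msg) {
        if (locallyFailed.contains(msg.senderId))
            return Collections.emptySet();

        return msg.failedNodes;
    }

    public static void main(String[] args) {
        // Node 1's view of the ring 1 -> 2 -> 3 -> 4.
        FailedNodesFilterSketch node1 = new FailedNodesFilterSketch();

        // Node 2 has reported node 3 as failed.
        node1.markFailed(3);

        // Node 3 reconnects and claims that node 4 failed: the claim is ignored.
        Set<Integer> fromFailedSender = new HashSet<>(Collections.singleton(4));
        System.out.println(node1.acceptedFailures(new Msg(3, fromFailedSender)));

        // The same claim from a healthy node 2 would still be accepted.
        Set<Integer> fromHealthySender = new HashSet<>(Collections.singleton(4));
        System.out.println(node1.acceptedFailures(new Msg(2, fromHealthySender)));
    }
}
```

In this sketch node 1 drops the report about node 4 because it arrives from node 3, so a node that was cut off from the ring cannot propagate its own stale view of failures.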



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
