ignite-issues mailing list archives

From "Alexandr Kuramshin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-4111) Communication fails to send message if target node did not finish join process
Date Sat, 19 Nov 2016 21:14:58 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679867#comment-15679867 ]

Alexandr Kuramshin commented on IGNITE-4111:

I tried to reproduce the situation in which the nodes in the topology have already received
TcpDiscoveryNodeAddFinishedMessage but the joining node has not.
I added a delay to the processing of TcpDiscoveryNodeAddFinishedMessage, decreased the communication
timeout, and started 10 nodes simultaneously.
After the 10th node started, I got this error:
Failed to send job request, caused by Failed to send message to remote node, caused by Failed
to connect to node (is node still alive?),
caused by Failed to perform handshake due to timeout (consider increasing 'connectionTimeout'
configuration property).

However, after some investigation I found that the onFirstMessage method does not hang. It
completes successfully every time onMessage is invoked.

> Communication fails to send message if target node did not finish join process
> ------------------------------------------------------------------------------
>                 Key: IGNITE-4111
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4111
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>            Reporter: Semen Boikov
>            Assignee: Alexandr Kuramshin
>             Fix For: 2.0
>         Attachments: test onFirstMessage hang.log
> Currently this scenario is possible:
> - joining node sent join request and waits for TcpDiscoveryNodeAddFinishedMessage inside
> - other nodes already see this node and can send messages to it (for example, try to
> run a compute job on this node)
> - joining node cannot receive the message: TcpCommunicationSpi will hang inside 'onFirstMessage'
> on the 'getSpiContext' call, so the sending node will get an error trying to establish a connection
> Possible fix: if in onFirstMessage() the SPI context is not available, then TcpCommunicationSpi
> should send a special response which indicates that this node is not ready yet, and the sender
> should retry after some time.
> Also need to check internal code for places where a message can be unnecessarily sent to a node:
> one such place is GridCachePartitionExchangeManager.refreshPartitions - the message is sent to
> all known nodes, but here we can filter by node order / finished exchange version.
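
The "possible fix" quoted above (reply "not ready" instead of blocking, and let the sender
retry) could look roughly like the sketch below. It is plain Java and does not use any Ignite
classes: NotReadyRetry, Receiver, HandshakeReply and connectWithRetry are hypothetical names
made up for illustration, and the joinFinished flag only stands in for "SPI context is
available". It is not the actual TcpCommunicationSpi code or the patch for this issue.

```java
public class NotReadyRetry {
    /** Hypothetical handshake replies; NOT_READY models the proposed "special response". */
    enum HandshakeReply { OK, NOT_READY }

    /** Hypothetical stand-in for the joining node's side of the handshake. */
    static final class Receiver {
        /** Assumption: plays the role of "SPI context is available" on the joining node. */
        private volatile boolean joinFinished;

        void finishJoin() {
            joinFinished = true;
        }

        /** Answers immediately instead of blocking until the join completes. */
        HandshakeReply onFirstMessage() {
            return joinFinished ? HandshakeReply.OK : HandshakeReply.NOT_READY;
        }
    }

    /**
     * Sender side: retries the handshake while the target reports NOT_READY.
     *
     * @return the 1-based attempt that succeeded, or -1 if all attempts were used
     *         or the calling thread was interrupted while waiting.
     */
    static int connectWithRetry(Receiver target, int maxAttempts, long retryDelayMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (target.onFirstMessage() == HandshakeReply.OK)
                return attempt;

            try {
                Thread.sleep(retryDelayMs); // back off before asking again
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return -1;
            }
        }

        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        Receiver joining = new Receiver();

        // Simulate a join that completes a little after the first connection attempt.
        Thread joiner = new Thread(() -> {
            try {
                Thread.sleep(50);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            joining.finishJoin();
        });
        joiner.start();

        int attempt = connectWithRetry(joining, 100, 10);
        joiner.join();

        System.out.println(attempt > 0
            ? "Connected on attempt " + attempt
            : "Gave up: target never became ready");
    }
}
```

The point of the sketch is that the sender gets an explicit, fast answer and can decide how
long to keep retrying, rather than running into the handshake timeout shown in the error above.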

This message was sent by Atlassian JIRA
