ignite-issues mailing list archives

From "Dmitry Karachentsev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-4473) Client should re-try connection attempt in case of concurrent network failure
Date Mon, 06 Mar 2017 11:02:32 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-4473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897095#comment-15897095 ]

Dmitry Karachentsev commented on IGNITE-4473:

1. During exchange, when an IOException is caught and the local node is a client,
IgniteCouldReconnectCheckedException is thrown.
2. It is processed in the IgniteKernal.start() method and signals that the client should
reconnect to the cluster.
3. For that purpose, a rejoin() method was added to GridDiscoveryManager and ClientImpl. It means
that the client should initiate a disconnect from the cluster, to force all node-leave routines
to run, and then try to join again.
4. When the start routine catches IgniteCouldReconnectCheckedException, it calls rejoin() and waits
on the reconnect future. If any other exception is thrown, the node is stopped.
5. This blocks the user thread during node start; the thread is released once the rejoin succeeds.
6. Added an onReconnectFailed() method to GridKernalGateway that completes the reconnect future with
an exception. This exception is processed in the IgniteKernal rejoin loop.
7. ClientImpl.SocketWriter.forceLeave() blocks until the node-left message is sent (or sending
fails) and then closes the connection to the cluster.
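
For illustration only, the control flow in steps 1-6 might look roughly like the sketch below. It is not the code from the patch: ReconnectNeededException, Client, FlakyClient and start() are hypothetical stand-ins for IgniteCouldReconnectCheckedException, the discovery-side rejoin() call and the IgniteKernal start routine. It also assumes that a reconnect future failing with a recoverable cause simply triggers another rejoin attempt, which the comment above does not spell out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class RejoinLoopSketch {
    /** Hypothetical stand-in for IgniteCouldReconnectCheckedException. */
    static class ReconnectNeededException extends Exception {
        ReconnectNeededException(String msg) { super(msg); }
    }

    /** Hypothetical client-side operations; not Ignite's real interfaces. */
    interface Client {
        /** Runs the exchange; throws if a network failure makes a rejoin necessary. */
        void exchange() throws ReconnectNeededException;

        /** Leaves the cluster and joins again; the future completes when the attempt ends. */
        CompletableFuture<Void> rejoin();
    }

    /**
     * Blocks the calling thread until the exchange succeeds and returns the
     * number of rejoin attempts made. Any failure other than
     * ReconnectNeededException is rethrown, which models stopping the node.
     */
    static int start(Client client) {
        int rejoins = 0;
        while (true) {
            try {
                if (rejoins > 0)
                    client.rejoin().join(); // wait on the reconnect future
                client.exchange();
                return rejoins;
            } catch (ReconnectNeededException e) {
                rejoins++; // exchange failed: rejoin on the next iteration
            } catch (CompletionException e) {
                // The reconnect future completed with an exception.
                if (!(e.getCause() instanceof ReconnectNeededException))
                    throw e; // unrecoverable: the node would be stopped
                rejoins++; // assumption: a recoverable failure triggers another rejoin
            }
        }
    }

    /** Test double: the exchange fails a fixed number of times, rejoin always succeeds. */
    static class FlakyClient implements Client {
        private int failuresLeft;

        FlakyClient(int failures) { failuresLeft = failures; }

        @Override public void exchange() throws ReconnectNeededException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new ReconnectNeededException("simulated network glitch");
            }
        }

        @Override public CompletableFuture<Void> rejoin() {
            return CompletableFuture.completedFuture(null);
        }
    }

    public static void main(String[] args) {
        // Two simulated glitches, so start() goes through two rejoin attempts.
        System.out.println(start(new FlakyClient(2)));
    }
}
```

The loop is deliberately unbounded, matching the idea that the caller stays blocked until the client is back in the cluster; a real implementation would presumably also have to handle interruption and node stop.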

Left to do:
Add a test and code for the case when the client was disconnected from the cluster, but the connection
to the coordinator wasn't fully restored. The client node should keep rejoining until the coordinator
becomes available.

> Client should re-try connection attempt in case of concurrent network failure
> -----------------------------------------------------------------------------
>                 Key: IGNITE-4473
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4473
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Vladimir Ozerov
>            Assignee: Dmitry Karachentsev
>             Fix For: 2.0
> *Problem*
> Consider the following scenario:
> 1) Client started, but there are no servers, so it hangs somewhere inside start routine.
> 2) Server appears, and discovery finishes successfully.
> 3) Nodes start talking to each other through communication SPI to finish start process (e.g. to complete exchange).
> 4) But network glitch occurs and server becomes unreachable.
> *Expected behavior*
> Client disconnects and hangs waiting for reconnect.
> *Actual behavior*
> Client throws an exception and never tries to reconnect.
