ignite-issues mailing list archives

From "Denis Magda (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-2656) Documentation on debugging and fixing the reasons of node disconnection from the cluster
Date Thu, 11 Aug 2016 01:13:20 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Magda updated IGNITE-2656:
    Priority: Major  (was: Critical)

> Documentation on debugging and fixing the reasons of node disconnection from the cluster
> ----------------------------------------------------------------------------------------
>                 Key: IGNITE-2656
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2656
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Magda
>            Assignee: Denis Magda
>             Fix For: 1.8
> Sometimes a node can be abruptly kicked out of the cluster for some reason.
> The documentation must explain how to get to the root of the issue by
looking at the log files. Usually the log of the node that was kicked out contains a "Local node segmented"
message, and the log of the node that failed its next neighbor contains a more detailed message, "Failed
to send message to next node".
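As a rough illustration of the log check described above, the sketch below filters a node's log for the two messages quoted in the description. The class name, method name, and the way the log path is passed in are assumptions made for this example; only the two message strings come from the issue text.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class DisconnectLogScan {
    /** Message expected in the log of the node that was kicked out. */
    static final String SEGMENTED = "Local node segmented";

    /** Message expected in the log of the node that failed its next neighbor. */
    static final String SEND_FAILED = "Failed to send message to next node";

    /** Returns the lines containing either message, in their original order. */
    static List<String> findDisconnectLines(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains(SEGMENTED) || line.contains(SEND_FAILED)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java DisconnectLogScan <path-to-node-log>
        if (args.length != 1) {
            System.err.println("Usage: java DisconnectLogScan <log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String hit : findDisconnectLines(lines)) {
            System.out.println(hit);
        }
    }
}
```

Running it against the logs of both nodes and comparing timestamps of the matching lines is one way to line up the segmentation with the send failure that preceded it.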
> Next, the article must list possible reasons for the disconnection:
> - long GC pauses (give recommendations on how to check for them);
> - high node utilization, so that the node responds with a delay;
> - network configuration parameters that are set too low for the environment.
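One possible way to check for the first reason in the list, long GC pauses, is to sample the JVM's own garbage collector counters from inside the node process. The sketch below uses the standard `java.lang.management` API; the class name and the 10-second sampling window are assumptions chosen for illustration. The reported figure is the collectors' approximate accumulated time, so it is only a coarse signal, and GC logs remain the more precise source for individual pause lengths.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcTimeCheck {
    /**
     * Sums the approximate accumulated collection time (in milliseconds)
     * over all collectors of this JVM. Collectors that report -1
     * (undefined) are skipped.
     */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMs = 10_000; // hypothetical sampling window
        long before = totalGcMillis();
        Thread.sleep(windowMs);
        long spent = totalGcMillis() - before;
        System.out.println("GC time during the last " + windowMs + " ms: " + spent + " ms");
    }
}
```

If the time spent in GC within a window approaches the configured discovery timeouts, GC pauses become a plausible cause of the disconnection.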
> There should be a section about {{IgniteConfiguration.failureDetectionTimeout}} describing
its behavior and showing all its pros and cons.
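For reference, a minimal sketch of setting this property programmatically is shown below. It needs the Ignite libraries on the classpath; the class name and the 30-second value are assumptions for illustration, not recommendations. The trade-off it hints at is the general one: a larger value tolerates longer pauses and network hiccups, while a smaller one detects a truly failed node sooner.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

class FailureDetectionExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Illustrative value in milliseconds; choose it from measured
        // GC pauses and network latency in the target environment.
        cfg.setFailureDetectionTimeout(30_000);

        // Start a node with this configuration and print the effective value.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("failureDetectionTimeout = "
                + ignite.configuration().getFailureDetectionTimeout() + " ms");
        }
    }
}
```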
> The article must say when it makes sense to 'disable' this timeout by switching to explicit
configuration of TcpDiscoverySpi.socketTimeout, TcpDiscoverySpi.ackTimeout, TcpDiscoverySpi.maxAckTimeout,
and TcpDiscoverySpi.reconnectCount. The pros and cons of manual configuration have to be mentioned
as well.
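A sketch of what such explicit configuration could look like follows. It assumes the Ignite libraries are on the classpath; the class name, method name, and all numeric values are placeholders picked for the example and would have to be tuned per environment.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

class ExplicitDiscoveryTimeouts {
    /** Builds a configuration that sets the discovery timeouts by hand. */
    static IgniteConfiguration configure() {
        TcpDiscoverySpi spi = new TcpDiscoverySpi();

        // All values below are illustrative placeholders (milliseconds,
        // except reconnectCount, which is a number of attempts).
        spi.setSocketTimeout(5_000);
        spi.setAckTimeout(5_000);
        spi.setMaxAckTimeout(60_000);
        spi.setReconnectCount(5);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(spi);
        return cfg;
    }
}
```

Manual configuration gives finer control over each stage of the exchange, at the cost of having several interdependent values to keep consistent instead of one.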
> I would also cover the usage of TcpDiscoverySpi.joinTimeout and
> TcpDiscoverySpi.networkTimeout (used on client reconnect, when servers wait for the join result,
on node stop, and for the socket reader's first message) there.
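These two properties are set on the same SPI object. The fragment below is a sketch that assumes the Ignite libraries are on the classpath; the class name, method name, and values are illustrative only.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

class JoinAndNetworkTimeouts {
    /** Builds a configuration with explicit join and network timeouts. */
    static IgniteConfiguration configure() {
        TcpDiscoverySpi spi = new TcpDiscoverySpi();

        // Illustrative placeholders, in milliseconds.
        spi.setJoinTimeout(60_000);    // how long a node keeps trying to join
        spi.setNetworkTimeout(10_000); // timeout for the network operations listed above

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(spi);
        return cfg;
    }
}
```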

This message was sent by Atlassian JIRA
