ignite-issues mailing list archives

From "Denis Magda (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-2656) Documentation on debugging and fixing the reasons of node disconnection from the cluster
Date Tue, 16 Feb 2016 09:23:18 GMT
Denis Magda created IGNITE-2656:

             Summary: Documentation on debugging and fixing the reasons of node disconnection from the cluster
                 Key: IGNITE-2656
                 URL: https://issues.apache.org/jira/browse/IGNITE-2656
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Magda
            Assignee: Denis Magda
            Priority: Critical
             Fix For: 1.6

Sometimes a node can be abruptly kicked out of the cluster for some reason.

The documentation must explain how to get to the root of the issue by looking
at the log files. Usually the log of the node that was kicked out contains a "Local node
segmented" message, and the log of the node that failed its next neighbor contains a more
detailed message, "Failed to send message to next node".
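
As a starting point for such a log check, here is a minimal sketch in plain Java that tags log lines containing either of the two messages quoted above. The class name, the role labels, and the assumption that each message appears on a single line are illustrative only; the exact log wording may differ between Ignite versions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tags log lines that carry the two disconnection markers. */
final class DisconnectLogScan {
    // Marker strings quoted from the issue text; real log wording may vary by version.
    static final String SEGMENTED = "Local node segmented";
    static final String SEND_FAILED = "Failed to send message to next node";

    /** Which side of the disconnection a log line most likely belongs to. */
    enum Role { KICKED_NODE, DETECTING_NODE, NONE }

    static Role classify(String line) {
        if (line.contains(SEGMENTED)) return Role.KICKED_NODE;
        if (line.contains(SEND_FAILED)) return Role.DETECTING_NODE;
        return Role.NONE;
    }

    /** Returns only the lines that contain one of the markers, prefixed with the role. */
    static List<String> matches(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Role role = classify(line);
            if (role != Role.NONE) out.add(role + ": " + line);
        }
        return out;
    }

    /** Usage: java DisconnectLogScan <log-file> [<log-file> ...] */
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            matches(Files.readAllLines(Paths.get(path))).forEach(System.out::println);
        }
    }
}
```

Running it over the logs of both nodes should show which node was segmented and which node reported the failed send, which narrows down where to look next.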

Next, the article must list possible reasons for the disconnection:
- long GC pauses (give recommendations on how to check for them);
- high node utilization, so that the node responds with a delay;
- network configuration parameters set too low for the environment.
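
For the first item in the list above, one possible way to check for long pauses is a small probe thread that sleeps for a fixed interval and measures how much longer the sleep actually took. The sketch below is an assumption-laden illustration, not part of Ignite: the class name, the 50 ms interval, and the 500 ms threshold are hypothetical. The probe cannot tell a GC pause from CPU starvation or OS scheduling delays, so GC logs (for example, enabled with -verbose:gc) are still needed to confirm the cause.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical probe: reports when a short sleep overruns by a large margin. */
final class PauseProbe {
    /** Milliseconds by which the measured sleep exceeded the requested one (never negative). */
    static long excessMillis(long sleepMs, long elapsedNanos) {
        return Math.max(0L, TimeUnit.NANOSECONDS.toMillis(elapsedNanos) - sleepMs);
    }

    public static void main(String[] args) throws InterruptedException {
        long sleepMs = 50L;      // hypothetical probe interval
        long thresholdMs = 500L; // hypothetical "long pause" threshold
        int iterations = 200;    // about 10 seconds of probing

        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long excess = excessMillis(sleepMs, System.nanoTime() - start);
            if (excess >= thresholdMs) {
                System.out.println("Possible JVM or OS pause of about " + excess + " ms");
            }
        }
    }
}
```

If the reported pauses are comparable to the discovery timeouts in use, the node may well be considered failed by its neighbors during such a pause.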

There should be a section about {{IgniteConfiguration.failureDetectionTimeout}} describing
its behavior and listing its pros and cons.
The article must say when it makes sense to 'disable' this timeout by switching to explicit
configuration of TcpDiscoverySpi.socketTimeout, TcpDiscoverySpi.ackTimeout, TcpDiscoverySpi.maxAckTimeout,
and TcpDiscoverySpi.reconnectCount. The pros and cons of manual configuration have to be mentioned
as well.

I would also describe there the usage of TcpDiscoverySpi.joinTimeout and
TcpDiscoverySpi.networkTimeout (used on client reconnect, when servers wait for the join result,
on node stop, and for the socket reader's first message).
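
To make the two configuration styles discussed above concrete, here is a rough Java sketch. It needs the Ignite core library on the classpath. All timeout values are hypothetical placeholders rather than recommendations, and the class and method names are made up for illustration.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

/** Hypothetical sketch of the two ways to configure failure detection. */
final class DiscoveryTimeoutsSketch {
    /** Option A: rely on the single failure detection timeout. */
    static IgniteConfiguration withFailureDetectionTimeout() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureDetectionTimeout(30_000L); // hypothetical value, in milliseconds
        return cfg;
    }

    /** Option B: 'disable' that timeout by setting the discovery parameters explicitly. */
    static IgniteConfiguration withExplicitTimeouts() {
        TcpDiscoverySpi spi = new TcpDiscoverySpi();

        // Parameters named in this issue as the explicit alternative (hypothetical values).
        spi.setSocketTimeout(5_000L);
        spi.setAckTimeout(5_000L);
        spi.setMaxAckTimeout(60_000L);
        spi.setReconnectCount(10);

        // Additional timeouts the issue asks to document (hypothetical values).
        spi.setJoinTimeout(30_000L);
        spi.setNetworkTimeout(10_000L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(spi);
        return cfg;
    }
}
```

Option A keeps tuning down to a single value, while Option B gives finer control at the cost of having to keep several parameters consistent with each other.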

This message was sent by Atlassian JIRA
