ignite-issues mailing list archives

From "Ryabov Dmitrii (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-5580) Improve node failure cause information
Date Fri, 29 Dec 2017 12:02:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306228#comment-16306228 ]

Ryabov Dmitrii commented on IGNITE-5580:
----------------------------------------

[~agoncharuk], I used {{TcpDiscoveryNodeFailedMessage.warning(String)}} to send a message about
the failure. This warning is logged by the {{IgniteUtils}} logger while the node failed message is
processed, but it was used in only 2 cases ({{TcpCommunicationSpi.checkClientQueueSize()}} and {{.createTcpClient()}}).

Is this form of logging acceptable, or do we need more detailed messages?

Also, when a node fails, I log the latest events on all nodes.
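
To make the "log the latest events" part concrete, here is a minimal sketch of a bounded buffer that keeps the most recent events and dumps them together with the failure cause. It is only an illustration: {{RecentEventLog}}, its methods, and the message format are hypothetical names, not existing Ignite classes and not the code in the patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Ignite): keeps the last N discovery-related events. */
final class RecentEventLog {
    /** Maximum number of events retained. */
    private final int capacity;

    /** Retained events, oldest first. */
    private final Deque<String> events = new ArrayDeque<>();

    RecentEventLog(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");

        this.capacity = capacity;
    }

    /** Records an event, evicting the oldest one when the buffer is full. */
    synchronized void record(String event) {
        if (events.size() == capacity)
            events.removeFirst();

        events.addLast(event);
    }

    /** Returns a copy of the retained events, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(events);
    }

    /** Builds the text to log when a node is declared failed (format is an assumption). */
    synchronized String dump(String failedNodeId, String cause) {
        StringBuilder sb = new StringBuilder();

        sb.append("Node failed [id=").append(failedNodeId)
            .append(", cause=").append(cause).append(']');

        for (String evt : events)
            sb.append(System.lineSeparator()).append("  ").append(evt);

        return sb.toString();
    }
}
```

In this sketch each node would call {{record(...)}} for events such as a failed send or a failed connection check, and pass the result of {{dump(...)}} to its logger when it processes the node failed message. The fixed capacity keeps memory use bounded regardless of how long the node runs.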

> Improve node failure cause information
> --------------------------------------
>
>                 Key: IGNITE-5580
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5580
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 1.7
>            Reporter: Alexey Goncharuk
>            Assignee: Ryabov Dmitrii
>              Labels: observability
>
> When a node fails, we do not print out any information about the root cause of this failure.
> This makes it extremely hard to investigate the failure causes - I need to find the previous
> node for the failed node and check the logs on that previous node.
> I suggest that we add extensive information about the reason for the node failure and
> the sequence of events that led to it, e.g.:
> [time] [NODE] Sending a message to next node - failed _because_ - write timeout, read
> timeout, ...?
> [time] [NODE] Connection check - failed - why? Connection refused, handshake timed out,
> ...?
> ...
> [time] [NODE] Decided to drop the node because of the sequence above
> Maybe we do not need to always print out this information, but we do need it when the
> troubleshooting logger is enabled.
> Also, DiscoverySpi should collect a set of the latest important events and dump these events
> in case of local node segmentation. This will allow users to match the events in the cluster
> with the events on the local node and get to the bottom of the failure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
