ignite-issues mailing list archives

From "Sergey Chugunov (Jira)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
Date Tue, 20 Aug 2019 12:07:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911258#comment-16911258 ]

Sergey Chugunov commented on IGNITE-10808:


Along with the MetricsUpdate message, your change also affects TcpDiscoveryClientAckResponse, which
will no longer be processed with priority over other messages. This may be risky.
Could you check what the consequences are for client nodes' stability if acks are delivered
to them with some delay?


> Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
> --------------------------------------------------------------------------
>                 Key: IGNITE-10808
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10808
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.7
>            Reporter: Stanislav Lukyanov
>            Assignee: Denis Mekhanikov
>            Priority: Major
>              Labels: discovery
>             Fix For: 2.8
>         Attachments: IgniteMetricsOverflowTest.java
> A node receives a new metrics update message every `metricsUpdateFrequency` milliseconds,
and the message is put at the top of the queue (because it is a high-priority message).
> If processing one message takes longer than `metricsUpdateFrequency`, multiple `TcpDiscoveryMetricsUpdateMessage`
instances will be in the queue. A long enough delay (e.g. caused by a network glitch or GC) may lead
to the queue building up tens of metrics update messages that are essentially useless to
process. Finally, if processing a message takes, on average, a little longer than `metricsUpdateFrequency`
(even for a relatively short period of time, say, for a minute due to network issues), then
the message worker will end up processing only the metrics updates and the cluster will essentially
> Reproducer is attached. In the test, the queue first builds up and is then torn down
very slowly, causing "Failed to wait for PME" messages.
> ServerImpl's SocketReader needs to be changed so that it does not put another metrics update
message at the top of the queue if the queue already has one (or so that it replaces the one at the top with the new one).
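
The "replace the one at the top" variant proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ignite's actual ServerImpl code: `CoalescingMessageQueue`, `Msg`, `addRegular` and `addMetricsUpdate` are hypothetical names, and metrics updates are assumed to be the only high-priority message type.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a discovery message queue; not taken from Ignite. */
class CoalescingMessageQueue {
    /** Hypothetical message: a flag marking metrics updates, plus a payload. */
    static final class Msg {
        final boolean metricsUpdate;
        final String payload;

        Msg(boolean metricsUpdate, String payload) {
            this.metricsUpdate = metricsUpdate;
            this.payload = payload;
        }
    }

    private final Deque<Msg> queue = new ArrayDeque<>();

    /** Regular messages are appended to the tail of the queue. */
    synchronized void addRegular(String payload) {
        queue.addLast(new Msg(false, payload));
    }

    /**
     * Metrics updates go to the head of the queue. If the head already holds
     * a metrics update, it is dropped first, so stale updates never pile up.
     */
    synchronized void addMetricsUpdate(String payload) {
        Msg head = queue.peekFirst();

        if (head != null && head.metricsUpdate)
            queue.pollFirst(); // Discard the stale update.

        queue.addFirst(new Msg(true, payload));
    }

    /** Takes the next message for processing, or null if the queue is empty. */
    synchronized Msg poll() {
        return queue.pollFirst();
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With this sketch, adding one regular message followed by three metrics updates leaves two messages in the queue: the latest metrics update at the head and the regular message behind it. The queue therefore holds at most one metrics update, however slowly the worker drains it.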

This message was sent by Atlassian Jira
