From: "Stanislav Lukyanov (JIRA)"
To: issues@ignite.apache.org
Reply-To: dev@ignite.apache.org
Date: Tue, 25 Dec 2018 12:51:00 +0000 (UTC)
Subject: [jira] [Commented] (IGNITE-10808) Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage

    [ https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728702#comment-16728702 ]

Stanislav Lukyanov commented on IGNITE-10808:
---------------------------------------------

There are two parts to this problem:

1) The queue may grow indefinitely if metrics updates are generated faster than they're processed.
This can be solved by keeping only the latest metrics update in the queue. When a new metrics update is added to the queue, we should check whether the queue already contains another metrics update. If it does, we should replace the old one with the new one (at the same place in the queue). We should be careful to replace only metrics updates that are on their first ring pass; messages on their second ring pass should be left in the queue.

2) The metrics updates may take up too much of the discovery worker's capacity, leading to starvation-type issues.
This can be solved by giving metrics updates normal priority instead of high priority. To avoid triggering failure detection, we need to make sure that all messages, not only metrics updates, reset the failure detection timer.
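
To illustrate part 1, here is a minimal sketch of the replace-in-place idea. It is not the actual ServerImpl code: `Msg`, `CoalescingQueue`, and their fields are hypothetical stand-ins, and the real discovery queue and message classes will differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a discovery message (not an Ignite class). */
final class Msg {
    final String label;          // only used to tell messages apart in this sketch
    final boolean metricsUpdate; // true for a metrics update message
    final boolean firstPass;     // true while the message is on its first ring pass

    Msg(String label, boolean metricsUpdate, boolean firstPass) {
        this.label = label;
        this.metricsUpdate = metricsUpdate;
        this.firstPass = firstPass;
    }
}

/** Hypothetical queue that keeps at most one first-pass metrics update. */
final class CoalescingQueue {
    private final List<Msg> queue = new ArrayList<>();

    /** Adds a message, replacing a queued first-pass metrics update in place. */
    synchronized void add(Msg msg) {
        if (msg.metricsUpdate && msg.firstPass) {
            for (int i = 0; i < queue.size(); i++) {
                Msg old = queue.get(i);

                // Second-pass metrics updates are deliberately left untouched.
                if (old.metricsUpdate && old.firstPass) {
                    queue.set(i, msg); // same position, newer content
                    return;
                }
            }
        }

        queue.add(msg);
    }

    /** Removes and returns the head of the queue, or null if it is empty. */
    synchronized Msg poll() {
        return queue.isEmpty() ? null : queue.remove(0);
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With this approach the number of queued first-pass metrics updates is bounded by one no matter how long the worker is stalled, and the surviving update keeps the position of the one it replaced. The sketch does not model message priorities or the failure detection timer from part 2.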

> Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
> --------------------------------------------------------------------------
>
>                 Key: IGNITE-10808
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10808
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Stanislav Lukyanov
>            Priority: Major
>         Attachments: IgniteMetricsOverflowTest.java
>
>
> A node receives a new metrics update message every `metricsUpdateFrequency` milliseconds, and the message is put at the top of the queue (because it is a high priority message).
> If processing one message takes longer than `metricsUpdateFrequency`, multiple `TcpDiscoveryMetricsUpdateMessage` instances will be in the queue. A long enough delay (e.g. caused by a network glitch or GC) may lead to the queue building up tens of metrics update messages, which are essentially useless to process. Finally, if processing a message takes on average a little longer than `metricsUpdateFrequency` (even for a relatively short period of time, say, for a minute due to network issues), the message worker will end up processing only metrics updates and the cluster will essentially hang.
> A reproducer is attached. In the test, the queue first builds up and is then torn down very slowly, causing "Failed to wait for PME" messages.
> We need to change ServerImpl's SocketReader so that it does not put another metrics update message at the top of the queue if the queue already has one (or so that it replaces the one at the top with the new one).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)