Date: Thu, 23 Aug 2018 15:00:00 +0000 (UTC)
From: "Pavel Kovalenko (JIRA)"
To: issues@ignite.apache.org
Reply-To: dev@ignite.apache.org
Subject: [jira] [Comment Edited] (IGNITE-9309) LocalNodeMovingPartitionsCount metrics may calculates incorrect due to processFullPartitionUpdate

    [ https://issues.apache.org/jira/browse/IGNITE-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590338#comment-16590338 ]

Pavel Kovalenko edited comment on IGNITE-9309 at 8/23/18 2:59 PM:
------------------------------------------------------------------

The actual problem was introduced in https://issues.apache.org/jira/browse/IGNITE-8684 .
The key issue is that partition state changes now happen only after receiving a FullMap with an exchangeId (PME).
There can be a race between handling a FullMap with exchangeId != null (PME) and a FullMap without an exchangeId.
If we receive a fresh FullMap without an exchangeId before the one with an exchangeId, we override our local partition states, and the FullMap with the exchangeId will be rejected as outdated. This means that the partition states will never be changed and no rebalance will start.

was (Author: jokser):
The actual problem was introduced in https://issues.apache.org/jira/browse/IGNITE-8684 .
The key problem that partition state changes now happened only after receiving FullMap with exchangeId (PME).
There can be race between handling FullMap with echangeId != null (PME) and FullMap without exchangeId.
If we receive fresh FullMap without exchangeId earlier than with, we override our local partition states, and FullMap with exchangeId will be rejected as outdated. It means that the partition states will not be changed and no rebalance will start.

> LocalNodeMovingPartitionsCount metrics may calculates incorrect due to processFullPartitionUpdate
> -------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-9309
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9309
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.6
>            Reporter: Maxim Muzafarov
>            Priority: Major
>
> [~qvad] has found an incorrect {{LocalNodeMovingPartitionsCount}} metric calculation on client node {{JOIN\LEFT}}. A full reproducer for the issue is not available.
> Probable scenario:
> {code}
> Repeat 10 times:
> 1. stop node
> 2. clean lfs
> 3. add stopped node (trigger rebalance)
> 4. 3 times: start 2 clients, wait for topology snapshot, close clients
> 5. for each cache group check JMX metrics LocalNodeMovingPartitionsCount (like waitForFinishRebalance())
> {code}
> The whole discussion and all configuration details can be found in the comments of [IGNITE-7165|https://issues.apache.org/jira/browse/IGNITE-7165].

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
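
The race described in the comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ignite code: the classes `FullMap` and `Node`, the numeric `version` field, and the `rebalanceStarted` flag are all hypothetical stand-ins. It assumes a node rejects any map that is not newer than the last one it applied, and that only a map carrying an exchangeId can trigger the partition state change.

```java
/** Toy model of the FullMap ordering race. NOT Ignite code; all names are hypothetical. */
public class FullMapRaceSketch {
    /** Simplified full partition map message. */
    static final class FullMap {
        final long version;       // assumed freshness marker of the map
        final String exchangeId;  // null when the map is not sent as part of PME

        FullMap(long version, String exchangeId) {
            this.version = version;
            this.exchangeId = exchangeId;
        }
    }

    /** Simplified node that applies full maps in arrival order. */
    static final class Node {
        long appliedVersion = 0;
        boolean rebalanceStarted = false;

        void onFullMap(FullMap msg) {
            // Assumption: a map that is not newer than the applied one is rejected as outdated.
            if (msg.version <= appliedVersion)
                return;

            // The local partition states are overridden by the fresher map.
            appliedVersion = msg.version;

            // Assumption: partition states change (and rebalance starts) only for PME maps.
            if (msg.exchangeId != null)
                rebalanceStarted = true;
        }
    }

    /** Delivers the two maps in the given order and reports whether rebalance started. */
    static boolean rebalanceStarts(boolean exchangeMapFirst) {
        FullMap withExchange = new FullMap(1, "exchange-1");
        FullMap withoutExchange = new FullMap(2, null); // the "fresh" map without exchangeId

        Node node = new Node();
        if (exchangeMapFirst) {
            node.onFullMap(withExchange);
            node.onFullMap(withoutExchange);
        } else {
            node.onFullMap(withoutExchange);
            node.onFullMap(withExchange); // rejected as outdated
        }
        return node.rebalanceStarted;
    }

    public static void main(String[] args) {
        System.out.println("PME map first:  rebalance started = " + rebalanceStarts(true));
        System.out.println("PME map second: rebalance started = " + rebalanceStarts(false));
    }
}
```

In this model the first ordering starts the rebalance and the second never does, which matches the behaviour described in the comment: the map with the exchangeId arrives late, is dropped as stale, and nothing else triggers the state change.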