From: Nikita Amelchev
Date: Fri, 19 Jul 2019 19:10:26 +0300
Subject: Re: Partition map exchange metrics
To: Pavel Kovalenko, dev@ignite.apache.org
Cc: av@apache.org, nizhikov@apache.org

Pavel,

The main purpose of this metric is
>> how much time we wait for resuming cache operations

It seems I misunderstood you. Do you mean a timestamp or a duration here?
>> What do you think if we change the boolean value of metric to a long
>> value that represents time in milliseconds when operations were blocked?

In the case of a timestamp, this time can be calculated as
(currentTime - timeSinceOperationsBlocked).

A duration will be more understandable. It'll be something like
getCurrentBlockingPmeDuration, but I haven't come up with a better name yet.

On Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko wrote:
>
> Nikita,
>
> I think getCurrentPmeDuration doesn't show useful information. The main
> PME side effect for end users is blocking cache operations. Not all PME
> time blocks them.
> What information does a timestamp like "timeSinceOperationsBlocked" give
> to an end user? What analysis can it be used for, and how?
>
> On Fri, 19 Jul 2019 at 17:48, Nikita Amelchev wrote:
>>
>> Hi Pavel,
>>
>> This time can already be obtained from the getCurrentPmeDuration and
>> new isOperationsBlockedByPme metrics.
>>
>> As an alternative solution, I can rework the recently added
>> getCurrentPmeDuration metric (not released yet). It seems useless for
>> users in the case of a non-blocking PME.
>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>> blocking started (the minimal value across cluster nodes) and 0 if
>> blocking has ended (there is no running PME).
>>
>> WDYT?
>>
>> On Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko wrote:
>> >
>> > Hi Nikita,
>> >
>> > Thank you for working on this. What do you think if we change the boolean
>> > value of metric to a long value that represents time in milliseconds when
>> > operations were blocked?
>> > Since we have not only JMX, and metrics are now periodically exported to
>> > some backend, it can give a clearer picture of how much time we wait for
>> > resuming cache operations than an instant boolean indicator.
>> >
>> > On Fri, 19 Jul 2019 at 14:41, Nikita Amelchev wrote:
>> >
>> > > Anton, Nikolay,
>> > >
>> > > Thanks for the support.
>> > >
>> > > For now, we have the getCurrentPmeDuration() metric, which does not
>> > > correctly show the influence on the cluster. A PME can run without
>> > > blocking operations, for example on client node join/leave events.
>> > >
>> > > I suggest adding a new metric, isOperationsBlockedByPme(). Together,
>> > > these metrics will show the influence of the PME on the cluster and on
>> > > user operations.
>> > >
>> > > I have prepared a PR for this (Bot visa is green). [1] Can anyone take
>> > > a look?
>> > >
>> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> > >
>> > > On Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov wrote:
>> > > >
>> > > > I think the administrator of an Ignite cluster should be able to
>> > > > monitor all Ignite processes, including non-blocking PME.
>> > > >
>> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> > > > > BTW,
>> > > > > Found a PME metric: getCurrentPmeDuration().
>> > > > > It seems to show exactly the PME time and is not so useful because of this.
>> > > > > The goal is to show exactly the blocking period.
>> > > > > When a PME causes no blocking, it's a good PME and I see no reason to have
>> > > > > monitoring related to it :)
>> > > > >
>> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov wrote:
>> > > > >
>> > > > > > Anton.
>> > > > > >
>> > > > > > Why do we need to postpone the implementation of these metrics?
>> > > > > > For now, implementing a new metric is very simple.
>> > > > > >
>> > > > > > I think we can implement these metrics as a single contribution.
>> > > > > >
>> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> > > > > > > Nikita,
>> > > > > > >
>> > > > > > > Looks like all we need now is one simple metric: are operations blocked?
>> > > > > > > Just a true or false.
>> > > > > > > Let's start from this.
>> > > > > > > All other metrics can be extracted from logs now and can be implemented
>> > > > > > > later.
>> > > > > > >
>> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhikov@apache.org>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > +1.
>> > > > > > > >
>> > > > > > > > Nikita, please, go ahead.
>> > > > > > > >
>> > > > > > > > On Tue, 16 Jul 2019 at 11:45, Nikita Amelchev wrote:
>> > > > > > > >
>> > > > > > > > > Hello, Igniters.
>> > > > > > > > >
>> > > > > > > > > I suggest adding some useful metrics about the partition map exchange
>> > > > > > > > > (PME). For now, the durations of PME stages are available only in log
>> > > > > > > > > files and cannot be obtained using JMX or other external tools. [1]
>> > > > > > > > >
>> > > > > > > > > I made a list of local node metrics that help to understand the
>> > > > > > > > > actual status of the current PME:
>> > > > > > > > >
>> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
>> > > > > > > > > 2. initTime. Time PME was started.
>> > > > > > > > > 3. initEvent. Event that triggered PME.
>> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for
>> > > > > > > > > all updates and transactions on a previous topology.
>> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
>> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
>> > > > > > > > > 7. finishTime. Time PME was ended.
>> > > > > > > > >
>> > > > > > > > > When a new PME starts, all these metrics are reset.
>> > > > > > > > >
>> > > > > > > > > These metrics help to understand:
>> > > > > > > > > - how long the PME was (current or previous).
>> > > > > > > > > - how long it waited for all updates to complete.
>> > > > > > > > > - which node blocks the PME (didn't send a single message).
>> > > > > > > > > - what triggered the PME.
>> > > > > > > > >
>> > > > > > > > > Thoughts?
>> > > > > > > > >
>> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Best wishes,
>> > > > > > > > > Amelchev Nikita
>> > > > > > > > >
>> > >
>> > > --
>> > > Best wishes,
>> > > Amelchev Nikita
>> > >
>>
>> --
>> Best wishes,
>> Amelchev Nikita

-- 
Best wishes,
Amelchev Nikita
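
P.S. To make the three variants discussed in this thread easier to compare (boolean flag, timestamp, duration), here is a minimal sketch. It is only an illustration under assumptions, not the code from the PR and not Ignite's API: the class BlockingDurationSketch, its callbacks and the injected clock are all hypothetical, and only the idea of "0 when nothing is blocked" and the (currentTime - timeSinceOperationsBlocked) formula come from the messages above.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not Ignite's implementation. */
final class BlockingDurationSketch {
    /** Timestamp (ms) when blocking started, or 0 when operations are not blocked. */
    private final AtomicLong blockedSince = new AtomicLong(0);

    /** Time source in milliseconds; injected so the sketch can be tested. */
    private final LongSupplier clock;

    BlockingDurationSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** Assumed callback: cache operations became blocked. Keeps the earliest timestamp. */
    void onOperationsBlocked() {
        // Assumes the clock never returns 0, since 0 is reserved for "not blocked".
        blockedSince.compareAndSet(0, clock.getAsLong());
    }

    /** Assumed callback: cache operations resumed. */
    void onOperationsUnblocked() {
        blockedSince.set(0);
    }

    /** Boolean variant: an instant indicator. */
    boolean isOperationsBlocked() {
        return blockedSince.get() != 0;
    }

    /** Timestamp variant: when blocking started, or 0 if nothing is blocked. */
    long timeSinceOperationsBlocked() {
        return blockedSince.get();
    }

    /** Duration variant: currentTime - timeSinceOperationsBlocked, or 0 if nothing is blocked. */
    long currentBlockingDuration() {
        long since = blockedSince.get();
        return since == 0 ? 0 : clock.getAsLong() - since;
    }
}
```

With the timestamp variant a consumer has to subtract the value from the current time itself; the duration variant does that on the node, so an exporter that samples periodically gets a number it can plot directly.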