From: Nikolay Izhikov
To: dev@ignite.apache.org
Subject: Re: [IEP-35] GridJobProcessorMetrics migration
Date: Mon, 24 Jun 2019 15:14:06 +0300

Hello, Alex.

Thanks for the answer.

1. I actually don't understand your proposal :) Can you write it down?
What numbers should additionally be migrated in this PR?
Or is it OK for now?

> I think "idle time" is a useful metric

I think the "usefulness" or "uselessness" of a specific metric depends on the questions we can answer with it.
What questions can we ask about an Ignite instance and answer with the "idle time" metric?

> About execution and waiting time, it's not the right way to calculate it
> using a jobs list.

Same question here. What questions can we answer with the current numbers?

> Will jobs list contain only active jobs?

All jobs that are scheduled for execution on the node (active + waiting) should be in the list.

I'll try to give more details here to explain how I think about metrics and lists:

If you have issues with the jobs on a node, they can be of 2 kinds:

1. You are waiting for the results of some job and want to know why it doesn't execute.
   In this case, you should query the "jobs list" from Ignite.
   It answers:
   * Which jobs are currently executing?
   * How long has your job been waiting to be executed?
   You can also check graphs of the "activeJobs" and "waitingJobs" metrics to see how the jobs queue changes over time.
   It seems you can predict the start of your job from these numbers.

2. You want to understand the lifecycle of some finished (failed) job.
   In this case, you should analyze the log of the node.
   It should contain information about the time when:
   * the node received the job information
   * the job was added to the queue
   * the job started execution
   * the job finished (failed) execution.

I don't see questions we can't answer from these sources.
Do we have any?
How can the numbers from the current GridJobMetrics help with these questions?

> But, what if a user doesn't use any
> external monitoring system and wants to know the health of Ignite instance?

It depends on how we define "health". And it's not a trivial question :)

> Do we have any plans to implement some simple aggregator and ship it with Ignite?

I think NO. We shouldn't do it.

> Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems?

I think we shouldn't do it.
Monitoring presets depend heavily on the usage scenario,
and usage scenarios vary widely for Ignite.

On Mon, 24/06/2019 at 12:46 +0300, Alex Plehanov wrote:
> Hi Nikolay,
>
> I think "idle time" is a useful metric, but it can be calculated outside of
> Ignite using external monitoring system.
>
> About execution and waiting time, it's not the right way to calculate it
> using a jobs list. Will jobs list contain only active jobs? In this case,
> you can't calculate these metrics at all, since you don't know the time of
> finished jobs. If the list will contain all jobs (will it be unlimited?),
> iterating over this list will be resource consuming. In any way, it's much
> simpler (and sometimes only possible) for an external monitoring system to
> just get some scalar metric than iterate over a list with some condition.
>
> About aggregation, yes, in an ideal world aggregation should be done with
> the external monitoring system. But, what if a user doesn't use any
> external monitoring system and wants to know the health of Ignite instance?
> Do we have any plans to implement some simple aggregator and ship it with
> Ignite?
> Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems? (These questions not related to this PR, but
> related to IEP at all)
>
> Also, some aggregation metrics ("max" for example) can't be effectively
> calculated using the external system (you should iterate over a jobs list
> again and still precision of such calculation will be no more than the time
> between probes).
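
The point in the quoted message about "max" (an external system that polls a jobs list can only estimate it to within the time between probes) can be illustrated with a small simulation. This is only a sketch under stated assumptions: `Job`, `true_max_wait` and `probed_max_wait` are hypothetical names invented for the illustration, not Ignite APIs, and times are plain integers in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical record for illustration only; not an Ignite class.
    queued_at: int   # time (ms) the job entered the queue
    started_at: int  # time (ms) the job left the queue and began executing


def true_max_wait(jobs):
    """Max waiting time as the node itself could track it (a scalar metric)."""
    return max((j.started_at - j.queued_at for j in jobs), default=0)


def probed_max_wait(jobs, probe_interval, horizon):
    """Max waiting time as seen by an external system polling the jobs list.

    At each probe the poller sees only the jobs that are waiting at that
    instant, and can measure only how long each has waited so far.
    """
    best = 0
    for t in range(0, horizon + 1, probe_interval):
        for j in jobs:
            if j.queued_at <= t < j.started_at:
                best = max(best, t - j.queued_at)
    return best


if __name__ == "__main__":
    jobs = [Job(100, 900), Job(1100, 1950), Job(2100, 2200)]
    print("true max wait:", true_max_wait(jobs))
    print("probed every 1000 ms:", probed_max_wait(jobs, 1000, 3000))
    print("probed every 100 ms:", probed_max_wait(jobs, 100, 3000))
```

With this sample data the true maximum is 850 ms. Probing every 1000 ms happens to miss every waiting job and reports 0, while probing every 100 ms reports 800, which is off by less than one probe interval. A scalar maintained inside the node does not have this sampling error.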