Date: Sat, 2 Jun 2018 17:39:00 +0000 (UTC)
From: "Steve Loughran (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.13112651.1509122477000.88653.1527961140337@Atlassian.JIRA>
In-Reply-To: <JIRA.13112651.1509122477000@Atlassian.JIRA>
References: <JIRA.13112651.1509122477000@Atlassian.JIRA> <JIRA.13112651.1509122477396@jira-lw-us.apache.org>
Subject: [jira] [Updated] (HADOOP-14989) metrics2 JMX cache refresh result
 in inconsistent Mutable(Stat|Rate) values
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


     [ https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-14989:
------------------------------------
    Target Version/s: 2.7.8  (was: 2.7.7)

> metrics2 JMX cache refresh result in inconsistent Mutable(Stat|Rate) values
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-14989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14989
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 2.6.5
>            Reporter: Erik Krogen
>            Priority: Critical
>         Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it is based on {{MutableStat}}) means that every time the value is snapshotted, all previous information is lost. So every time a JMX cache refresh occurs, it resets the {{MutableStat}}, meaning that the configured metrics sinks do not consider the previous statistics in their emitted values. The same behavior occurs if you configure multiple sink periods.
> {{MutableStat}}, to compute its average value, maintains a total value since the last snapshot, as well as an operation count since the last snapshot. Upon snapshotting, the average is calculated as (total / opCount) and placed into a gauge metric, and the total and operation count are cleared. So the average value represents the average since the last snapshot. If only a single sink period ever triggers snapshots, this results in the expected behavior: the value is the average over the reporting period. However, if multiple sink periods are configured, or if the JMX cache is refreshed, each of these is another snapshot operation. So, for example, if you have a FileSink configured at a 60 second interval and your JMX cache refreshes itself 1 second before the FileSink period fires, the values emitted to your FileSink only represent averages _over the last one second_.
> A few ways to solve this issue:
> * Make {{MutableRate}} manage its own average refresh, similar to {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the last quantile values that it will serve up until the next refresh. Given how many {{MutableRate}} metrics there are, a thread per metric is not really feasible, but the refresh could be done on, e.g., a per-source basis. This has some downsides: if multiple sinks are configured with different periods, what is the right refresh period for the {{MutableRate}}?
> * Make {{MutableRate}} emit two counters, one for total and one for operation count, rather than an average gauge and an operation count counter. The average could then be calculated downstream from this information. This is cumbersome for operators and not backwards compatible. To improve on both of those downsides, we could have it keep the current behavior but _additionally_ emit the total as a counter. The snapshotted average is probably sufficient in the common case (we've been using it for years), and when more guaranteed accuracy is required, the average could be derived from the total and operation count.
> The two suggestions above would fix this for both JMX and multiple sink periods, but may be overkill. Multiple sink periods are probably not necessary, though we should at least document the behavior.
> Open to suggestions & input here.
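
The snapshot-and-reset behavior described in the issue can be reproduced with a small stand-alone sketch. This is not the Hadoop implementation: {{SimpleStat}} is a hypothetical stand-in for {{MutableStat}} that keeps only a total and an operation count, and the timings (59 operations of 10 ms, then one of 100 ms) are made up to mirror the FileSink example.

```java
final class SnapshotResetDemo {
    /** Hypothetical stand-in for MutableStat: total and count since the last snapshot. */
    static final class SimpleStat {
        private double total;
        private long opCount;

        void add(double value) {
            total += value;
            opCount++;
        }

        /** Returns the average since the last snapshot, then clears the accumulators. */
        double snapshotAverage() {
            double avg = opCount == 0 ? 0.0 : total / opCount;
            total = 0.0;
            opCount = 0;
            return avg;
        }
    }

    /** Returns {value seen by the JMX refresh, value seen by the 60 s sink}. */
    static double[] run() {
        SimpleStat stat = new SimpleStat();
        for (int second = 0; second < 59; second++) {
            stat.add(10.0);                         // one 10 ms operation per second
        }
        double jmxValue = stat.snapshotAverage();   // JMX cache refresh at t = 59 s
        stat.add(100.0);                            // the only operation in the final second
        double sinkValue = stat.snapshotAverage();  // sink period fires at t = 60 s
        return new double[] {jmxValue, sinkValue};
    }

    public static void main(String[] args) {
        double[] r = run();
        System.out.println("JMX refresh saw average: " + r[0]);
        System.out.println("Sink saw average:        " + r[1]);
    }
}
```

In this sketch the sink reports 100.0, the average over the last second only, although the average over the full 60 seconds is 690 / 60 = 11.5.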

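As a rough illustration of the second suggestion (emit the total as a counter alongside the operation count and derive the average downstream), the sketch below uses hypothetical {{CumulativeStat}} and {{Reader}} classes; none of these names come from Hadoop. Reads never reset the stat, and each consumer computes its average from the delta between its own consecutive readings, so one consumer cannot disturb another.

```java
final class CumulativeStatDemo {
    /** Immutable pair of counter values taken at one instant. */
    static final class Reading {
        final double total;
        final long opCount;

        Reading(double total, long opCount) {
            this.total = total;
            this.opCount = opCount;
        }
    }

    /** Hypothetical stat exposing only monotonically increasing counters. */
    static final class CumulativeStat {
        private double total;
        private long opCount;

        synchronized void add(double value) {
            total += value;
            opCount++;
        }

        /** Reads both counters together; does not reset anything. */
        synchronized Reading read() {
            return new Reading(total, opCount);
        }
    }

    /** Hypothetical consumer (a sink or a JMX reader) that remembers its previous reading. */
    static final class Reader {
        private Reading last = new Reading(0.0, 0);

        double averageSinceLastRead(CumulativeStat stat) {
            Reading now = stat.read();
            long ops = now.opCount - last.opCount;
            double avg = ops == 0 ? 0.0 : (now.total - last.total) / ops;
            last = now;
            return avg;
        }
    }

    /** Returns {average seen by JMX at t = 59 s, average seen by the sink at t = 60 s}. */
    static double[] run() {
        CumulativeStat stat = new CumulativeStat();
        Reader jmx = new Reader();
        Reader sink = new Reader();
        for (int second = 0; second < 59; second++) {
            stat.add(10.0);
        }
        double jmxValue = jmx.averageSinceLastRead(stat);   // JMX read at t = 59 s
        stat.add(100.0);
        double sinkValue = sink.averageSinceLastRead(stat); // sink read at t = 60 s
        return new double[] {jmxValue, sinkValue};
    }

    public static void main(String[] args) {
        double[] r = run();
        System.out.println("JMX derived average:  " + r[0]);
        System.out.println("Sink derived average: " + r[1]);
    }
}
```

Here the sink's derived average covers its whole 60 second period (690 / 60 = 11.5) regardless of the intervening JMX read. The cost, as the issue notes, is that the averaging moves to the consumer.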
