Date: Tue, 31 Oct 2017 03:34:00 +0000 (UTC)
From: "Xiao Chen (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-14960) Add GC time percentage monitor/alerter

    [ https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226186#comment-16226186 ]

Xiao Chen commented on HADOOP-14960:
------------------------------------

Thanks for contributing this pretty cool GC monitor, Misha! Looks good to me in general. Review comments, mostly nits:

- High level: have you thought about using one data structure to hold both the gcPause and timestamp ring buffers? I think a SortedMap would fit our use case well. The current way (two arrays) may be the most efficient, but I'd prefer readability here, since I assume these buffers are a very small part of the memory consumption of the service running them. (A rough sketch of this idea follows after this list.)
- Suggest doing some input validation in the {{GcTimeMonitor}} constructor: timestamps should not be negative, and we should not create too big a buffer (this may go away if we change data structures :) ). {{maxGcTimePercentage}} also only makes sense for a value in (0, 100).
- The comment here is misleading: the loop does not actually discard any buffer entries, it only advances {{startIdx}} past them.
{code}
// Discard buffer entries that are older than curTime - observationWindowMs
long startObsWindowTs = ts - observationWindowMs;
while (tsBuf[startIdx] < startObsWindowTs && startIdx != endIdx) {
  startIdx = (startIdx + 1) % bufSize;
}
{code}
- For {{startIdx}} and {{endIdx}}, we should handle integer overflows of {{(index + 1) % bufSize}}. Maybe have a method like {{incrementInRing}} (see the second sketch after this list).
- We can {{Preconditions.checkNotNull}} the input param in {{JvmMetrics#setGcTimeMonitor}}. (I know the current setPauseMonitor doesn't check; let's do better than that :) )
- Should we make the methods synchronized so we don't have to worry about {{the user observes inconsistent values}}?
- Typo in test: change
{code}
// Run this for at least 1 sec for our monitor collects enough data
{code}
to
{code}
// Run this for at least 1 sec for our monitor to collect enough data
{code}
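To make the SortedMap idea concrete, here is a rough sketch of how the window bookkeeping could look with a single map. All names here ({{GcWindow}}, {{gcTimeByTs}}, {{record}}) are made up for illustration and are not from the patch; it keys the cumulative GC time by timestamp, folds in the constructor validation mentioned above, and computes the GC time percentage as a simple difference over the window:

{code}
import java.util.TreeMap;

// Hypothetical sketch, not from the patch: one sorted map replaces the
// two parallel ring buffers (gcPause + timestamp).
class GcWindow {
  private final long observationWindowMs;
  private final int maxGcTimePercentage;
  // Key: wall-clock timestamp (ms); value: cumulative GC time (ms) seen so far.
  private final TreeMap<Long, Long> gcTimeByTs = new TreeMap<>();

  GcWindow(long observationWindowMs, int maxGcTimePercentage) {
    // The input validation suggested above.
    if (observationWindowMs <= 0) {
      throw new IllegalArgumentException("observationWindowMs must be positive");
    }
    if (maxGcTimePercentage <= 0 || maxGcTimePercentage >= 100) {
      throw new IllegalArgumentException("maxGcTimePercentage must be in (0, 100)");
    }
    this.observationWindowMs = observationWindowMs;
    this.maxGcTimePercentage = maxGcTimePercentage;
  }

  synchronized void record(long ts, long cumulativeGcTimeMs) {
    gcTimeByTs.put(ts, cumulativeGcTimeMs);
    // Really discard entries older than the observation window.
    gcTimeByTs.headMap(ts - observationWindowMs).clear();
  }

  // GC time within the window, as a percentage of the window span.
  synchronized int gcTimePercentage() {
    if (gcTimeByTs.size() < 2) {
      return 0;
    }
    long gcTime = gcTimeByTs.lastEntry().getValue() - gcTimeByTs.firstEntry().getValue();
    long span = gcTimeByTs.lastKey() - gcTimeByTs.firstKey();
    return span == 0 ? 0 : (int) (100 * gcTime / span);
  }

  // True if the alert threshold is exceeded: the "alerter" half of the issue.
  synchronized boolean shouldAlert() {
    return gcTimePercentage() > maxGcTimePercentage;
  }
}
{code}

Note that {{headMap(...).clear()}} really does discard the stale entries, and making the methods synchronized covers the inconsistent-values concern as well.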
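Similarly, a minimal sketch of {{incrementInRing}} (the name is from my comment above; it assumes the enclosing class has a {{bufSize}} field and only ever passes in indexes in [0, bufSize)):

{code}
private int incrementInRing(int index) {
  // Compare against bufSize instead of computing (index + 1) % bufSize:
  // the index is kept in [0, bufSize) by construction, so index + 1
  // can never overflow here.
  int next = index + 1;
  return next == bufSize ? 0 : next;
}
{code}

The window loop above would then read {{startIdx = incrementInRing(startIdx);}}, keeping the wrap-around logic and its bounds invariant in one place.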
> Add GC time percentage monitor/alerter
> --------------------------------------
>
>                 Key: HADOOP-14960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>        Attachments: HADOOP-14960.01.patch
>
>
> Currently the class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides several metrics related to GC. Unfortunately, these metrics are not as useful as they could be, because they don't answer the first and most important question about GC and JVM health: what percentage of time is my JVM paused in GC? This percentage, calculated as the sum of the GC pauses over some period (like 1 minute) divided by that period, is the most convenient measure of GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90% is really bad
> - it allows for easy apples-to-apples comparison between runs, even between different apps
> - when this metric reaches some critical value like 70%, it almost always indicates a "GC death spiral", from which the app can recover only if it drops some task(s), etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give numbers that can be used to roughly estimate this percentage. Thus it is suggested to add a new metric to this class, and possibly allow users to register handlers that will be automatically invoked if this metric reaches a specified threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org