hadoop-common-issues mailing list archives

From "Misha Dmitriev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14960) Add GC time percentage monitor/alerter
Date Tue, 07 Nov 2017 02:55:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241399#comment-16241399
] 

Misha Dmitriev commented on HADOOP-14960:
-----------------------------------------

The test failures above seem unrelated to this patch. [~xiaochen], could you rerun the build for
my latest patch?

> Add GC time percentage monitor/alerter
> --------------------------------------
>
>                 Key: HADOOP-14960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HADOOP-14960.01.patch, HADOOP-14960.02.patch, HADOOP-14960.03.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides several metrics
related to GC. Unfortunately, these metrics are not as useful as they could be, because
they don't answer the first and most important question about GC and JVM health: what
percentage of time is my JVM paused in GC? This percentage, calculated as the sum of the GC
pauses over some period (for example, 1 minute) divided by that period, is the most convenient
measure of GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90% is really
bad
> - it allows for easy apples-to-apples comparison between runs, even between different apps
> - when this metric reaches some critical value like 70%, it almost always indicates a
"GC death spiral", from which the app can recover only if it drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give numbers that
can be used to roughly estimate this percentage. Thus it is suggested to add a new metric to
this class, and possibly allow users to register handlers that will be automatically invoked
if this metric reaches the specified threshold.
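
The calculation described above (GC time accumulated over a window, divided by the window length) could be sketched roughly as follows. This is not the code from the attached patches. The class name GcTimePercentageSketch and its methods are hypothetical, chosen only for illustration. The sketch relies on the standard java.lang.management API, where getCollectionTime() reports approximate accumulated collection time; for concurrent collectors that value may include time that is not a stop-the-world pause, so the result is only an estimate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical illustration only; not the HADOOP-14960 implementation.
class GcTimePercentageSketch {

    /** Sums the cumulative collection time (ms) reported by all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time within a window as an integer percentage, clamped to 0..100. */
    static int gcPercent(long gcMillisInWindow, long windowMillis) {
        if (windowMillis <= 0) {
            return 0;
        }
        long pct = gcMillisInWindow * 100 / windowMillis;
        return (int) Math.max(0, Math.min(100, pct));
    }

    /** Observes the JVM for roughly windowMillis and returns the GC percentage. */
    static int sampleOnce(long windowMillis) throws InterruptedException {
        long gcBefore = totalGcTimeMillis();
        long start = System.nanoTime();
        Thread.sleep(windowMillis);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return gcPercent(totalGcTimeMillis() - gcBefore, elapsedMillis);
    }

    /** True when the percentage reaches the alert threshold (e.g. 70). */
    static boolean shouldAlert(int pct, int thresholdPct) {
        return pct >= thresholdPct;
    }
}
```

A caller might invoke sampleOnce(60_000) periodically from a daemon thread and run a registered handler whenever shouldAlert(pct, 70) returns true. The 70% threshold is taken from the description above; the one-shot sampling approach is an assumption made to keep the sketch short, and a real monitor would more likely keep a sliding window of samples.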



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

