hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics
Date Tue, 14 Jul 2015 16:14:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626567#comment-14626567 ]

Sangjin Lee commented on YARN-3816:
-----------------------------------

[~djp], thanks for your POC patch!

I understand that more things need to be added, but I wanted to share some initial comments
and questions.

(1)
If I understand correctly, this patch basically computes a *time integral* of a given metric,
i.e. "the area under the curve" of the metric as a function of time. For example, if the underlying
metric is container CPU usage, the "aggregated" metric according to {{TimelineMetric.aggregateTo()}}
would be the cumulative CPU usage over time for that container (in units of CPU-millis).
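
To make the "area under the curve" reading concrete, here is a minimal sketch of a time integral over sampled gauge values. It is not the patch's {{TimelineMetric.aggregateTo()}} code: the class name, the method name, and the use of the trapezoidal rule between samples are all assumptions made for illustration.

```java
// Hypothetical sketch; not the implementation in the POC patch.
final class TimeIntegralSketch {

    /**
     * Approximates the time integral of a gauge with the trapezoidal rule.
     * timestampsMs must be ascending and the same length as values.
     * The result is in (value units) x milliseconds, e.g. CPU-millis.
     */
    static double integrate(long[] timestampsMs, double[] values) {
        if (timestampsMs.length != values.length) {
            throw new IllegalArgumentException("timestamps and values differ in length");
        }
        double area = 0.0;
        for (int i = 1; i < timestampsMs.length; i++) {
            long dt = timestampsMs[i] - timestampsMs[i - 1];
            if (dt < 0) {
                throw new IllegalArgumentException("timestamps must be ascending");
            }
            // Area of the trapezoid between two consecutive samples.
            area += dt * (values[i] + values[i - 1]) / 2.0;
        }
        return area;
    }

    public static void main(String[] args) {
        long[] ts = {0L, 1000L, 2000L};
        double[] cpuCores = {0.5, 1.0, 0.5}; // assumed gauge: CPU cores in use
        System.out.println(integrate(ts, cpuCores)); // prints 1500.0
    }
}
```

With these three samples the result is 1500 CPU-millis for the single container, which is a per-entity accumulation over time and not a roll-up across entities.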

While this is certainly a useful number to keep track of, this was not the app-level aggregation
I had in mind. IMO, the app-level aggregation (or any aggregation for that matter) is all
about *rolling metrics up from child entities to the parent entity*. I would have thought
that it would be the first thing we want to get to. It looks, however, as though that aggregation
is not done in this patch. I don't see any code that rolls up values from containers to the
application. Are you planning to introduce that soon?
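
For contrast, the roll-up described above could look roughly like the following sketch, assuming each container reports its latest value per metric id and the application-level value is a plain sum. The class name, the nested-map layout, and the choice of sum as the aggregation operation are hypothetical and are not taken from the patch or the design document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a child-to-parent roll-up; not code from the patch.
final class AppRollupSketch {

    /**
     * Sums every metric across all containers of one application.
     * containerMetrics maps containerId -> (metricId -> latest value).
     */
    static Map<String, Long> rollUp(Map<String, Map<String, Long>> containerMetrics) {
        Map<String, Long> appMetrics = new HashMap<>();
        for (Map<String, Long> metrics : containerMetrics.values()) {
            for (Map.Entry<String, Long> e : metrics.entrySet()) {
                // Add this container's value to the running application total.
                appMetrics.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return appMetrics;
    }

    public static void main(String[] args) {
        Map<String, Long> c1 = new HashMap<>();
        c1.put("HDFS_BYTES_READ", 100L);
        Map<String, Long> c2 = new HashMap<>();
        c2.put("HDFS_BYTES_READ", 250L);

        Map<String, Map<String, Long>> containers = new HashMap<>();
        containers.put("container_1", c1);
        containers.put("container_2", c2);

        System.out.println(rollUp(containers)); // prints {HDFS_BYTES_READ=350}
    }
}
```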

(2)
This type of time integral works *only if* the underlying metric is a gauge. For example,
for any counter-like metric (e.g. HDFS bytes read) which is cumulative in nature, the time
integral does not make sense. We will need to introduce another type dimension to the metrics
that signifies whether it is a counter or a gauge, but this is just to note that the time
integral works only for gauges.

(3)
Also, this is pretty similar to what we talked about during the offline meeting as "average/max"
for gauges, except that it's not divided over time. We discussed that we want to introduce
time averages and maxes for gauges (see "time average & max" in https://issues.apache.org/jira/secure/attachment/12743390/aggregation-design-discussion.pdf).
Are we thinking of replacing that with this?
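
To show how the two relate, the sketch below computes a time average as the time integral divided by the elapsed time, alongside a simple max over the samples. The names and the trapezoidal approximation are assumptions for illustration; the design document may define these statistics differently.

```java
// Hypothetical sketch of "time average & max" for a gauge; names are illustrative.
final class GaugeStatsSketch {

    /** Time-weighted average: trapezoidal integral divided by elapsed time. */
    static double timeAverage(long[] timestampsMs, double[] values) {
        if (timestampsMs.length != values.length || values.length < 2) {
            throw new IllegalArgumentException("need at least two matching samples");
        }
        long elapsed = timestampsMs[timestampsMs.length - 1] - timestampsMs[0];
        if (elapsed <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        double area = 0.0;
        for (int i = 1; i < values.length; i++) {
            long dt = timestampsMs[i] - timestampsMs[i - 1];
            area += dt * (values[i] + values[i - 1]) / 2.0;
        }
        return area / elapsed;
    }

    /** Largest sampled value of the gauge. */
    static double max(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double m = values[0];
        for (double v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    public static void main(String[] args) {
        long[] ts = {0L, 1000L, 2000L};
        double[] cpuCores = {0.5, 1.0, 0.5};
        System.out.println(timeAverage(ts, cpuCores)); // prints 0.75
        System.out.println(max(cpuCores));             // prints 1.0
    }
}
```

Under these assumptions the only difference between the integral and the average is the final division by the elapsed time.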

(4)
In the specific case of container CPU usage, it seems to me that emitting the actual CPU time
millis directly would be a far easier and more accurate way to capture this info. I believe
it's readily available, and it would be a counter-like metric instead of a gauge. Therefore
the time integral doesn't apply (as it already is one). But all you need to do at the app-level
aggregation for it is just to sum it up. I recognize that this time integral would be useful
for other things, but just wanted to point that out.

I'd like to hear your thoughts on this. Thanks!

> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-poc-v1.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated state for each application, including resource (CPU,
> Memory) consumption across all containers, number of containers launched/completed/failed,
> etc. We need this for apps while they are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show
> framework-level state in detail.
> - Aggregation at other levels (Flow/User/Queue) can be done more efficiently on top of application-level
> aggregations than on raw entity-level data, as far fewer rows need to be scanned (after filtering
> out non-aggregated entities such as events, configurations, etc.).



