hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
Date Mon, 14 Dec 2015 17:47:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056362#comment-15056362

Junping Du commented on YARN-3816:

Thanks [~sjlee0], [~varun_saxena] and Li for the comments. I am rebasing the patch on YARN-4356
and incorporating your comments above. Some quick responses to your major comments, for
more feedback:
bq. It appears that the current code will aggregate metrics from all types of entities to
the application. This seems problematic to me. The main goal of this aggregation is to roll
up metrics from individual containers to the application. But just by having the same metric
id, any entity can have its metric aggregated by this (incorrectly). For example, any arbitrary
entity can simply declare a metric named "MEMORY". By virtue of that, it would get aggregated
and added to the application-level value. There can be variations of this: for example, the
same metrics can be reported by the container entity, app attempt entity, and so on. Then
the values may be aggregated double or triple. I think we should ensure strongly that the
aggregation happens only along the path of YARN container entities to application to prevent
these accidental cases.
That sounds like a reasonable concern. I agree that we should keep system metrics and
application metrics from getting mixed up. However, I think our goal here is not
just to aggregate/accumulate container metrics, but also to provide an aggregation service for
applications' metrics (other than MR), isn't it? If so, maybe a better way is to aggregate metrics by
not only the metric name but also the original entity type (so memory metrics for ContainerEntity
won't be aggregated against memory metrics from Application Entity). [~sjlee0], What do you
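
To illustrate the idea of keying aggregation on both the metric name and the original entity
type, here is a minimal Java sketch. ReportedMetric and TypedAggregationSketch are hypothetical
names used only for illustration; they are not classes from the patch or from YARN, and the
entity type strings are just example labels.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a reported metric (not the real TimelineMetric).
final class ReportedMetric {
    final String entityType; // example label, e.g. "YARN_CONTAINER"
    final String metricId;   // e.g. "MEMORY"
    final long value;

    ReportedMetric(String entityType, String metricId, long value) {
        this.entityType = entityType;
        this.metricId = metricId;
        this.value = value;
    }
}

final class TypedAggregationSketch {
    // Sums values per (entity type, metric id) pair, so that metrics sharing a
    // name but coming from different entity types are never added together.
    static Map<String, Long> aggregate(List<ReportedMetric> metrics) {
        Map<String, Long> totals = new HashMap<>();
        for (ReportedMetric m : metrics) {
            String key = m.entityType + "/" + m.metricId;
            totals.merge(key, m.value, Long::sum);
        }
        return totals;
    }
}
```

With this keying, a "MEMORY" metric declared by an arbitrary entity type ends up in its own
bucket instead of being added to the container-level total.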

bq. On a semi-related note, what happens if clients send metrics directly at the application
entity level? We should expect most framework-specific AMs to do that. For example, MR AM
already has all the job-level counters, and it can (and should) report those job-level counters
as metrics at the YARN application entity. Is that case handled correctly, or will we end
up getting incorrect values (double counting) in that situation?
That's why we need the toAggregate() API in TimelineMetric. Metrics that are already
aggregated (like the MR AM's counters) should set it to false to avoid double counting. Sounds
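
As a rough sketch of how such a flag could prevent double counting: the classes below are
hypothetical simplifications, not the TimelineMetric API from the patch. The only thing taken
from the discussion is the idea of a per-metric boolean that says whether the value should be
rolled up.

```java
import java.util.List;

// Hypothetical metric carrying a roll-up flag, loosely modelled on the toAggregate() idea.
final class FlaggedMetric {
    final String id;
    final long value;
    final boolean toAggregate; // false for values that are already aggregated

    FlaggedMetric(String id, long value, boolean toAggregate) {
        this.id = id;
        this.value = value;
        this.toAggregate = toAggregate;
    }
}

final class ToAggregateSketch {
    // Sums only the metrics that match the id and have opted in to aggregation.
    static long rollUp(String metricId, List<FlaggedMetric> metrics) {
        long total = 0;
        for (FlaggedMetric m : metrics) {
            if (m.toAggregate && m.id.equals(metricId)) {
                total += m.value;
            }
        }
        return total;
    }
}
```

In this sketch, a job-level counter reported by the AM with the flag set to false is skipped
by the roll-up, so only the per-container values contribute to the application-level sum.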

bq. calculating area under the curve along the time dimension, would it be useful by itself?
Average based on this area under the curve seems more useful.
Yes. Both the overall and the average values are useful from different standpoints. The former can
be used to represent how much resource the application actually consumed, which is very useful
for billing in cloud services, etc. We can extend this later to more values if we think it is worthwhile. Varun,
does that make sense?
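
For concreteness, here is a small Java sketch of both values computed from one time series. It
assumes the trapezoidal rule for the area, which is my choice for illustration and not
necessarily what the patch does; AreaUnderCurveSketch and its method names are hypothetical.

```java
// Hypothetical helper; timestamps are assumed to be sorted in ascending order.
final class AreaUnderCurveSketch {
    // Overall value: area under the curve, approximated with the trapezoidal rule.
    // The unit is (value unit) x (timestamp unit), e.g. MB-milliseconds.
    static double area(long[] timestamps, double[] values) {
        if (timestamps.length != values.length) {
            throw new IllegalArgumentException("timestamps and values must have the same length");
        }
        double area = 0.0;
        for (int i = 1; i < timestamps.length; i++) {
            long dt = timestamps[i] - timestamps[i - 1];
            area += dt * (values[i] + values[i - 1]) / 2.0;
        }
        return area;
    }

    // Average value: the area divided by the covered time span.
    static double timeAverage(long[] timestamps, double[] values) {
        if (timestamps.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        long span = timestamps[timestamps.length - 1] - timestamps[0];
        return span == 0 ? values[0] : area(timestamps, values) / span;
    }
}
```

The area corresponds to total consumption over the run (the billing-style number), while the
time average describes the typical level of usage during the same period.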

bq. There are 3 types of aggregation basis, but only application aggregation has its own entity
type. How do we represent the result entity of the other 2 types?
I don't quite understand the question here. Li, are you suggesting that we should remove the application
aggregation entity type, add flow/queue aggregation entity types, or keep them consistent?

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>              Labels: yarn-2928-1st-milestone
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch,
YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch,
YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, YARN-3816-YARN-2928-v3.patch,
YARN-3816-YARN-2928-v4.patch, YARN-3816-feature-YARN-2928.v4.1.patch, YARN-3816-poc-v1.patch,
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: resource (CPU,
Memory) consumption across all containers, number of containers launched/completed/failed,
etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show
details of states at the framework level.
> - Other levels of aggregation (Flow/User/Queue) can be more efficient if based on Application-level
aggregations rather than raw entity-level data, as far fewer rows need to be scanned (after filtering
out non-aggregated entities, like: events, configurations, etc.).

