hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
Date Fri, 08 Apr 2016 04:22:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231612#comment-15231612

Sangjin Lee commented on YARN-3816:

[~gtCarrera9], thanks much for posting an updated patch for this! I just had an opportunity
to go over it fairly completely once, and have some high-level comments as well as more detailed
code feedback.

Starting with high-level comments:
1. "aggregation" v. "accumulation"
This came up several times on this JIRA, and I think the distinction is crucial in getting
this completed. I believe what we agreed on is as follows: "aggregation" is about rolling
up metrics from a child type to a parent type (e.g. rolling up metrics from containers to
applications), and "accumulation" is about computing/deriving secondary values based on the
*time dimension* (e.g. area under the curve or the running maximum). Those two are rather
independent, and we should not mix them.

Unfortunately in the latest patch, these two terms are used very much interchangeably. Can
we make that distinction clear and rename all the classes/methods/variables that pertain to
accumulation from "aggregation" to "accumulation"? It would be good if we reserve "aggregation"
to child-to-parent rollups.

2. container-to-application aggregation
Related to above, this JIRA was meant to implement 2 features: (1) "aggregating" metrics from
containers to applications, and (2) "accumulating" metrics for (certain) entity types. Both
should be done. However, in the latest patch, *I do not see (1) being done*. In other words,
I didn't find code that rolls up metrics from the container entities and sets them to the
parent application entities. Am I missing something? The previous patches did implement that.
Without this, we will *NOT* see things like container CPU or memory being rolled up to applications,
and as a consequence to flow runs, and so on. This is a MUST.

IMO that is a separate functionality from the accumulation. I think we should do it clearly
and explicitly. And the rolled-up metrics should be set onto the application entities.

3. time-based accumulation
We also said that the time-based accumulation should be conditional on a configuration (see
[the previous patch|https://issues.apache.org/jira/secure/attachment/12761120/YARN-3816-YARN-2928-v4.patch]).
I see that condition is not there in the latest patch. Can we please make the accumulation
conditional on that configuration?

Also, this was an issue with the previous patches and I think it exists with the latest patch.
It appears that we are doing the time-based accumulation for *all metrics for all entity types*.
We might want to think about whether that would be OK. There are some performance and storage
implications in doing so. Also, I raised some semantic issues with that idea. See the previous
comment [here|https://issues.apache.org/jira/browse/YARN-3816?focusedCommentId=15067321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15067321].
I'm not 100% certain if the latest patch has the same issue or not although I suspect it might.

I also raised a concern whether we should use a separate entity type for this. First of all,
the "aggregation" (from containers to applications) *should* go to the actual application
type. Second, even for "accumulation" you might want to think about what you want to do. I
assume that the accumulated metrics (YARN_APPLICATION_AGGREGATION) are being written to the
entities table. Note that they are not really considered as part of the application, and are
not available for application queries. So there is an implication for queries. And they are
not going to be aggregated up to the flow runs.

I know this is a lot to parse, and obviously there is much history in this discussion. However,
it would help to replay the main discussions up to this point so that we don't lose these
important points. Thanks much!

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: yarn-2928-1st-milestone
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch,
YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch,
YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, YARN-3816-YARN-2928-v3.patch,
YARN-3816-YARN-2928-v4.patch, YARN-3816-YARN-2928-v5.patch, YARN-3816-feature-YARN-2928.v4.1.patch,
YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: resource (CPU,
Memory) consumption across all containers, number of containers launched/completed/failed,
etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show
details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based on Application-level
aggregations rather than raw entity-level data as much less raws need to scan (with filter
out non-aggregated entities, like: events, configurations, etc.).

This message was sent by Atlassian JIRA

View raw message