Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Thu, 17 Sep 2015 00:44:46 +0000 (UTC)
From: "Junping Du (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12838495.1434552998000.361442.1442450686680@Atlassian.JIRA>
In-Reply-To: <JIRA.12838495.1434552998000@Atlassian.JIRA>
References: <JIRA.12838495.1434552998000@Atlassian.JIRA>
 <JIRA.12838495.1434552998098@arcas>
Subject: [jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation
 and accumulation for YARN system metrics
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791374#comment-14791374 ] 

Junping Du commented on YARN-3816:
----------------------------------

Reply to Varun's comments:

bq. In TimelineMetric#accumulateTo, can latestMetric be TIME_SERIES ? If not(seems to be the case as per current code), is the else part of the condition if (latestMetric.getType().equals(Type.SINGLE_VALUE)) { required ? Wont be handling TIME_SERIES then ?
I am not sure if I understand your comments correctly. But it definitely support TIME_SERIES type for latestMetrics and handle two types separately.

bq. One of the aggregateMetrics method in TimelineCollector is not required if you do not plan to use it elsewhere. Agree with Naga that a set should suffice here.
YARN-3817 should use this. Will replace with SET as above comments.

bq. Default value can be mentioned explicity for yarn.timeline-service.aggregation.accumulation.enabled in config file.
The same comments as above.

bq. REST API code can be removed I guess if not required for POC.
We use this REST API in our previous PoC. Any significant problem with these REST API code? If not, we can leave it here and enhance it later.

bq. BTW, the aggregate flag appended into column qualifer will be used by offline aggregator?
Yes. I think it can be used by offline aggregator in YARN-3817. Li, would you confirm this?


Reply to Li's comments:
bq. TimelineAggregationBasis.java, Shall we differentiate realtime and offline aggregations? IIUC, APPLICATION type represents realtime aggregation while the other two represents offline aggregation.
I think we can also add another enum of AggregationType to have {online, offline} which is orthogonal to aggregation level here. Until this JIRA, we haven't invovle offline aggregation concept, may be add the concept of offline aggregation in YARN-3817 sounds better?

bq. setToAggregate(), why do we need a final here?
Theoritically, all parameters in method should be added with final tag to mark as immutable except we want to change it inside of method. This is also true for local variable that we don't plan to reassign it after first assignment. Although our convention doesn't force any rules on this, it should be better and let's leave it here.

bq. So in conclusion, if t1 is null, we set delta to 0 since it's the first value to be aggregated. Or else, we aggregate the delta?
That's true. I will update the comments which is slightly confusing people.

bq. I'm fine to keep the else part of the latestMetric.getType().equals(Type.SINGLE_VALUE) for now. Maybe we'd like to update this part when implementing TODO.
TODO is for checking metric_id, not type. We do support latestMetric as TIME_SERIES instead of SINGLE_VALUE, but we pick up the last (latest time) value in our logic. The TODO is saying we should check the metric id (CPU, MEMORY, etc.) should be compatible as we don't want to aggregate CPU metric into Memory metric. Today, this is guranteed by method caller's logic. Someday later, we should also check inside of method as other callers could make wrong on this. In addition, we cannot easily check metric_id should be equal as we need to handle cases that we could do accumulation on different (but compatible) metric IDs, like: CPU metrics accumulated on CPU_AREA metrics, etc. 

bq. l.183, I noticed we're copying the whole list to rebuild the valueList, then only used one element there? Are we sure the values list is small enough all the time?
Do you mean the if case for "op.equals(Operation.REP)"? That's good point. Will fix it in next patch.

bq. TimelineCollector.java, so we put intermediate aggregation status into the collector as some hash maps. Will there be challenges to rebuild these status? We may need to face the situations like nm restart (we're an aux service inside nm).
These hashmaps should be easily rebuild with accessing the raw entity table for all existing container metrics during NM restart.

bq. We need a method similar to appendAggregatedMetricsToEntities in YARN-3817. Actually I believe this is a very helpful method in normal aggregation workflow. It would be great if we can find a more visible place to put it.
Ok. I can make it as static method somewhere to share.

bq. I have one question about doAccumulation: in offline aggregation we're using a two-staged aggregation: we perform a flow run level aggregation in the mapper, and then aggregate entities in the same flow again in the reducer for entities from different mappers. In this case, we need to set the doAccumulation parameter into false for the aggregation in the reducer, right?
I think so. Your reducer seems to be just sum up flow run metrics (aggregated and accumulated) together so you should set doAccumulation to false.

Will update patch according to above comments soon.

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: resource (CPU, Memory) consumption across all containers, number of containers launched/completed/failed, etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based on Application-level aggregations rather than raw entity-level data as much less raws need to scan (with filter out non-aggregated entities, like: events, configurations, etc.).


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)