hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics
Date Thu, 17 Sep 2015 00:44:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791374#comment-14791374

Junping Du commented on YARN-3816:

Reply to Varun's comments:

bq. In TimelineMetric#accumulateTo, can latestMetric be TIME_SERIES ? If not(seems to be the
case as per current code), is the else part of the condition if (latestMetric.getType().equals(Type.SINGLE_VALUE))
{ required ? Wont be handling TIME_SERIES then ?
I am not sure if I understand your comments correctly. But it definitely support TIME_SERIES
type for latestMetrics and handle two types separately.

bq. One of the aggregateMetrics method in TimelineCollector is not required if you do not
plan to use it elsewhere. Agree with Naga that a set should suffice here.
YARN-3817 should use this. Will replace with SET as above comments.

bq. Default value can be mentioned explicity for yarn.timeline-service.aggregation.accumulation.enabled
in config file.
The same comments as above.

bq. REST API code can be removed I guess if not required for POC.
We use this REST API in our previous PoC. Any significant problem with these REST API code?
If not, we can leave it here and enhance it later.

bq. BTW, the aggregate flag appended into column qualifer will be used by offline aggregator?
Yes. I think it can be used by offline aggregator in YARN-3817. Li, would you confirm this?

Reply to Li's comments:
bq. TimelineAggregationBasis.java, Shall we differentiate realtime and offline aggregations?
IIUC, APPLICATION type represents realtime aggregation while the other two represents offline
I think we can also add another enum of AggregationType to have {online, offline} which is
orthogonal to aggregation level here. Until this JIRA, we haven't invovle offline aggregation
concept, may be add the concept of offline aggregation in YARN-3817 sounds better?

bq. setToAggregate(), why do we need a final here?
Theoritically, all parameters in method should be added with final tag to mark as immutable
except we want to change it inside of method. This is also true for local variable that we
don't plan to reassign it after first assignment. Although our convention doesn't force any
rules on this, it should be better and let's leave it here.

bq. So in conclusion, if t1 is null, we set delta to 0 since it's the first value to be aggregated.
Or else, we aggregate the delta?
That's true. I will update the comments which is slightly confusing people.

bq. I'm fine to keep the else part of the latestMetric.getType().equals(Type.SINGLE_VALUE)
for now. Maybe we'd like to update this part when implementing TODO.
TODO is for checking metric_id, not type. We do support latestMetric as TIME_SERIES instead
of SINGLE_VALUE, but we pick up the last (latest time) value in our logic. The TODO is saying
we should check the metric id (CPU, MEMORY, etc.) should be compatible as we don't want to
aggregate CPU metric into Memory metric. Today, this is guranteed by method caller's logic.
Someday later, we should also check inside of method as other callers could make wrong on
this. In addition, we cannot easily check metric_id should be equal as we need to handle cases
that we could do accumulation on different (but compatible) metric IDs, like: CPU metrics
accumulated on CPU_AREA metrics, etc. 

bq. l.183, I noticed we're copying the whole list to rebuild the valueList, then only used
one element there? Are we sure the values list is small enough all the time?
Do you mean the if case for "op.equals(Operation.REP)"? That's good point. Will fix it in
next patch.

bq. TimelineCollector.java, so we put intermediate aggregation status into the collector as
some hash maps. Will there be challenges to rebuild these status? We may need to face the
situations like nm restart (we're an aux service inside nm).
These hashmaps should be easily rebuild with accessing the raw entity table for all existing
container metrics during NM restart.

bq. We need a method similar to appendAggregatedMetricsToEntities in YARN-3817. Actually I
believe this is a very helpful method in normal aggregation workflow. It would be great if
we can find a more visible place to put it.
Ok. I can make it as static method somewhere to share.

bq. I have one question about doAccumulation: in offline aggregation we're using a two-staged
aggregation: we perform a flow run level aggregation in the mapper, and then aggregate entities
in the same flow again in the reducer for entities from different mappers. In this case, we
need to set the doAccumulation parameter into false for the aggregation in the reducer, right?
I think so. Your reducer seems to be just sum up flow run metrics (aggregated and accumulated)
together so you should set doAccumulation to false.

Will update patch according to above comments soon.

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch,
YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch,
YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: resource (CPU,
Memory) consumption across all containers, number of containers launched/completed/failed,
etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show
details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based on Application-level
aggregations rather than raw entity-level data as much less raws need to scan (with filter
out non-aggregated entities, like: events, configurations, etc.).

This message was sent by Atlassian JIRA

View raw message