hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics
Date Tue, 04 Aug 2015 20:26:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654298#comment-14654298 ]

Junping Du commented on YARN-3816:
----------------------------------

Sorry for coming late to this. Thanks [~sjlee0], [~vrushalic] and [~gtCarrera9] for the review
and comments!
bq. I propose to introduce the second dimension to the metrics explicitly. This second dimension
nearly maps to "toAggregate" (and/or the REP/SUM distinction) in your patch. But I think it's
probably better to introduce the metric types explicitly as another enum or by subclassing
TimelineMetric. Let me know what you think.
Do you suggest using gauge and counter types to replace "toAggregate"? Whether a metric is a counter
or a gauge, we may still need to aggregate it, e.g. CPU usage as a gauge, or the number of map tasks
(launched, failed, etc.) as a counter (assuming each value is a tack-on increment rather than an
accumulated total). The idea behind "toAggregate" is to let the client indicate whether a metric value
should be added/aggregated with other values or is already a final value. If a client puts a metric
value that is already aggregated (like HDFS bytes written/read), the collector won't apply any
aggregation logic to it.
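
As a rough illustration of the "toAggregate" idea above, here is a minimal Java sketch. The classes
MetricSample and ToAggregateSketch are hypothetical stand-ins, not the real TimelineMetric or
TimelineCollector code from the patch; they only show values flagged for aggregation being summed
while final values pass through untouched.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metric value carrying the "toAggregate" flag.
final class MetricSample {
    final String id;
    final long value;
    final boolean toAggregate;

    MetricSample(String id, long value, boolean toAggregate) {
        this.id = id;
        this.value = value;
        this.toAggregate = toAggregate;
    }
}

final class ToAggregateSketch {
    // Rolls container-level samples up into one value per metric id.
    static Map<String, Long> rollUp(List<MetricSample> samples) {
        Map<String, Long> result = new HashMap<>();
        for (MetricSample s : samples) {
            if (s.toAggregate) {
                // Value should be added to the other values of the same metric.
                result.merge(s.id, s.value, Long::sum);
            } else {
                // Value is already final: keep the latest one as-is.
                result.put(s.id, s.value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<MetricSample> samples = List.of(
            new MetricSample("CPU", 30, true),
            new MetricSample("CPU", 50, true),
            new MetricSample("HDFS_BYTES_WRITTEN", 4000, false));
        System.out.println(rollUp(samples));
    }
}
```

Here the two CPU samples are summed to 80, while the HDFS_BYTES_WRITTEN value stays at 4000.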

bq. I'm still very confused by the usage of the word "aggregate". In this patch, "aggregate"
really means accumulating values of a metric along the time dimension, which is completely
different than the notion of aggregation we have used all along. The aggregation has always
been about rolling up values from children to parents. Can we choose a different word to describe
this aspect of accumulating values along the time dimension, and avoid using "aggregation"
for this? "Accumulate"? "Cumulative"? Any suggestion?
Actually, the v2 patch has both. In TimelineCollector, AggregatedMetrics means rolling up values
from children to parents, while AggregatedArea means accumulating the aggregated values of a metric
along the time dimension. Perhaps calculating AggregatedArea does not need to be split out
into a separate method? The naming was a bit rushed for the PoC, but we can pick better
names later.

bq. For example, consider HDFS bytes written. The time accumulation is already built into
it (see (1)). If you further accumulate this along the time dimension, it becomes quadratic
(doubly integrated) in time. I don't see how that can be useful.
You are right. For cases like the one you mention here, time accumulation is not very useful.
So besides "toAggregate", we may also need another flag (like "toAccumulate") to indicate
whether a metric value needs to be accumulated along the time dimension. Since we cannot assume
whether counters are already accumulated over time, the client keeps the flexibility to
put either accumulated or tack-on values for a counter. Thoughts?
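
To make the point concrete, here is a small Java sketch of time accumulation guarded by a
"toAccumulate" flag. It assumes the area is a left Riemann sum (each value weighted by the time
until the next sample); that formula, the class name TimeAccumulationSketch and its methods are
illustrative assumptions, not what the patch necessarily does.

```java
final class TimeAccumulationSketch {
    // Assumed definition of "area": sum of value * (time until the next sample).
    static long area(long[] timestampsSec, long[] values) {
        long area = 0;
        for (int i = 0; i + 1 < values.length; i++) {
            area += values[i] * (timestampsSec[i + 1] - timestampsSec[i]);
        }
        return area;
    }

    // Hypothetical per-metric switch: accumulate over time, or keep the latest value.
    static long store(long[] timestampsSec, long[] values, boolean toAccumulate) {
        if (values.length == 0) {
            return 0;
        }
        return toAccumulate ? area(timestampsSec, values) : values[values.length - 1];
    }

    public static void main(String[] args) {
        long[] ts = {0, 10, 20, 30};
        long[] cpuCores = {2, 2, 2, 2};           // gauge: instantaneous usage
        long[] bytesWritten = {0, 100, 200, 300}; // counter: already cumulative
        System.out.println(store(ts, cpuCores, true));
        System.out.println(store(ts, bytesWritten, false));
        System.out.println(area(ts, bytesWritten));
    }
}
```

For the gauge, accumulation yields 60 core-seconds. For the already-cumulative counter, the flag
keeps the value at 300, whereas accumulating it anyway gives 3000, a doubly integrated number
with no obvious use.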

bq. I think it would be OK to do this and not the average/max of the previous discussion.
I'd like to hear what others think about this.
Either way should work: since we have the area value at every timestamp, we can recalculate the
average/max later if necessary. I would like to hear others' comments too.

bq. Can we introduce a configuration that disables this time accumulation feature? As we discussed,
some may not want to have this feature enabled and are perfectly happy with simple aggregation
(from children to parents). It would be good to isolate this part and be able to enable/disable
it.
We can certainly disable accumulation at the system level. We can also disable accumulation at the
metric level (as proposed above) even when accumulation is enabled at the system level.

bq. For timeseries, we need to decide what aggregation means. One option is that we could
normalize the values to a minute level granularity. For example, add up values per min across
each time. So anything that occurred within a minute will be assigned to the top of that minute:
eg if something happening at 2 min 10 seconds is considered to have occurred at 2 min. That
way we can sum up across flows/users/runs etc.
The other option is to record/store only the accumulated values at different timestamps and do
the delta calculation later if necessary. This supports more time granularities, since a query
could apply any granularity.
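
The two options can be compared with a short Java sketch. The class GranularitySketch and its
methods are hypothetical, and timestamps are assumed to be non-negative milliseconds: option 1
floors each timestamp to the top of its minute and sums, option 2 keeps cumulative values and
derives a delta at query time.

```java
import java.util.Map;
import java.util.TreeMap;

final class GranularitySketch {
    private static final long MINUTE_MS = 60_000L;

    // Option 1: assign each value to the top of its minute and sum per minute.
    static TreeMap<Long, Long> sumPerMinute(long[] tsMillis, long[] values) {
        TreeMap<Long, Long> buckets = new TreeMap<>();
        for (int i = 0; i < tsMillis.length; i++) {
            long minuteStart = tsMillis[i] - (tsMillis[i] % MINUTE_MS);
            buckets.merge(minuteStart, values[i], Long::sum);
        }
        return buckets;
    }

    // Option 2: store cumulative values; compute the delta for any window at query time.
    static long delta(TreeMap<Long, Long> cumulative, long fromMillis, long toMillis) {
        Map.Entry<Long, Long> start = cumulative.floorEntry(fromMillis);
        Map.Entry<Long, Long> end = cumulative.floorEntry(toMillis);
        long startValue = (start == null) ? 0 : start.getValue();
        long endValue = (end == null) ? 0 : end.getValue();
        return endValue - startValue;
    }

    public static void main(String[] args) {
        // An event at 2 min 10 s (130000 ms) lands in the bucket starting at 2 min.
        System.out.println(sumPerMinute(new long[] {120_000, 130_000, 185_000},
                                        new long[] {1, 2, 4}));

        TreeMap<Long, Long> cumulative = new TreeMap<>();
        cumulative.put(60_000L, 10L);
        cumulative.put(120_000L, 25L);
        cumulative.put(180_000L, 40L);
        System.out.println(delta(cumulative, 60_000, 180_000));
    }
}
```

With option 2 the same stored series answers queries at any window size, at the cost of doing
the subtraction at read time.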

> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-poc-v1.patch,
YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: resource (CPU,
Memory) consumption across all containers, number of containers launched/completed/failed,
etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show
details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient when based on Application-level
aggregations rather than raw entity-level data, as far fewer rows need to be scanned (after filtering
out non-aggregated entities, like: events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
