hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
Date Tue, 23 Jun 2015 23:47:44 GMT

    [ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598590#comment-14598590 ]

Sangjin Lee commented on YARN-3815:

Moving from offline discussions...

Now, aggregation of *time series metrics* is rather tricky and needs to be defined. Would
an aggregated metric (e.g. at the flow level) of time series metrics (e.g. at the app level)
be a time series itself? I see several problems with defining it as a time series. Individual
app time series may be sampled at different times, and it's not clear what the time series of
the aggregated flow metric would look like.

I think it might be simpler to say that an aggregated flow metric over time series need not
be a time series itself.

There is a general issue of which point in time the aggregated values belong to, regardless
of whether they are time series or not. If all leaf values were recorded at the same time,
it would be unambiguous: the aggregated metric value belongs to that same time. However, that
is rarely the case.

I think the current implicit behavior in Hadoop is simply to take the latest values and add
them up. One example is the MR counters (task level and job level). The task-level counters
are obtained at different times. Still, the corresponding job counters are simply sums of
all the latest task counters, even though they may have been taken at different times. We're
basically taking that as an approximation that's "good enough". In the end, the final numbers
become accurate; in other words, the final values are the truly accurate aggregate.
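To make the "sum of latest values" behavior concrete, here is a minimal sketch in plain Python (not actual MapReduce code; the task names and snapshot data are made up for illustration). Each task reports timestamped counter snapshots, and the job-level counter simply sums the most recent snapshot from each task, ignoring the timestamp skew between them:

```python
# Each task reports (timestamp, value) snapshots for one counter.
task_snapshots = {
    "task_1": [(100, 10), (200, 25)],  # latest value 25 at t=200
    "task_2": [(150, 40)],             # latest value 40 at t=150
    "task_3": [(120, 5), (300, 7)],    # latest value  7 at t=300
}

def latest_value(snapshots):
    """Return the value of the snapshot with the greatest timestamp."""
    return max(snapshots)[1]

def job_counter(task_snapshots):
    """Sum each task's latest counter value, ignoring timestamp skew."""
    return sum(latest_value(s) for s in task_snapshots.values())

print(job_counter(task_snapshots))  # 72
```

The intermediate sums mix values taken at t=200, t=150, and t=300, which is the "good enough" approximation; once all tasks have reported their final values, the sum is exact.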

The time series basically adds another wrinkle to this. In the case of a simple value, the final
values are going to be correct, so this problem is less of an issue, but a time series will
retain intermediate values. Furthermore, the aggregate's publishing interval may have no relationship
to the publishing interval of the leaf values. I think the baseline approach should be either
(1) do not use time series for the aggregated metrics, or (2) just do a best-effort approximation
by adding up the latest leaf values and storing the result with its own timestamp.
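Baseline approach (2) can be sketched as follows, again in plain Python rather than Timeline Service code (the app names, sample data, and function name are hypothetical). The flow-level aggregate is a single value, not a time series: we take the latest sample of each app-level time series, sum them, and stamp the result with the aggregation's own timestamp:

```python
# Each app publishes (timestamp, value) samples at its own interval.
app_series = {
    "app_1": [(1000, 3), (1010, 5)],  # latest sample: 5 at t=1010
    "app_2": [(1005, 8)],             # latest sample: 8 at t=1005
}

def aggregate_flow_metric(app_series, aggregation_time):
    """Best-effort flow aggregate: sum of the apps' latest samples.

    The leaf samples may have been taken at different times; the result
    carries the aggregation's own timestamp rather than any leaf timestamp.
    """
    total = sum(max(samples)[1] for samples in app_series.values())
    return (aggregation_time, total)

print(aggregate_flow_metric(app_series, 2000))  # (2000, 13)
```

Note that the returned timestamp (t=2000) belongs to neither leaf sample, which is exactly the ambiguity discussed above: the aggregate is an approximation of the flow's state as of the aggregation run, not a point in any app's series.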

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> ------------------------------------------------------------
>                 Key: YARN-3815
>                 URL: https://issues.apache.org/jira/browse/YARN-3815
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: Timeline Service Nextgen Flow, User, Queue Level Aggregations (v1).pdf
> Per previous discussions in some design documents for YARN-2928, the basic scenario is that
> the query for stats can happen at:
> - Application level, expected return: an application with aggregated stats
> - Flow level, expected return: aggregated stats for a flow_run, flow_version and flow
> - User level, expected return: aggregated stats for applications submitted by the user
> - Queue level, expected return: aggregated stats for applications within the queue
> Application state is the basic building block for all other levels of aggregation. We can
> provide Flow/User/Queue level aggregated statistics based on application state (a dedicated
> table for application states is needed, which is missing from previous design documents like
> the HBase/Phoenix schema design).

This message was sent by Atlassian JIRA
