hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations
Date Tue, 23 Jun 2015 23:38:45 GMT

    [ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598577#comment-14598577 ]

Sangjin Lee commented on YARN-3815:

> About flow online aggregation, I am not quite sure about the requirement yet. Do we really
> want real time for flow aggregated data, or would some fine-grained time interval (like 15 secs)
> be good enough? If we want to show some nice metrics chart for a flow, this should be fine.

Yes, I agree with that. When I said "real time", I did not mean real time in the sense that
every metric is accurate to the second. Most likely the raw data themselves (e.g. container data)
are written on an interval anyway. Some type of time interval for aggregation is implied.

> Any special reason not to handle it in the same way as above, as an HBase coprocessor? It just
> sounds like a coarse-grained time interval, doesn't it?

I do see your point in that what I called the "real time" aggregation can be considered the
same type of aggregation as the "offline" aggregation only on a shorter time interval. However,
we also need to think about the use cases of such aggregated data.

The former type of aggregation is very much something that can be plugged into UI such as
the RM UI or Ambari to show more immediate data. These data may change as the user refreshes
the UI. So this is closer to the raw data.

On the other hand, the latter type of aggregation lends itself to more analytical, ad-hoc
use of the data. It can be used for calculating chargebacks, usage trending, reporting,
etc. Perhaps it could even contain more detailed info than the "real time" aggregated data
for reporting/data mining purposes. And that's where we would like to consider using Phoenix
to enable arbitrary ad-hoc SQL queries.

One analogy [~jrottinghuis] brings up regarding this is OLTP vs. OLAP.

That's why we also think it makes sense to do only "offline" (time-based) aggregation for
users and queues. At least in our case with hRaven, there hasn't been a compelling reason
to show user- or queue-aggregated data in semi-real time. It has been perfectly adequate to
show time-based aggregation, as data like this tend to be used more for reporting and analysis.
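
To make the "offline" (time-based) aggregation concrete, here is a minimal sketch, not the actual timeline service or hRaven implementation. It sums metric samples per user or per queue within fixed time buckets. The record fields (`user`, `queue`, `ts`, `value`) and the function name `aggregate_by_interval` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def aggregate_by_interval(samples, key, interval_secs):
    """Sum metric values per (key value, bucket start) pair.

    samples: iterable of dicts with the hypothetical fields
             'user', 'queue', 'ts' (epoch seconds) and 'value'.
    key: the field to group on, e.g. 'user' or 'queue'.
    interval_secs: bucket width in seconds (3600 for hourly).
    """
    totals = defaultdict(int)
    for sample in samples:
        # Align the timestamp to the start of its bucket.
        bucket_start = sample["ts"] - (sample["ts"] % interval_secs)
        totals[(sample[key], bucket_start)] += sample["value"]
    return dict(totals)


# Made-up sample data, for illustration only.
samples = [
    {"user": "alice", "queue": "default", "ts": 0, "value": 2},
    {"user": "alice", "queue": "default", "ts": 3599, "value": 3},
    {"user": "bob", "queue": "default", "ts": 3600, "value": 5},
]

hourly_by_user = aggregate_by_interval(samples, "user", 3600)
hourly_by_queue = aggregate_by_interval(samples, "queue", 3600)
```

The same function with a much smaller `interval_secs` (such as 15) would approximate the shorter-interval aggregation discussed above; the difference between the two types lies more in how the results are consumed than in the arithmetic.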

> [Aggregation] Application/Flow/User/Queue Level Aggregations
> ------------------------------------------------------------
>                 Key: YARN-3815
>                 URL: https://issues.apache.org/jira/browse/YARN-3815
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: Timeline Service Nextgen Flow, User, Queue Level Aggregations (v1).pdf
> Per previous discussions in some design documents for YARN-2928, the basic scenario is
> the query for stats can happen on:
> - Application level, expect return: an application with aggregated stats
> - Flow level, expect return: aggregated stats for a flow_run, flow_version and flow 
> - User level, expect return: aggregated stats for applications submitted by user
> - Queue level, expect return: aggregated stats for applications within the Queue
> Application states are the basic building block for all other level aggregations. We can
> provide Flow/User/Queue level aggregated statistics info based on application states (a dedicated
> table for application states is needed, which is missing from previous design documents like
> the HBase/Phoenix schema design).
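
The roll-up described in the issue summary above, where application-level records are the building block for flow, user and queue level statistics, can be sketched roughly as follows. This is an illustration under assumptions rather than the proposed schema or API: the record layout, the metric names (`vcore_secs`, `mem_mb_secs`) and the helper `roll_up` are all hypothetical.

```python
from collections import defaultdict


def roll_up(app_stats, level):
    """Aggregate per-application metrics up to one level.

    app_stats: list of dicts, one per application, each carrying the
               hypothetical fields 'flow', 'user', 'queue' and 'metrics'.
    level: 'flow', 'user' or 'queue'.
    Returns {level value: {metric name: summed value}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for app in app_stats:
        for metric, value in app["metrics"].items():
            totals[app[level]][metric] += value
    # Convert nested defaultdicts to plain dicts for easier comparison.
    return {name: dict(metrics) for name, metrics in totals.items()}


# Made-up application-level records, for illustration only.
app_stats = [
    {"app_id": "app_1", "flow": "etl", "user": "alice", "queue": "prod",
     "metrics": {"vcore_secs": 100, "mem_mb_secs": 2000}},
    {"app_id": "app_2", "flow": "etl", "user": "alice", "queue": "prod",
     "metrics": {"vcore_secs": 50, "mem_mb_secs": 1000}},
    {"app_id": "app_3", "flow": "report", "user": "bob", "queue": "prod",
     "metrics": {"vcore_secs": 10, "mem_mb_secs": 300}},
]

by_flow = roll_up(app_stats, "flow")
by_user = roll_up(app_stats, "user")
by_queue = roll_up(app_stats, "queue")
```

Because every level is derived from the same application-level records, a single table of per-application data is enough input for all three roll-ups in this sketch.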

This message was sent by Atlassian JIRA
