Date: Fri, 19 Jun 2015 17:12:01 +0000 (UTC)
From: "Sangjin Lee (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3815) [Aggregation] Application/Flow/User/Queue Level Aggregations

    [ https://issues.apache.org/jira/browse/YARN-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593649#comment-14593649 ]

Sangjin Lee commented on YARN-3815:
-----------------------------------

Thanks [~djp] for putting this together. I added comments in the offline doc, but I'll move the main ones (the high-level comments) over here.

(0) on "aggregation"

Like you mentioned, I think it is helpful to make a distinction between the different types of aggregation we're talking about here; they are somewhat separate functionalities. My sense of the types of aggregation is similar to yours, but not exactly the same. It would be good if we can converge on their definitions. I see 4 types of aggregation:
- app-level aggregation
- app-to-flow aggregation ("online" or "real time")
- time-based flow aggregation ("batch" or "periodic")
- user/queue aggregation

I'll explain my definitions in more detail below.

(1) app-level aggregation

This is aggregating metrics from sub-app entities (e.g. containers) up to the YARN application. It can include both framework-specific metrics (e.g. HDFS bytes written for MapReduce) and YARN-system metrics (e.g. container CPU %).

It would be ideal for app entities to have values for these metrics aggregated from sub-app entities. How we do that will differ between framework-specific metrics and YARN-system metrics.

For framework-specific metrics, I would say this falls on the individual frameworks. The framework AM usually already aggregates them in memory (consider MR job counters, for example), so it is straightforward for the framework to write them directly onto the YARN app entities. Furthermore, it is problematic to add them to the sub-app YARN entities and ask YARN to aggregate them up to the application: a framework's sub-app entities may not even align with YARN's sub-app entities. For example, in the case of MR there is a reasonable one-to-one mapping between a mapper/reducer task attempt and a container, but for other applications that may not be true. Forcing all frameworks to hang values off containers may not be practical. I think it's far easier for frameworks to write aggregated values to the YARN app entities.

For YARN-system metrics, this would need to be done by YARN. I think we can have the timeline collector aggregate the values in memory and write them out periodically. The details need to be worked out, but that is definitely one way to go. The only tricky part is that the container metrics would then have to flow through the per-app timeline collector, and could not come from the RM timeline collector (Junping pointed that out already).
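To make the in-memory idea a bit more concrete, here is a rough, purely illustrative sketch (the class and method names are invented for this comment and are not from the YARN-2928 branch): the per-app collector keeps the latest value reported by each container for each metric, and a background task periodically folds those into a single app-level value, which would then be written out as a metric on the application entity.

{code:java}
// Illustrative only: not actual YARN-2928 code. Assumes a sum-style metric;
// a real collector would also need other aggregation ops (avg, max),
// container completion handling, and the real write path to the app entity.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AppLevelMetricAggregator {
  // latest reported value per metric id, keyed by container id
  private final Map<String, Map<String, Long>> latestByMetric = new ConcurrentHashMap<>();
  private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

  public AppLevelMetricAggregator(long flushIntervalSeconds) {
    flusher.scheduleAtFixedRate(this::flush, flushIntervalSeconds, flushIntervalSeconds,
        TimeUnit.SECONDS);
  }

  // called whenever a container reports a metric value through the per-app collector
  public void report(String metricId, String containerId, long value) {
    latestByMetric.computeIfAbsent(metricId, k -> new ConcurrentHashMap<>())
        .put(containerId, value);
  }

  // fold the per-container values into one app-level value and emit it;
  // in the real collector this would become a metric on the application entity
  private void flush() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Map<String, Long>> e : latestByMetric.entrySet()) {
      long appValue = 0;
      for (long v : e.getValue().values()) {
        appValue += v;
      }
      System.out.printf("%d app-level %s = %d%n", now, e.getKey(), appValue);
    }
  }

  public void stop() {
    flusher.shutdown();
  }

  public static void main(String[] args) throws InterruptedException {
    AppLevelMetricAggregator agg = new AppLevelMetricAggregator(2);
    agg.report("CPU_PERCENT", "container_01", 40);
    agg.report("CPU_PERCENT", "container_02", 25);
    Thread.sleep(3000); // let one flush run: prints CPU_PERCENT = 65
    agg.stop();
  }
}
{code}

Whether the flush goes straight to the backend or through the normal entity write path is one of the details that would need to be worked out.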
(2) app-to-flow online aggregation

This is more or less live aggregated metrics at the flow level. This will still be based on the native HBase schema.

Actually, doing the above for app-level aggregation makes app-to-flow online aggregation simpler: it now only has to look at app entities to collect the data.

Initially we were thinking of leveraging an HBase coprocessor, but there are some technical challenges with that. We had a discussion on possible ways of doing this, and [~jrottinghuis] has a proposal for it. I'll let Joep chime in on this.

(3) time-based flow aggregation

This is different from the online aggregation in that it is aggregated along time boundaries (e.g. "daily", "weekly", etc.).

This can be based on the Phoenix schema, and it can be populated in an offline fashion (e.g. by running a mapreduce job).

(4) user/queue aggregation

This is another "offline" aggregation type. Also, I believe we're talking about time-based aggregation only; in other words, we would aggregate values for users only within a well-defined time window. There won't be a "real-time" aggregation of values the way there is for the flow aggregation.


> [Aggregation] Application/Flow/User/Queue Level Aggregations
> -------------------------------------------------------------
>
>                 Key: YARN-3815
>                 URL: https://issues.apache.org/jira/browse/YARN-3815
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: Timeline Service Nextgen Flow, User, Queue Level Aggregations (v1).pdf
>
>
> Per previous discussions in some design documents for YARN-2928, the basic scenario is that the query for stats can happen at:
> - Application level, expected return: an application with aggregated stats
> - Flow level, expected return: aggregated stats for a flow_run, flow_version and flow
> - User level, expected return: aggregated stats for applications submitted by the user
> - Queue level, expected return: aggregated stats for applications within the queue
> Application stats are the basic building block for all other levels of aggregation. We can provide Flow/User/Queue level aggregated statistics based on application stats (a dedicated table for application stats is needed, which is missing from previous design documents such as the HBase/Phoenix schema design).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)