Date: Wed, 16 Sep 2015 19:45:46 +0000 (UTC)
From: "Junping Du (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation and accumulation for YARN system metrics

[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791012#comment-14791012 ]

Junping Du commented on YARN-3816:
----------------------------------

Thanks Naga, Varun and Li for the review and comments! Let me address them one by one.

First, Naga's comments:

bq. Following are not completely achieved right? number of containers launched/completed/failed, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states in framework level.
We are almost there.
The number of containers should be existing info, which is being addressed in YARN-3880. Framework-specific metrics are a separate topic: we are still discussing the different requirements for MapReduce and other apps, which is out of scope for this JIRA. That is why the title says YARN system metrics.

bq. In the doc, ApplicationState Table (aggregated from AppLevelTimelineCollector) has Container Aggregate metrics (allocated: 0 preempted:0 failed: 0 reuse: 0 ) is this req @ AppLevelTimelineCollector felt it should be only @ aggregated from RMTimelineCollector. Also time(start: last_modification: avg_execution ) is required as metric? may be i misread the table description?
As said above, YARN-3880 is supposed to track the container count metrics. Maybe we can move the discussion there?

bq. In the doc aggregation-design-discussion.pdf, you had mentioned that time average & max is what will be considered, but in the patch it seems more like only SUM is supported neither avg or max, so is sum more imp than the other(or am i missing something) ? Also would like to know the significance of this measurement as i felt per-container average more helpful as it can be useful for calibrating RM.
We discussed this previously and chose SUM as the first operation to support when aggregating metrics. There are definitely other useful operations that we could add later.

bq. IIUC Based on the current design aggregation seems to be happening @ the collector end. in that case do we require TimelineWriter.aggregate(TimelineEntity data, TimelineAggregationTrack track) ? Is there any idea to push some logic to writer for aggregation?
No. App aggregation is per collector, not per writer, because we currently share a single writer on the NM across all app collectors. I would prefer to have each collector thread maintain its own state and calculation.

bq. TimelineAggregationBasis doesn't have a value for queue; as this is used in TimelineReaderWebServices, isn't it required for the reader?
If my understanding is correct, queue info is not a must for the app entity; we only require flow info, etc. However, I will double-check the reader side for this.

bq. will it be required to accumulate time series data with single value data and viceversa ? would accumulation need to be done on same type ? if not some real scenarios where it can be possibly happen.
In toAccumulate, we support accumulating time series data onto single-value data (the basis data), because we can assume the basis data is always a single value that comes from the previous accumulation result. If there are scenarios where we want the accumulated result to be time series data, we can add a separate method later. Does that make sense?

bq. Would it be better to have set of operation which can be performed in TimelineMetric so that accumulateTo automatically detect and accumulate for diff operations ? currently it seems like statically set to SUM in TimelineCollector.
We already support SUM and REP (replace). As in the comments above, we can add more operations later when there are more specific requirements.

bq. Currently for each putEntity call in collector we are not only aggregating & invoking accumulateTo but also sending it to be written to the writer, but in the doc its mentioned that it will cache for 15 seconds and then update right?
No. We chose to aggregate and accumulate immediately (this can be disabled by configuration), as in the current implementation. The earlier concern was about performance delay, but that seems unnecessary now. We can rethink this if we hit a perf bottleneck in future.

bq. Not sure earlier why was pid added for a container cpu and mem usage metric and not sure why we are removing it. But seems like for a given container we do not req pid to be appended as it will be unique to it.
is that the reason we are removing it?
Pid was added wrongly before, as this info is useless: the enclosing TimelineEntity (the container entity) already has the container id, which makes these metrics unique enough. We also need the metric ID to stay the same per type (CPU, Memory, etc.) for aggregation and accumulation.

bq. do we need to set aggregateTo to true for container metrics(cputotalCore% & pmemUsage) to ? also we are currently not capturing vmemUsage do we need to capture it?
We chose to record only these two metrics in previous JIRAs (like YARN-3045). Maybe we can keep following this and add more metrics later if necessary?

bq. In the Doc its mentioned we are going to split the table "ApplicationState table" into 2 It can be split into two tables by aggregated from RMTimelineCollector or AppLevelTimelineCollector, is it req?
I don't think splitting into two tables is a must. I suggest we revisit this in YARN-3880 when putting in some info from RMTimelineCollector. Here we don't have to worry, as all aggregated info comes from AppLevelTimelineCollector.

bq. yarn.timeline-service.aggregation.accumulation.enabled can have default value to be explicitly set as true in yarn-default.xml as per the default value in yarn config.
OK, will do.

bq. in TestTimelineMetric.testAccumulationOnTimelineMetrics assertEquals expected value should come as first arg and the actual expression as next. when it fails exception msg will come wrong. also unused import in that class
Nice catch! Will fix it in the next patch.

bq. 2 static methods of TimelineCollector.aggregateMetrics(TimelineEntities) are public are they planned to be used some other class? if not we can make it private. Also aggregateMetrics returns a map, can it be a List/Set which would suffice for appendAggregatedMetricsToEntities
Sounds good. Let's narrow the visibility and make it a Set instead.

bq. EntityColumnPrefix.AGGREGATED_METRICS is not used anywhere, is it req?
Will remove it.

bq. Trying to create a setup and test the patch in the cluster, if i come across more queries will inform.
Cool. That would help. Thanks!

I will reply to Varun's and Li's comments in the next segment, as this one is already pretty long.

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
> Key: YARN-3816
> URL: https://issues.apache.org/jira/browse/YARN-3816
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Junping Du
> Assignee: Junping Du
> Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: resource (CPU, Memory) consumption across all containers, number of containers launched/completed/failed, etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states at the framework level.
> - Aggregation at other levels (Flow/User/Queue) can be more efficient when based on application-level aggregations rather than raw entity-level data, as far fewer rows need to be scanned (filtering out non-aggregated entities such as events, configurations, etc.).
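
For readers following the accumulation answer above, here is a minimal Java sketch of folding a time series into a single-value basis with either SUM or REP. The class name AccumulationSketch, the Op enum and the accumulate signature are hypothetical and are not taken from the YARN-3816 patch. The semantics shown (SUM adds every point to the basis, REP keeps the most recent point) are an assumption made for illustration only.

```java
import java.util.TreeMap;

/** Hypothetical illustration only; this is NOT the TimelineMetric class from the patch. */
final class AccumulationSketch {

    /** The two operations the comment says are already supported. */
    enum Op { SUM, REP }

    /**
     * Folds a time series (timestamp -> value) into a single-value basis.
     * Assumed semantics: SUM adds every point to the basis; REP replaces
     * the basis with the value at the greatest timestamp.
     */
    static long accumulate(long basis, TreeMap<Long, Long> series, Op op) {
        if (series.isEmpty()) {
            return basis; // nothing new to fold in, keep the previous result
        }
        switch (op) {
            case SUM: {
                long total = basis;
                for (long value : series.values()) {
                    total += value;
                }
                return total;
            }
            case REP: {
                // TreeMap is ordered by key, so the last entry is the latest point.
                return series.lastEntry().getValue();
            }
            default:
                throw new IllegalArgumentException("Unsupported operation: " + op);
        }
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> series = new TreeMap<>();
        series.put(1000L, 5L);
        series.put(2000L, 7L);

        // The basis is the single value left over from the previous accumulation.
        System.out.println(accumulate(10L, series, Op.SUM)); // prints 22
        System.out.println(accumulate(10L, series, Op.REP)); // prints 7
    }
}
```

Because the result is again a single value, it can be fed back in as the basis for the next batch, which matches the assumption in the reply that the basis always comes from the previous accumulation result.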