Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 16 Sep 2015 19:45:46 +0000 (UTC)
From: "Junping Du (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12838495.1434552998000.358871.1442432746877@Atlassian.JIRA>
In-Reply-To: <JIRA.12838495.1434552998000@Atlassian.JIRA>
References: <JIRA.12838495.1434552998000@Atlassian.JIRA>
 <JIRA.12838495.1434552998098@arcas>
Subject: [jira] [Commented] (YARN-3816) [Aggregation] App-level aggregation
 and accumulation for YARN system metrics
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791012#comment-14791012 ]

Junping Du commented on YARN-3816:
----------------------------------

Thanks Naga, Varun and Li for the review and comments! Let me address them one by one.
First, Naga's comments:
bq. Following are not completely achieved right? number of containers launched/completed/failed, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states in framework level.
We are almost there. The number of containers is existing info that is being addressed in YARN-3880. Framework-specific metrics are a separate topic: we are still discussing the different requirements for MapReduce and other apps, which is out of scope for this JIRA. That's why the title says YARN system metrics.

bq. In the doc, ApplicationState Table (aggregated from AppLevelTimelineCollector) has Container Aggregate metrics (allocated: 0 preempted:0 failed: 0 reuse: 0 ) is this req @ AppLevelTimelineCollector felt it should be only @ aggregated from RMTimelineCollector. Also time(start: last_modification: avg_execution ) is required as metric? may be i misread the table description?
As said above, YARN-3880 is supposed to track the container count metrics. Maybe we can move the discussion there?

bq. In the doc aggregation-design-discussion.pdf, you had mentioned that time average & max is what will be considered, but in the patch it seems more like only SUM is supported neither avg or max, so is sum more imp than the other(or am i missing something) ? Also would like to know the significance of this measurement as i felt per-container average more helpful as it can be useful for calibrating RM.
We discussed this before and chose SUM as the first operation to support when aggregating metrics. There are definitely other useful operations that we could add later.

bq. IIUC Based on the current design aggregation seems to be happening @ the collector end. in that case do we require TimelineWriter.aggregate(TimelineEntity data, TimelineAggregationTrack track) ? Is there any idea to push some logic to writer for aggregation?
No. App aggregation is per collector, not per writer, because we currently share a single writer on the NM across all app collectors. I would prefer that each collector thread maintain its own state and calculation.

bq. TimelineAggregationBasis doesn't have value for queue, as this is used in TimelineReaderWebServices, isn't it required for reader?
If my understanding is correct, queue info is not a must for the app entity; we only require flow info, etc. However, I will double-check this on the reader side.

bq. will it be required to accumulate time series data with single value data and vice versa ? would accumulation need to be done on same type ? if not some real scenarios where it can be possibly happen.
In toAccumulate, we support accumulating time series data onto single value data (the basis data), because we can assume the basis data is always a single value that comes from the previous accumulation result. If there are scenarios where we want the accumulated result to be time series data, we can add a separate method later. Make sense?
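
The accumulation step described above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class AccumulationSketch and the method accumulateOnto are hypothetical names, and it assumes that accumulation is a plain SUM of the time series points onto the single-value basis.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; this is not the TimelineMetric class from the patch.
final class AccumulationSketch {

    // Folds a time series (timestamp -> value) onto a single-value basis.
    // Assumption: accumulation here is a plain SUM of the points.
    static long accumulateOnto(long basis, Map<Long, Long> timeSeries) {
        long total = basis;
        for (long value : timeSeries.values()) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Long, Long> series = new TreeMap<>();
        series.put(1000L, 5L);
        series.put(2000L, 7L);

        // The first round starts from an empty basis of 0.
        long first = accumulateOnto(0L, series);
        // The next round reuses the previous single-value result as its basis.
        long second = accumulateOnto(first, series);
        System.out.println(first + " " + second); // prints "12 24"
    }
}
```

Because the result is always a single value, it can be fed straight back in as the basis for the next round, which is the property the reply relies on.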

bq. Would it be better to have set of operation which can be performed in TimelineMetric so that accumulateTo automatically detect and accumulate for diff operations ? currently it seems like statically set to SUM in TimelineCollector.
We support SUM and REP (replace) already. As in the comments above, we can add more operations later as more specific requirements come up.
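
As a rough illustration of how the two operations differ, here is a minimal sketch. The names OperationSketch, Op and apply are hypothetical and not taken from the patch; only the operation names SUM and REP come from this thread.

```java
// Hypothetical sketch; the enum and method names are illustrative only.
final class OperationSketch {

    enum Op {
        // SUM adds the incoming value to the basis.
        SUM {
            @Override
            long apply(long basis, long incoming) {
                return basis + incoming;
            }
        },
        // REP (replace) discards the basis and keeps the incoming value.
        REP {
            @Override
            long apply(long basis, long incoming) {
                return incoming;
            }
        };

        abstract long apply(long basis, long incoming);
    }

    public static void main(String[] args) {
        System.out.println(Op.SUM.apply(10L, 5L)); // prints 15
        System.out.println(Op.REP.apply(10L, 5L)); // prints 5
    }
}
```

Keeping each operation behind one method like this would let further operations (for example avg or max, as raised in the review) be added without touching the callers.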

bq. Currently for each putEntity call in collector we are not only aggregating & invoking accumulateTo but also sending it to be written to the writer, but in the doc its mentioned that it will cache for 15 seconds and then update right?
No. We chose to aggregate and accumulate (this can be disabled by configuration) immediately, as in the current implementation. The earlier concern was about performance delay, but that now seems unnecessary. We can rethink this if we hit a perf bottleneck in the future.
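
The "aggregate immediately on each put" behaviour can be pictured with the sketch below. Everything in it is hypothetical: the class ImmediateAggregationSketch, its boolean flag and the string records standing in for writer calls are illustrative only, and the real collector and writer APIs are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a collector that accumulates on every put,
// with no caching window in between.
final class ImmediateAggregationSketch {

    // Stands in for the configuration switch mentioned in the reply.
    private final boolean accumulationEnabled;
    // Per-collector running state.
    private long runningTotal;
    // Stands in for calls to the writer; records what would be written.
    private final List<String> written = new ArrayList<>();

    ImmediateAggregationSketch(boolean accumulationEnabled) {
        this.accumulationEnabled = accumulationEnabled;
    }

    void putEntity(long metricValue) {
        // The raw entity is always written.
        written.add("entity:" + metricValue);
        if (accumulationEnabled) {
            // Accumulate right away and write the updated total.
            runningTotal += metricValue;
            written.add("accumulated:" + runningTotal);
        }
    }

    List<String> written() {
        return written;
    }

    public static void main(String[] args) {
        ImmediateAggregationSketch collector = new ImmediateAggregationSketch(true);
        collector.putEntity(5L);
        collector.putEntity(7L);
        // prints [entity:5, accumulated:5, entity:7, accumulated:12]
        System.out.println(collector.written());
    }
}
```

If the immediate path ever became a bottleneck, a cached variant would only need to change when the accumulated record is emitted, not how it is computed.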

bq. Not sure earlier why was pid added for a container cpu and mem usage metric and not sure why we are removing it. But seems like for a given container we do not req pid to be appended as it will be unique to it. is that the reason we are removing it?
The pid was added wrongly before, as this info is useless: the enclosing TimelineEntity (the container entity) already has the container id, which makes these metrics unique enough. We also need the metric ID to stay the same for each type (CPU, Memory, etc.) so that aggregation and accumulation work.

bq. do we need to set aggregateTo to true for container metrics(cputotalCore% & pmemUsage) to ? also we are currently not capturing vmemUsage do we need to capture it?
We chose to record only these two metrics in previous JIRAs (like YARN-3045). Maybe we can keep following that and add more metrics later if necessary?

bq. In the Doc its mentioned we are going to split the table "ApplicationState table" into 2 It can be split into two tables by aggregated from RMTimelineCollector or AppLevelTimelineCollector, is it req?
I don't think splitting into two tables is a must. But I suggest we revisit this in YARN-3880 when putting in some info from RMTimelineCollector. Here we don't have to worry, as all aggregated info is from AppLevelTimelineCollector.

bq. yarn.timeline-service.aggregation.accumulation.enabled can have default value to be explicitly set as true in yarn-default.xml as per the default value in yarn config.
Ok. Will do.

bq. in TestTimelineMetric.testAccumulationOnTimelineMetrics assertEquals expected value should come as first arg and the actual expression as next. when it fails exception msg will come wrong. also unused import in that class
Nice catch! Will fix it in the next patch.
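
The effect of swapping the arguments can be seen in the sketch below. It uses a small stand-in assertEquals rather than JUnit itself, so that it is self-contained. The stand-in follows JUnit 4's (expected, actual) parameter order and message shape, and the class AssertOrderSketch and the helper failureMessage are hypothetical.

```java
// Hypothetical stand-in for JUnit's assertEquals, for illustration only.
final class AssertOrderSketch {

    // Same parameter order as JUnit 4: expected first, actual second.
    static void assertEquals(long expected, long actual) {
        if (expected != actual) {
            throw new AssertionError(
                "expected:<" + expected + "> but was:<" + actual + ">");
        }
    }

    // Runs the assertion and returns the failure message, or "passed".
    static String failureMessage(long expected, long actual) {
        try {
            assertEquals(expected, actual);
            return "passed";
        } catch (AssertionError e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        long computed = 5 + 6; // the code under test was supposed to yield 12

        // Correct order: the message reports the real expectation.
        System.out.println(failureMessage(12, computed)); // prints expected:<12> but was:<11>

        // Swapped order: the message claims 11 was expected, which is misleading.
        System.out.println(failureMessage(computed, 12)); // prints expected:<11> but was:<12>
    }
}
```

The test passes or fails the same way in both orders; only the failure message changes, which is why the mistake is easy to miss until a test breaks.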

bq. 2 static methods of TimelineCollector.aggregateMetrics(TimelineEntities) are public are they planned to be used some other class? if not we can make it private. Also aggregateMetrics returns a map, can it be a List/Set which would suffice for appendAggregatedMetricsToEntities
Sounds good. Let's narrow the visibility and return a Set instead.

bq. EntityColumnPrefix.AGGREGATED_METRICS is not used anywhere, is it req?
Will remove it.

bq. Trying to create a setup and test the patch in the cluster, if i come across more queries will inform.
Cool. That would help. Thanks!

I will reply to Varun's and Li's comments in the next segment, as this one is already pretty long.

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, YARN-3816-YARN-2928-v2.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present aggregated states of each application to the end user, including: resource (CPU, Memory) consumption across all containers, number of containers launched/completed/failed, etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient when based on application-level aggregations rather than raw entity-level data, as far fewer rows need to be scanned (after filtering out non-aggregated entities such as events, configurations, etc.).


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)