Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 26 Aug 2015 16:11:46 +0000 (UTC)
From: "Junping Du (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12838495.1434552998000.172176.1440605506483@Atlassian.JIRA>
In-Reply-To: <JIRA.12838495.1434552998000@Atlassian.JIRA>
References: <JIRA.12838495.1434552998000@Atlassian.JIRA>
 <JIRA.12838495.1434552998098@arcas>
Subject: [jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation
 for YARN system metrics
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714416#comment-14714416 ] 

Junping Du commented on YARN-3816:
----------------------------------

Thanks [~varun_saxena] for the review and comments!
bq. If we use same scheme for long or double, we may end up with 4 ORs' for a single metric. Maybe we can use cell tags for aggregation.
That's a good point! When I was working on the poc patch a few weeks ago, YARN-4053 hadn't been brought up for discussion yet, so I thought it was a little overkill to use a cell tag to specify a single boolean value. Now it seems like a good approach, but I would prefer to defer this decision to YARN-4053, since there are other higher-priority comments to address here and that lets us move faster. What do you think?

bq. Maybe in TimelineCollector#aggregateMetrics, we should do aggregation only if the flag is enabled.
That's true. That's part of the reason why the aggregation flag was added to the metric. I will add the check in the next patch.
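
To make the intended check concrete, here is a minimal sketch of flag-gated aggregation. The types FlaggedMetric and FlagAwareAggregator and the field name toAggregate are hypothetical stand-ins for illustration only, not the actual TimelineMetric or TimelineCollector API.

```java
import java.util.List;

// Hypothetical stand-in for a metric; NOT the real TimelineMetric class.
final class FlaggedMetric {
  final String id;
  final long value;
  final boolean toAggregate; // assumed name for the aggregation flag

  FlaggedMetric(String id, long value, boolean toAggregate) {
    this.id = id;
    this.value = value;
    this.toAggregate = toAggregate;
  }
}

// Hypothetical helper showing the check; NOT TimelineCollector#aggregateMetrics.
final class FlagAwareAggregator {
  // Sums only the metrics whose aggregation flag is set; others are skipped.
  static long aggregate(List<FlaggedMetric> metrics) {
    long total = 0L;
    for (FlaggedMetric m : metrics) {
      if (!m.toAggregate) {
        continue; // flag disabled: leave this metric out of the aggregate
      }
      total += m.value;
    }
    return total;
  }
}
```

With this shape, metrics that do not opt in never contribute to the aggregated value.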

bq. In TimelineCollector#appendAggregatedMetricsToEntities any reason we are creating separate TimelineEntity objects for each metric ? Maybe create a single entity containing a set of metrics.
Nice catch.

bq. 3 new maps have been introduced in TimelineCollector and these are used as base to calculate aggregated value. What if the daemon crashes?
For the RM, it could persist the maps to RMStateStore. For the NM, that may not be enough, as the NM could also be lost. We need a mechanism so that if a TimelineCollector is relaunched somewhere else, it reads the raw metrics and recovers the maps before it starts working. This will be part of the failover JIRAs such as YARN-3115, YARN-3359, etc.

bq. In TimelineMetricCalculator some functions have duplicate if conditions for long.
Fixed.

bq. In TimelineMetricCalculator#sum, to avoid negative values due to overflow, we can change conditions like below...
As with the comments above, the overflow case will be handled in the next patch.
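
For reference, one possible way to keep a long sum from wrapping to a negative value is to saturate on overflow. This is only a sketch under that assumption; the class name SafeSum is hypothetical, it requires Java 8 for Math.addExact, and it is not necessarily what the next patch will do.

```java
// Hypothetical helper; NOT the actual TimelineMetricCalculator#sum.
final class SafeSum {
  // Adds two longs, clamping to Long.MAX_VALUE / Long.MIN_VALUE
  // instead of silently wrapping around on overflow.
  static long saturatingAdd(long a, long b) {
    try {
      return Math.addExact(a, b); // throws ArithmeticException on overflow
    } catch (ArithmeticException e) {
      // Overflow only happens when both operands share a sign,
      // so the sign of 'a' tells us which bound was crossed.
      return a > 0 ? Long.MAX_VALUE : Long.MIN_VALUE;
    }
  }
}
```

Saturation is a design choice: it keeps the aggregate monotonic at the cost of losing precision once the bound is reached.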

bq. In TimelineMetric#aggregateTo, maybe use getValues instead of getValuesJAXB?
I would prefer to use the TreeMap because it keeps the keys (timestamps) sorted when it is accessed. The aggregateTo() algorithm assumes metrics are sorted by timestamp.
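
To illustrate why a sorted map helps, here is a small sketch of a timestamp-ordered lookup. The class SortedTimeSeries and its methods are hypothetical and do not reproduce the actual aggregateTo() algorithm; they only show the ordering guarantee that java.util.TreeMap provides.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; NOT the real TimelineMetric#aggregateTo.
final class SortedTimeSeries {
  // Copies samples (timestamp -> value) into a TreeMap so that
  // iteration and navigation follow ascending timestamp order.
  static TreeMap<Long, Long> sortByTimestamp(Map<Long, Long> samples) {
    return new TreeMap<>(samples);
  }

  // Returns the value of the latest sample at or before 'time',
  // or 0 if no sample exists that early.
  static long valueAt(TreeMap<Long, Long> series, long time) {
    Map.Entry<Long, Long> entry = series.floorEntry(time);
    return entry == null ? 0L : entry.getValue();
  }
}
```

An unsorted map would need a full scan for the same lookup, while the sorted map answers it by navigation.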

bq. Also I was wondering if TimelineMetric#aggregateTo should be moved to some util class. TimelineMetric is part of object model and exposed to client. And IIUC aggregateTo wont be called by client.
As Li mentions below, it is a bit tricky to have a utility class for any class in the API, because it would mislead users into using it, which is not our intention, at least for now. aggregateTo is not as straightforward and generically useful as the methods in TimelineMetricCalculator, so let's hold off on exposing it in a utility class for now. Making it static sounds good though.

bq. What is EntityColumnPrefix#AGGREGATED_METRICS meant for?
It is something developed at the poc stage a few weeks ago, and it should be removed after we move to ApplicationTable.

> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch, YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: resource (CPU, Memory) consumption across all containers, number of containers launched/completed/failed, etc. We need this for apps while they are running as well as when they are done.
> - Also, framework-specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show details of states at the framework level.
> - Aggregation at other levels (Flow/User/Queue) can be more efficient if it is based on application-level aggregations rather than raw entity-level data, as far fewer rows need to be scanned (after filtering out non-aggregated entities such as events, configurations, etc.).

