hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3816) [Aggregation] App-level Aggregation for YARN system metrics
Date Wed, 26 Aug 2015 16:11:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714416#comment-14714416 ]

Junping Du commented on YARN-3816:
----------------------------------

Thanks [~varun_saxena] for the review and comments!
bq. If we use same scheme for long or double, we may end up with 4 ORs' for a single metric.
Maybe we can use cell tags for aggregation.
That's a good point! When I was doing the poc patch a few weeks ago, YARN-4053 hadn't been
brought up for discussion, so I thought it was a little overkill to use a cell tag just to
specify a single boolean value. Now it seems to be a good way, but I would prefer to defer this
decision to YARN-4053, since there are other higher-priority comments to address here, so we
can move faster. What do you think?

bq. Maybe in TimelineCollector#aggregateMetrics, we should do aggregation only if the flag
is enabled.
That's true. That's part of the reason why the aggregation flag was added to the metric. I will
add the check in the next patch.
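
A minimal sketch of the kind of check being discussed, assuming a per-metric boolean flag.
SketchMetric, toAggregate and FlagAwareAggregator are hypothetical names for illustration only
and are not the actual TimelineMetric or TimelineCollector code:

```java
import java.util.List;

// Hypothetical stand-in for a timeline metric; NOT the real TimelineMetric class.
class SketchMetric {
  final String id;
  final long value;
  final boolean toAggregate; // assumed per-metric aggregation flag

  SketchMetric(String id, long value, boolean toAggregate) {
    this.id = id;
    this.value = value;
    this.toAggregate = toAggregate;
  }
}

// Hypothetical aggregator that honors the flag.
class FlagAwareAggregator {
  // Sums only the metrics whose aggregation flag is set; others are skipped.
  static long aggregate(List<SketchMetric> metrics) {
    long total = 0L;
    for (SketchMetric m : metrics) {
      if (!m.toAggregate) {
        continue; // flag disabled: leave this metric out of the aggregate
      }
      total += m.value;
    }
    return total;
  }
}
```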

bq. In TimelineCollector#appendAggregatedMetricsToEntities any reason we are creating separate
TimelineEntity objects for each metric ? Maybe create a single entity containing a set of
metrics.
Nice catch.

bq. 3 new maps have been introduced in TimelineCollector and these are used as base to calculate
aggregated value. What if the daemon crashes?
For RM, it could persist the maps to RMStateStore. For NM, that may not be enough, as the NM
could be lost as well. We need a mechanism so that if the TimelineCollector is relaunched
somewhere else, it reads the raw metrics and recovers the maps before it starts working. This
will be part of the failover JIRAs, such as YARN-3115, YARN-3359, etc.

bq. In TimelineMetricCalculator some functions have duplicate if conditions for long.
Fixed.

bq. In TimelineMetricCalculator#sum, to avoid negative values due to overflow, we can change
conditions like below...
As with the comments above, the overflow case will be handled in the next patch.
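
For illustration, one possible way to keep a long sum from wrapping to a negative value is
saturating addition. The condition proposed in the review is elided above, so this is not that
proposal, and SafeSum and sumLong are hypothetical names rather than TimelineMetricCalculator
code:

```java
// Hypothetical helper; NOT the actual TimelineMetricCalculator#sum.
class SafeSum {
  // Adds two longs and clamps to Long.MAX_VALUE / Long.MIN_VALUE on overflow
  // instead of silently wrapping around.
  static long sumLong(long a, long b) {
    try {
      return Math.addExact(a, b); // throws ArithmeticException on overflow
    } catch (ArithmeticException e) {
      // Overflow is only possible when a and b share a sign, so the sign of a
      // tells us which bound was crossed.
      return a > 0 ? Long.MAX_VALUE : Long.MIN_VALUE;
    }
  }
}
```

Whether clamping, throwing, or widening to double is the right behavior is a design choice
this sketch leaves open.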

bq. In TimelineMetric#aggregateTo, maybe use getValues instead of getValuesJAXB?
I would prefer to use the TreeMap because it keeps keys (timestamps) sorted when accessed. The
aggregateTo() algorithm assumes metrics are sorted by timestamp.
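
A small sketch of the ordering property being relied on: java.util.TreeMap iterates its keys
in ascending order regardless of insertion order. SortedValuesSketch and timestampsInOrder are
hypothetical names, and this does not reproduce aggregateTo() itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of timestamp ordering; NOT TimelineMetric code.
class SortedValuesSketch {
  // Copies timestamp -> value pairs into a TreeMap and returns the timestamps
  // in the order an aggregation loop would visit them (ascending).
  static List<Long> timestampsInOrder(Map<Long, Number> unordered) {
    TreeMap<Long, Number> sorted = new TreeMap<>(unordered);
    return new ArrayList<>(sorted.keySet());
  }
}
```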

bq. Also I was wondering if TimelineMetric#aggregateTo should be moved to some util class.
TimelineMetric is part of object model and exposed to client. And IIUC aggregateTo wont be
called by client.
As Li mentions below, it is a bit tricky to have a utility class for any classes in the API,
because it could mislead users into using it, which is not our intention, at least for now.
aggregateTo is not as straightforward and generically useful as the methods in
TimelineMetricCalculator, so let's hold off on exposing it as a utility class for now. Making
it static sounds good, though.

bq. What is EntityColumnPrefix#AGGREGATED_METRICS meant for?
It is something developed at the poc stage a few weeks ago, and it should be removed after we
move to ApplicationTable.

> [Aggregation] App-level Aggregation for YARN system metrics
> -----------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, YARN-3816-YARN-2928-v1.patch,
YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, including: resource (CPU,
Memory) consumption across all containers, number of containers launched/completed/failed,
etc. We need this for apps while they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be aggregated to show
details of states at the framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient if based on Application-level
aggregations rather than raw entity-level data, as far fewer rows need to be scanned (after
filtering out non-aggregated entities, like: events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
