hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2928) Application Timeline Server (ATS) next gen: phase 1
Date Sat, 07 Feb 2015 00:05:38 GMT

    [ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310242#comment-14310242 ]

Sangjin Lee commented on YARN-2928:

[~hitesh], continuing that discussion,

[~vinodkv] Should have probably added more context from the design doc:
"We assume that the failure semantics of the ATS writer companion is the same as the AM. If
the ATS writer companion fails for any reason, we try to bring it back up up to a specified
number of times. If the maximum retries are exhausted, we consider it a fatal failure, and
fail the application."

Yes, I definitely could add more color to that point. I'm going to update the design doc,
hopefully some time next week, as a number of clarifications have been made.

In the per-app timeline aggregator (a.k.a. ATS writer companion) model, the aggregator is a
special container, and we need to be able to allocate either both the timeline aggregator and
the AM or neither. We also want to be able to co-locate the AM and the aggregator on the same
node, so the RM needs to negotiate that combined capacity atomically. In other words, we don't
want a situation where we were able to allocate the aggregator but not the AM, or vice versa.
If the AM needs 2 GB and the timeline aggregator needs 1 GB, then this pair needs to go to a
node on which 3 GB can be allocated at that time.
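
To make the all-or-nothing placement concrete, here is a rough sketch in Python. It is not YARN code: the `Node` class, the `place_pair` function, and the first-fit node scan are hypothetical names and choices made up for illustration only. It just shows the idea of reserving the combined capacity on a single node or reserving nothing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a node: a name and its free memory in MB."""
    name: str
    free_mb: int


def place_pair(nodes: List[Node], am_mb: int, aggregator_mb: int) -> Optional[Node]:
    """Reserve AM + aggregator together on one node, or reserve nothing.

    Returns the chosen node, or None if no single node can hold the pair.
    """
    needed = am_mb + aggregator_mb  # the pair is sized as one request
    for node in nodes:
        if node.free_mb >= needed:
            node.free_mb -= needed  # one deduction covers both containers
            return node
    return None  # neither container is placed


if __name__ == "__main__":
    cluster = [Node("n1", 2048), Node("n2", 4096)]
    chosen = place_pair(cluster, am_mb=2048, aggregator_mb=1024)
    print(chosen)
```

Note that `n1` could hold the AM by itself but is skipped, because the request is evaluated as a 3 GB pair rather than as two independent containers.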

In terms of the failure scenarios, we may need to hash out some more details. Since allocation
is considered as a pair, it is also natural to consider their failure semantics in the same
manner. But a deeper question is, if the AM came up but the timeline aggregator didn't come
up (for resource reasons or otherwise), do we consider that an acceptable situation? If the
timeline aggregator for that app cannot come up, should that be considered fatal? Or, if apps
are running but they're not logging critical lifecycle events, etc. because the timeline aggregator
went down, do we consider that situation acceptable? The discussion concluded that it is
probably not acceptable: if it were a common occurrence, it would leave a large hole in the
collected timeline data, and the overall value of that data would drop significantly.
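
If we go with the "fatal" reading, the retry rule quoted from the design doc could look roughly like the Python sketch below. The function name, the status strings, and the choice to count the first launch separately from the retries are all assumptions for illustration, not anything decided in the design doc.

```python
from typing import Callable


def run_aggregator_with_retries(start_aggregator: Callable[[], bool],
                                max_retries: int) -> str:
    """Try to bring the aggregator up; fail the app once retries run out.

    start_aggregator is a hypothetical callable that returns True when the
    aggregator came up. One initial attempt is made, then up to max_retries
    further attempts.
    """
    attempts = 0
    while attempts <= max_retries:
        if start_aggregator():
            return "RUNNING"
        attempts += 1
    # Retries exhausted: treated as fatal, same as an AM failure.
    return "APP_FAILED"


if __name__ == "__main__":
    outcomes = iter([False, False, True])
    print(run_aggregator_with_retries(lambda: next(outcomes), max_retries=2))
```

The non-fatal alternative would return something other than `APP_FAILED` at the end and let the app keep running without timeline data, which is exactly the trade-off in question.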

That said, this point is deferred somewhat because initially we're starting out with a per-node
aggregator option. The per-node aggregator option somewhat sidesteps (but not completely)
this issue.

> Application Timeline Server (ATS) next gen: phase 1
> ---------------------------------------------------
>                 Key: YARN-2928
>                 URL: https://issues.apache.org/jira/browse/YARN-2928
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf, Data model proposal v1.pdf
> We have the application timeline server implemented in yarn per YARN-1530 and YARN-321.
> Although it is a great feature, we have recognized several critical issues and features that
> need to be addressed.
> This JIRA proposes the design and implementation changes to address those. This is phase
> 1 of this effort.

This message was sent by Atlassian JIRA
