Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Sat, 7 Feb 2015 00:05:38 +0000 (UTC)
From: "Sangjin Lee (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12759910.1417819944000.287248.1423267538146@Atlassian.JIRA>
In-Reply-To: <JIRA.12759910.1417819944000@Atlassian.JIRA>
References: <JIRA.12759910.1417819944000@Atlassian.JIRA>
 <JIRA.12759910.1417819944885@arcas>
Subject: [jira] [Commented] (YARN-2928) Application Timeline Server (ATS)
 next gen: phase 1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310242#comment-14310242 ] 

Sangjin Lee commented on YARN-2928:
-----------------------------------

[~hitesh], continuing that discussion,

{quote}
[~vinodkv] Should have probably added more context from the design doc:
"We assume that the failure semantics of the ATS writer companion is the same as the AM. If the ATS writer companion fails for any reason, we try to bring it back up up to a specified number of times. If the maximum retries are exhausted, we consider it a fatal failure, and fail the application."
{quote}

Yes, I definitely could add more color to that point. I'm going to update the design doc, since a number of clarifications have been made, hopefully some time next week.

In the per-app timeline aggregator (a.k.a. ATS writer companion) model, the aggregator is a special container. We need to be able to allocate both the timeline aggregator and the AM, or neither. We also want to be able to co-locate the AM and the aggregator on the same node, so the RM needs to negotiate that combined capacity atomically. In other words, we don't want a situation where we were able to allocate the timeline aggregator but not the AM, or vice versa. If the AM needs 2 GB and the timeline aggregator needs 1 GB, then this pair needs to go to a node on which 3 GB can be allocated at that time.
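
As a rough illustration of the pairing constraint above, here is a minimal sketch, not the RM's actual scheduling code. The function name place_pair, the node names, and the free-memory map are all hypothetical; the only thing taken from the discussion is the rule that the AM and the aggregator are placed together on one node or not at all.

```python
from typing import Dict, Optional


def place_pair(free_mb: Dict[str, int], am_mb: int, aggregator_mb: int) -> Optional[str]:
    """Reserve memory for the AM and its aggregator on a single node.

    Hypothetical helper: either both requests are satisfied on the same
    node, or nothing is reserved and None is returned.
    """
    needed = am_mb + aggregator_mb  # the pair is negotiated as one combined request
    for node in sorted(free_mb):  # sorted only to keep the example deterministic
        if free_mb[node] >= needed:
            free_mb[node] -= needed  # single update, so no partial allocation is possible
            return node
    return None  # no node can host both, so neither is allocated


# Example from the comment: AM needs 2 GB, aggregator needs 1 GB.
nodes = {"node-a": 2048, "node-b": 4096}
chosen = place_pair(nodes, am_mb=2048, aggregator_mb=1024)
```

Here node-a could host the AM alone, but it is skipped because it cannot also host the aggregator; the pair lands on the node with 3 GB available.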

In terms of the failure scenarios, we may need to hash out some more details. Since allocation is considered as a pair, it is also natural to consider their failure semantics in the same manner. But a deeper question is, if the AM came up but the timeline aggregator didn't come up (for resource reasons or otherwise), do we consider that an acceptable situation? If the timeline aggregator for that app cannot come up, should that be considered fatal? Or, if apps are running but they're not logging critical lifecycle events, etc. because the timeline aggregator went down, do we consider that situation acceptable? The view in the discussion was that it is probably not acceptable: if it were a common occurrence, it would leave a large hole in the collected timeline data, and the overall value of the timeline data would go down significantly.
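
The failure semantics quoted from the design doc (restart the aggregator up to a specified number of times, then fail the application) could be sketched as follows. This is only an illustration under that stated assumption, which the discussion above still treats as open; handle_aggregator_failure, the restart callback, and the returned state strings are hypothetical names, not YARN APIs.

```python
from typing import Callable


def handle_aggregator_failure(restart: Callable[[], bool], max_retries: int) -> str:
    """React to a failed timeline aggregator.

    Hypothetical sketch: call restart() up to max_retries times; if none of
    the attempts succeeds, treat the failure as fatal for the application.
    """
    for _ in range(max_retries):
        if restart():  # restart() returns True once the aggregator is back up
            return "APP_RUNNING"
    return "APP_FAILED"  # retries exhausted: same outcome as exhausting AM retries
```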

That said, this point is deferred somewhat because we're initially starting out with a per-node aggregator option, which sidesteps this issue somewhat, though not completely.

> Application Timeline Server (ATS) next gen: phase 1
> ---------------------------------------------------
>
>                 Key: YARN-2928
>                 URL: https://issues.apache.org/jira/browse/YARN-2928
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf, Data model proposal v1.pdf
>
>
> We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed.
> This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort.

