Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Thu, 18 Dec 2014 15:43:14 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12761893.1418684947000.58557.1418917394474@Atlassian.JIRA>
In-Reply-To: <JIRA.12761893.1418684947000@Atlassian.JIRA>
References: <JIRA.12761893.1418684947000@Atlassian.JIRA>
 <JIRA.12761893.1418684947653@arcas>
Subject: [jira] [Commented] (YARN-2964) RM prematurely cancels tokens for
 jobs that submit jobs (oozie)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251818#comment-14251818 ] 

Jason Lowe commented on YARN-2964:
----------------------------------

Thanks for the patch, Jian!  The Findbugs warnings appear to be unrelated.

I'm wondering about the change to token removal in the removeApplicationFromRenewal method (or remove).  If a sub-job completes, won't we remove the token from the allTokens map before the launcher job has completed?  A subsequent sub-job that requests token cancellation can then put the token back in the map and cause the token to be canceled when that sub-job leaves.  I think we need to repeat the logic from the original code before YARN-2704 here, i.e., only remove the token if the application ID matches.  That way the launcher job's token will remain _the_ token in that collection until the launcher job completes.
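
To make the suggestion concrete, here is a minimal sketch of "only remove the token if the application ID matches".  It is not the DelegationTokenRenewer code: the class, the TrackedToken type, the String-typed tokens and application IDs, and the method names are all hypothetical stand-ins chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the token tracking discussed above; not the RM's real classes.
final class OwnerCheckedRemoval {

    // Stand-in for DelegationTokenToRenew: remembers which application first registered the token.
    static final class TrackedToken {
        final String token;
        final String ownerAppId;

        TrackedToken(String token, String ownerAppId) {
            this.token = token;
            this.ownerAppId = ownerAppId;
        }
    }

    private final Map<String, TrackedToken> allTokens = new HashMap<>();

    // The first application to submit a token becomes its owner; later submitters do not replace it.
    void addApplication(String appId, String token) {
        allTokens.putIfAbsent(token, new TrackedToken(token, appId));
    }

    // Remove the token only when the finishing application is the one that owns it.
    // Returns true if the token was actually removed.
    boolean removeApplication(String appId, String token) {
        TrackedToken tracked = allTokens.get(token);
        if (tracked != null && tracked.ownerAppId.equals(appId)) {
            allTokens.remove(token);
            return true;
        }
        return false;
    }

    boolean isTracked(String token) {
        return allTokens.containsKey(token);
    }
}
```

With this shape, a sub-job finishing leaves the launcher's entry in place, and only the launcher's own completion removes it.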

The following comment doesn't match the code: the code appears to cancel the token at the end if any job sharing it wants to cancel at the end.
{code}
          // If any of the jobs sharing the same token set shouldCancelAtEnd
          // to true, we should not cancel the token.
          if (evt.shouldCancelAtEnd) {
            dttr.shouldCancelAtEnd = evt.shouldCancelAtEnd;
          }
{code}
I think both the logic and the comment should say that if any job doesn't want to cancel, then we won't cancel.  The code seems to be trying to do the opposite, so I'm not sure how the unit test is passing.  Maybe I'm missing something.
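
As a rough sketch of the behavior described above (if any job doesn't want to cancel, we won't cancel), the flag could be made sticky in the "don't cancel" direction.  The SharedTokenState class and its method names are hypothetical, and the sketch assumes the flag starts out as true before any job has been seen.

```java
// Hypothetical model of the per-token cancel flag; not the patch under review.
final class SharedTokenState {

    // Assumption: cancel at the end by default until some job asks us not to.
    private boolean shouldCancelAtEnd = true;

    // Called once for each job that is submitted with this token.
    void onAppSubmit(boolean evtShouldCancelAtEnd) {
        // If any job sharing the token sets shouldCancelAtEnd to false,
        // we should not cancel the token, regardless of what later jobs ask for.
        if (!evtShouldCancelAtEnd) {
            shouldCancelAtEnd = false;
        }
    }

    boolean shouldCancelAtEnd() {
        return shouldCancelAtEnd;
    }
}
```

Under this sketch, a launcher job submitted with shouldCancelAtEnd set to false keeps the token alive even if every later sub-job asks for cancellation.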

The info log message added in handleAppSubmitEvent is also misleading: it says we are setting shouldCancelAtEnd to whatever the event said, when in reality we only set it sometimes.  It probably needs to be inside the conditional.

I wonder if we should be using a Set instead of a Map to track these tokens.  Adding an already-existing DelegationTokenToRenew to a Set will not change the one already there, but with the Map a sub-job can clobber the existing DelegationTokenToRenew with its own when it does allTokens.put(dtr.token, dtr).
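
The Set-versus-Map difference can be shown with plain java.util collections.  In the sketch below, Dtr is a hypothetical stand-in for DelegationTokenToRenew, and it assumes that equality and hashing are based on the token alone.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical demo of the clobbering concern; Dtr is not the real DelegationTokenToRenew.
final class SetVersusMapDemo {

    static final class Dtr {
        final String token;
        final String appId;

        Dtr(String token, String appId) {
            this.token = token;
            this.appId = appId;
        }

        // Assumption: two entries are "the same" when they wrap the same token.
        @Override
        public boolean equals(Object o) {
            return o instanceof Dtr && ((Dtr) o).token.equals(token);
        }

        @Override
        public int hashCode() {
            return token.hashCode();
        }
    }

    // Map.put replaces the value for an existing key, so the sub-job's entry wins.
    static String ownerAfterMapPuts() {
        Map<String, Dtr> allTokens = new HashMap<>();
        Dtr launcher = new Dtr("t1", "launcher");
        Dtr subJob = new Dtr("t1", "sub-job");
        allTokens.put(launcher.token, launcher);
        allTokens.put(subJob.token, subJob);
        return allTokens.get("t1").appId; // "sub-job"
    }

    // Set.add leaves the set unchanged when an equal element is present, so the launcher's entry stays.
    static String ownerAfterSetAdds() {
        Set<Dtr> allTokens = new HashSet<>();
        allTokens.add(new Dtr("t1", "launcher"));
        allTokens.add(new Dtr("t1", "sub-job")); // returns false, nothing is replaced
        return allTokens.iterator().next().appId; // "launcher"
    }
}
```

A Map could give the same guarantee with putIfAbsent instead of put, so the choice is mostly about which structure makes the "first entry wins" intent harder to break by accident.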

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-2964.1.patch
>
>
> The RM used to globally track the unique set of tokens for all apps.  It remembered the first job that was submitted with the token.  The first job controlled the cancellation of the token.  This prevented completion of sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no notion of the first/main job.  This results in sub-jobs canceling tokens and failing the main job and other sub-jobs.  It also appears to schedule multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM liveness interval) after log aggregation completes.  The result is that an Oozie job, e.g. Pig, that launches many sub-jobs over time will fail if any sub-job is launched >10 min after any sub-job completes.  If all other sub-jobs complete within that 10 min window, then the issue goes unnoticed.

