hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
Date Thu, 18 Dec 2014 15:43:14 GMT

    [ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251818#comment-14251818 ]

Jason Lowe commented on YARN-2964:
----------------------------------

Thanks for the patch, Jian!  The FindBugs warnings appear to be unrelated.

I'm wondering about the change in the removeApplicationFromRenewal method (or remove).  If a
sub-job completes, won't we remove the token from the allTokens map before the launcher job
has completed?  Then a subsequent sub-job that requests token cancellation can put the token
back in the map and cause the token to be canceled when that sub-job completes.  I think we
need to repeat the logic from the original code before YARN-2704 here, i.e., only remove the
token if the application ID matches.  That way the launcher job's token will remain _the_
token in that collection until the launcher job completes.
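The "only remove if the application ID matches" rule could look roughly like the sketch below.  This
is not the DelegationTokenRenewer code: TokenOwnerSketch, its methods, and the use of plain strings
for tokens and application IDs are all hypothetical, chosen only to show the ownership check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the RM's token tracking.
class TokenOwnerSketch {
    // token -> ID of the application that first registered it (e.g. the launcher job)
    private final Map<String, String> allTokens = new HashMap<>();

    void addApplication(String appId, String token) {
        // The first registrant stays the owner; later sub-jobs do not replace it.
        allTokens.putIfAbsent(token, appId);
    }

    // Returns true only if the token was actually removed.
    boolean removeApplication(String appId, String token) {
        // Map.remove(key, value) removes the entry only when it maps to appId.
        return allTokens.remove(token, appId);
    }

    boolean isTracked(String token) {
        return allTokens.containsKey(token);
    }

    public static void main(String[] args) {
        TokenOwnerSketch sketch = new TokenOwnerSketch();
        sketch.addApplication("launcher", "tokenA");
        sketch.addApplication("subjob1", "tokenA");
        // The sub-job finishing leaves the launcher's entry in place.
        System.out.println(sketch.removeApplication("subjob1", "tokenA"));
        System.out.println(sketch.isTracked("tokenA"));
    }
}
```

With this check a completing sub-job is a no-op for a token it does not own, so the entry survives
until the launcher job itself is removed.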

The comment in the snippet below doesn't match the code: as written, it looks like if any job
sharing the token wants to cancel at the end, then we will cancel at the end.
{code}
          // If any of the jobs sharing the same token set shouldCancelAtEnd
          // to true, we should not cancel the token.
          if (evt.shouldCancelAtEnd) {
            dttr.shouldCancelAtEnd = evt.shouldCancelAtEnd;
          }
{code}
I think both the logic and the comment should be: if any job doesn't want to cancel, then we
won't cancel.  The code seems to be trying to do the opposite, so I'm not sure how the unit
test is passing.  Maybe I'm missing something.

The info log message added in handleAppSubmitEvent is also misleading: it says we are setting
shouldCancelAtEnd to whatever the event said, when in reality we only set it sometimes.  It
probably needs to be inside the conditional.
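A minimal sketch of the two suggestions above (the flag only ever moves to false, and the log line
lives inside the conditional).  CancelFlagSketch, TrackedToken, onAppSubmit, and the list used in
place of a real logger are hypothetical names for illustration, not the patch's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shouldCancelAtEnd handling on app submission.
class CancelFlagSketch {
    static final class TrackedToken {
        boolean shouldCancelAtEnd;

        TrackedToken(boolean shouldCancelAtEnd) {
            this.shouldCancelAtEnd = shouldCancelAtEnd;
        }
    }

    // 'log' stands in for the real logger so the sketch stays self-contained.
    static void onAppSubmit(TrackedToken tracked, boolean evtShouldCancelAtEnd,
                            List<String> log) {
        // If any of the jobs sharing the same token sets shouldCancelAtEnd
        // to false, we should not cancel the token.
        if (!evtShouldCancelAtEnd) {
            tracked.shouldCancelAtEnd = false;
            // Logged only when the flag is actually written.
            log.add("Setting shouldCancelAtEnd to false");
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        TrackedToken token = new TrackedToken(true);
        onAppSubmit(token, true, log);   // no change, nothing logged
        onAppSubmit(token, false, log);  // flag cleared, one log line
        onAppSubmit(token, true, log);   // a later "cancel" request cannot re-enable it
        System.out.println(token.shouldCancelAtEnd + " " + log.size());
    }
}
```

The effect is a logical AND over all jobs sharing the token: the token is canceled at the end only if
every job asked for that.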

I wonder if we should be using a Set instead of a Map to track these tokens.  Adding an already
existing DelegationTokenToRenew to a set will not change the one already there, but with the
map a sub-job can clobber the DelegationTokenToRenew that's already there with its own when
it does the allTokens.put(dtr.token, dtr).
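The difference between the two collections can be seen in a small standalone sketch.  The Renewal
class here is a hypothetical stand-in for DelegationTokenToRenew, and it assumes equality is based
on the token alone; the appId field exists only to show which entry survives.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SetVersusMapSketch {
    // Hypothetical stand-in: equality and hash code depend on the token only.
    static final class Renewal {
        final String token;
        final String appId;

        Renewal(String token, String appId) {
            this.token = token;
            this.appId = appId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Renewal && token.equals(((Renewal) o).token);
        }

        @Override
        public int hashCode() {
            return token.hashCode();
        }
    }

    static String ownerAfterSetAdds() {
        Set<Renewal> tokens = new HashSet<>();
        tokens.add(new Renewal("tokenA", "launcher"));
        // add() leaves the set unchanged when an equal element is already present.
        tokens.add(new Renewal("tokenA", "subjob1"));
        return tokens.iterator().next().appId;
    }

    static String ownerAfterMapPuts() {
        Map<String, Renewal> tokens = new HashMap<>();
        tokens.put("tokenA", new Renewal("tokenA", "launcher"));
        // put() replaces the value stored under an existing key.
        tokens.put("tokenA", new Renewal("tokenA", "subjob1"));
        return tokens.get("tokenA").appId;
    }

    public static void main(String[] args) {
        System.out.println(ownerAfterSetAdds()); // launcher
        System.out.println(ownerAfterMapPuts()); // subjob1
    }
}
```

If the map is kept, putIfAbsent would give the same first-one-wins behavior as the set.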

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-2964.1.patch
>
>
> The RM used to globally track the unique set of tokens for all apps.  It remembered the
> first job that was submitted with the token.  The first job controlled the cancellation of
> the token.  This prevented completion of sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no notion of
> the first/main job.  This results in sub-jobs canceling tokens and failing the main job and
> other sub-jobs.  It also appears to schedule multiple redundant renewals.
> The issue is not immediately obvious because the RM will cancel tokens ~10 min (NM liveness
> interval) after log aggregation completes.  The result is that an oozie job, e.g. pig, that
> launches many sub-jobs over time will fail if any sub-job is launched >10 min after any
> sub-job completes.  If all other sub-jobs complete within that 10 min window, then the issue
> goes unnoticed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
