hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2964) RM prematurely cancels tokens for jobs that submit jobs (oozie)
Date Thu, 18 Dec 2014 20:49:14 GMT

    [ https://issues.apache.org/jira/browse/YARN-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252218#comment-14252218 ]

Jason Lowe commented on YARN-2964:
----------------------------------

bq. If launcher job first gets added to the appTokens map, DelegationTokenRenewer will not
add DelegationTokenToRenew instance for the sub-job.

Ah, sorry, I missed this critical change from the original patch.  However, if we don't add
the delegation token for each sub-job, then I think we have a problem with the following use case:

# Oozie launcher submits a MapReduce sub-job
# MapReduce job starts
# Oozie launcher job leaves
# MapReduce job is now running with a token that the RM has "forgotten" and that won't be automatically renewed

We might have had the same issue in this case prior to YARN-2704, since the token would be
pulled from the set when the launcher completed.
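
The use case above can be sketched with a toy model. This is only an illustration under stated assumptions, not the RM's DelegationTokenRenewer code: the class TokenTrackingSketch, its methods, and the string ids are all hypothetical, and the only behaviour modelled is "a token already tracked for one app is not added again for a later app".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only (hypothetical names); not the ResourceManager implementation. */
class TokenTrackingSketch {
    // app id -> tokens this sketch tracks on behalf of that app
    private final Map<String, Set<String>> appTokens = new HashMap<>();
    // every token currently tracked (and therefore renewed) for some app
    private final Set<String> allTokens = new HashSet<>();

    /** Registers an app. A token already tracked for another app is NOT added again. */
    void addApplication(String appId, String token) {
        Set<String> tracked = appTokens.computeIfAbsent(appId, k -> new HashSet<>());
        if (allTokens.add(token)) {
            tracked.add(token);
        }
    }

    /** Drops the app and forgets every token that was tracked for it. */
    void applicationFinished(String appId) {
        Set<String> tracked = appTokens.remove(appId);
        if (tracked != null) {
            allTokens.removeAll(tracked);
        }
    }

    /** True while some app still holds a tracked entry for the token. */
    boolean isRenewed(String token) {
        return allTokens.contains(token);
    }

    /** True while the app has been registered and has not finished. */
    boolean isRunning(String appId) {
        return appTokens.containsKey(appId);
    }
}
```

In this sketch, registering "launcher" and then "mr-subjob" with the same token, followed by applicationFinished("launcher"), leaves "mr-subjob" running while isRenewed returns false for its token, which is the situation described in the last step of the list.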

> RM prematurely cancels tokens for jobs that submit jobs (oozie)
> ---------------------------------------------------------------
>
>                 Key: YARN-2964
>                 URL: https://issues.apache.org/jira/browse/YARN-2964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-2964.1.patch, YARN-2964.2.patch
>
>
> The RM used to globally track the unique set of tokens for all apps.  It remembered the
first job that was submitted with the token.  The first job controlled the cancellation of
the token.  This prevented completion of sub-jobs from canceling tokens used by the main job.
> As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no notion of
the first/main job.  This results in sub-jobs canceling tokens and failing the main job and
other sub-jobs.  It also appears to schedule multiple redundant renewals.
> The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveliness
interval) after log aggregation completes.  As a result, an oozie job (e.g. pig) that launches
many sub-jobs over time will fail if any sub-job is launched >10 min after any
sub-job completes.  If all other sub-jobs complete within that 10 min window, then the issue
goes unnoticed.
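
The timing condition in the description can be written out as a small check. This is a sketch only: CancelWindowSketch and launchFails are hypothetical names, the 10 minute delay is the approximate figure quoted above rather than a value read from any configuration, and times are measured from an arbitrary common start.

```java
import java.time.Duration;

/** Toy arithmetic only (hypothetical names); not RM code. */
class CancelWindowSketch {
    // Assumed delay between a sub-job completing and the shared token being cancelled.
    static final Duration CANCEL_DELAY = Duration.ofMinutes(10);

    /**
     * Returns true if a sub-job launched at launchAt would find the shared token
     * already cancelled, given that some earlier sub-job completed at earlierSubJobDone.
     */
    static boolean launchFails(Duration earlierSubJobDone, Duration launchAt) {
        Duration cancelledAt = earlierSubJobDone.plus(CANCEL_DELAY);
        return launchAt.compareTo(cancelledAt) > 0;
    }
}
```

For example, with an earlier sub-job completing at minute 5, a launch at minute 12 falls inside the window, while a launch at minute 20 falls outside it and would fail in this model.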



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
