hadoop-yarn-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
Date Wed, 08 Apr 2015 19:52:14 GMT

    [ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485890#comment-14485890 ]

Daryn Sharp commented on YARN-3055:

The renew at job submission isn't the problem.  It's actually very desirable.  Years back,
a job submitted with bad tokens - one that was destined to fail - would be launched anyway.  The
tasks failed to connect, IPC-level retries occurred, then higher-level retries occurred, and
YARN generally caught all exceptions and retried.  Tasks were retried, perhaps the app attempt
was retried, etc.  In the end, a job that _clearly was going to fail_ might tie up cluster
resources for 20+ minutes.  Why was it launched when a failed renew could have prevented the
problem?  Not to mention the renewer was hardcoded to assume the expiration interval was 24h...
So much for being able to stress test the renewer with <1m expirations.

The potential DoS problem arises when a token has reached its end-of-life expiration.  Let's say
the token can be renewed twice.  The third and subsequent renews return the same expiration:
# t1 = submit + renew
# t2 = t1 + renew
# t3 = t2
# t4 = t2
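
The sequence above can be modeled with a small sketch.  It assumes, purely for illustration,
that a renew extends the token by a fixed interval and is capped at a maximum date; the class
name, method name and numbers are hypothetical and not taken from the Hadoop source.  Like the
list above, it treats each renew as happening at the previous expiration.

```java
class RenewCapSketch {
    /**
     * Hypothetical model of a renew: extend by renewInterval, but never
     * past maxDate (the token's end of life).
     */
    static long renew(long now, long renewInterval, long maxDate) {
        return Math.min(now + renewInterval, maxDate);
    }

    public static void main(String[] args) {
        long interval = 10;  // illustrative renew interval
        long submit = 0;     // illustrative submission time
        long maxDate = 20;   // end of life: allows two effective renews

        long t1 = renew(submit, interval, maxDate); // submit + renew
        long t2 = renew(t1, interval, maxDate);     // t1 + renew
        long t3 = renew(t2, interval, maxDate);     // capped, same as t2
        long t4 = renew(t3, interval, maxDate);     // capped, same as t2

        System.out.println(t1 + " " + t2 + " " + t3 + " " + t4);
    }
}
```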

The renew timers fire at 90% of the delta between now and the next expiration.  So as the end
of life expiration approaches, the timer fires with increasing frequency.  50 threads doing
that virtually non-stop would not be pretty.  The solution is to stop renewing when the next
expiration equals the last expiration.  That can be addressed in another jira that's not a
blocker, because if tokens aren't being renewed forever it's a rare situation.
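
To make the timer behavior concrete, here is a minimal sketch of the schedule described above
and of the proposed stop condition.  It is a model only: the class and method names are
hypothetical, times are plain millisecond values, and nothing here is the actual
DelegationTokenRenewer code.

```java
import java.util.ArrayList;
import java.util.List;

class RenewTimerSketch {
    /** Delay until the next firing: 90% of the time left until expiration. */
    static long nextDelay(long now, long expiration) {
        return (expiration - now) * 9 / 10; // integer math avoids float rounding
    }

    /** Proposed stop condition: the renew did not move the expiration. */
    static boolean shouldStopRenewing(long lastExpiration, long nextExpiration) {
        return nextExpiration == lastExpiration;
    }

    /** Delays between firings when the expiration never advances (no stop check). */
    static List<Long> delaysWithoutStop(long now, long expiration, int firings) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < firings; i++) {
            long delay = nextDelay(now, expiration);
            delays.add(delay);
            now += delay; // the timer fires, the renew returns the same expiration
        }
        return delays;
    }

    public static void main(String[] args) {
        // Expiration stuck at 100000 ms: each delay is a tenth of the previous one.
        System.out.println(delaysWithoutStop(0, 100_000, 6));
        // With the stop condition, the first unchanged expiration ends the loop.
        System.out.println(shouldStopRenewing(100_000, 100_000));
    }
}
```

In this model the delays shrink by a factor of ten per firing until they reach zero, which is
the "virtually non-stop" case; checking shouldStopRenewing after each renew cuts it off at the
first unchanged expiration.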

> The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
> ------------------------------------------------------------------------------------------
>                 Key: YARN-3055
>                 URL: https://issues.apache.org/jira/browse/YARN-3055
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: security
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Blocker
>         Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch
> After YARN-2964, there is only one timer to renew the token if it's shared by jobs.
> In {{removeApplicationFromRenewal}}, when removing a token that is shared by other
> jobs, we do not cancel the token.
> Meanwhile, we should not cancel the _timerTask_ either, nor remove the token from {{allTokens}}.
> Otherwise, already-submitted applications that share this token will not get it renewed
> any more, and for newly submitted applications that share this token, the token will be renewed
> For example, we have 3 applications: app1, app2, and app3, and they share token1. Consider the
> following scenario:
> *1).* app1 is submitted first, then app2, and then app3. In this case, there is only
> one renewal timer for token1, and it is scheduled when app1 is submitted.
> *2).* app1 finishes, and the renewal timer is cancelled. token1 will not be renewed
> any more, but app2 and app3 still use it, so there is a problem.
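
The scenario in the quoted description can be sketched as simple bookkeeping: keep the renewal
timer for a shared token until the last application using it is removed.  This is an
illustrative model under stated assumptions, not the YARN-3055 patch; the class, its fields
and its methods are hypothetical, and a set of token ids stands in for real timer tasks.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SharedTokenTimerSketch {
    // Hypothetical bookkeeping: token id -> applications still using it.
    private final Map<String, Set<String>> appsByToken = new HashMap<>();
    // Stand-in for scheduled timer tasks: token ids that have a live timer.
    private final Set<String> activeTimers = new HashSet<>();

    void addApplication(String appId, String tokenId) {
        appsByToken.computeIfAbsent(tokenId, k -> new HashSet<>()).add(appId);
        activeTimers.add(tokenId); // a Set, so there is still one timer per token
    }

    void removeApplication(String appId, String tokenId) {
        Set<String> apps = appsByToken.get(tokenId);
        if (apps == null) {
            return;
        }
        apps.remove(appId);
        // Cancel the timer only when no remaining application shares the token.
        if (apps.isEmpty()) {
            appsByToken.remove(tokenId);
            activeTimers.remove(tokenId);
        }
    }

    boolean hasTimer(String tokenId) {
        return activeTimers.contains(tokenId);
    }

    public static void main(String[] args) {
        SharedTokenTimerSketch renewer = new SharedTokenTimerSketch();
        renewer.addApplication("app1", "token1");
        renewer.addApplication("app2", "token1");
        renewer.addApplication("app3", "token1");

        renewer.removeApplication("app1", "token1");
        System.out.println(renewer.hasTimer("token1")); // app2 and app3 remain

        renewer.removeApplication("app2", "token1");
        renewer.removeApplication("app3", "token1");
        System.out.println(renewer.hasTimer("token1")); // last user is gone
    }
}
```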

This message was sent by Atlassian JIRA
