hadoop-yarn-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
Date Tue, 08 Jul 2014 21:55:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055588#comment-14055588 ]

Robert Joseph Evans commented on YARN-611:

Why is the reset policy created on a per app ATTEMPT basis? Shouldn't it be on a per application
basis?  Wouldn't having more than one WindowsSlideAMRetryCountResetPolicy per application
be a waste, as they will either be running in parallel and racing with each other, or there will
be extra overhead to stop and start them for each application attempt?

Inside WindowsSlideAMRetryCountResetPolicy you create a new Timer.  Each Timer instance creates
a new thread.  I am not sure we really need a new thread for potentially each application,
just so the thread can wake up every few seconds to reset a counter.
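
As a rough illustration of the alternative, a single scheduler thread could serve every application. This is only a sketch under assumptions: SharedRetryResetScheduler, register, unregister, and the AtomicInteger counter are hypothetical names that do not come from the patch, and the periodic reset is a simplification of whatever the policy actually does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: one daemon scheduler thread shared by all applications. */
final class SharedRetryResetScheduler {
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "am-retry-reset");
            t.setDaemon(true);
            return t;
        });

    // One scheduled task per application id (assumed to be a plain String here).
    private final Map<String, ScheduledFuture<?>> tasks = new ConcurrentHashMap<>();

    /** Schedules a periodic counter reset for one application, replacing any earlier task. */
    void register(String appId, AtomicInteger retryCount, long windowMs) {
        ScheduledFuture<?> task = executor.scheduleAtFixedRate(
            () -> retryCount.set(0), windowMs, windowMs, TimeUnit.MILLISECONDS);
        ScheduledFuture<?> previous = tasks.put(appId, task);
        if (previous != null) {
            previous.cancel(false);
        }
    }

    /** Cancels the reset task for an application, if one is registered. */
    void unregister(String appId) {
        ScheduledFuture<?> task = tasks.remove(appId);
        if (task != null) {
            task.cancel(false);
        }
    }

    int registeredCount() {
        return tasks.size();
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

With this shape the thread count stays at one no matter how many applications register, at the cost of all resets sharing that one thread.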

Inside WindowsSlideAMRetryCountResetPolicy.amRetryCountReset we call rmApp.getCurrentAppAttempt()
in a loop.  Why don't we cache it?

I also don't really like how the code handles locking.  To me it always feels bad to hold
a lock while calling into a class that may call back into you, especially from a different
thread.  The WindowsSlideAMRetryCountResetPolicy calls into getAppAttemptId, shouldCountTowardsMaxAttemptRetry,
mayBeLastAttempt, and setMaybeLastAttemptFlag of RmAppAttemptImpl. RmAppAttemptImpl calls
into start, stop, and recover for the resetPolicy.  Right now I don't think there are any
potential deadlocks because RmAppAttemptImpl never holds a lock while interacting directly
with resetPolicy, but if it ever does then it could deadlock.  I'm not sure of a good way
to fix this, except perhaps through comments in the ResetPolicy interface specifying that
start/stop/recover will never be called while holding a lock for RMAppAttempt or RMApp.
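
To make the locking concern concrete, here is a minimal sketch of the discipline described above: update state under the lock, release it, and only then call into the policy. AttemptSketch and ResetPolicySketch are hypothetical stand-ins, not the real RmAppAttemptImpl or ResetPolicy classes.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the reset policy interface. */
interface ResetPolicySketch {
    void start();
    void stop();
}

/** Hypothetical stand-in for an app attempt that owns a lock and a policy. */
final class AttemptSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final ResetPolicySketch policy;
    private boolean running;

    AttemptSketch(ResetPolicySketch policy) {
        this.policy = policy;
    }

    void launch() {
        lock.lock();
        try {
            running = true; // mutate our own state while holding the lock
        } finally {
            lock.unlock();
        }
        // Call out only after the lock is released, so a callback from the
        // policy (possibly on another thread) cannot deadlock against us.
        policy.start();
    }

    boolean isRunning() {
        lock.lock();
        try {
            return running;
        } finally {
            lock.unlock();
        }
    }

    /** Exposed only so the sketch can check that the call-out happens unlocked. */
    boolean lockHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }
}
```

The policy's start method can then safely call back into isRunning, because the attempt's lock is free by the time start runs.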

> Add an AM retry count reset window to YARN RM
> ---------------------------------------------
>                 Key: YARN-611
>                 URL: https://issues.apache.org/jira/browse/YARN-611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.3-alpha
>            Reporter: Chris Riccomini
>            Assignee: Xuan Gong
>         Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM before failing
> the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the
> NM will timeout, which counts as a failure for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) YARN jobs,
> since the machine (or NM) that the AM is running on will eventually need to be restarted (or
> the machine/NM will fail). In such an event, the AM has not done anything wrong, but this
> is counted as a "failure" by the RM. Since the retry count for the AM is never reset, eventually,
> at some point, the number of machine/NM failures will result in the AM failure count going
> above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the
> RM will mark the job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM is "well
> behaved", and it's safe to reset its failure count back to zero. Every time an AM fails the
> RmAppImpl would check the last time that the AM failed. If the last failure was less than
> retry-count-window-ms ago, and the new failure count is > max-retries, then the job should
> fail. If the AM has never failed, the retry count is < max-retries, or if the last failure
> was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if
> the last failure was outside the retry-count-window-ms, then the failure count should be set
> back to 0.
> This would give developers a way to have well-behaved AMs run forever, while still failing
> mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look at app.attempts,
> and see if there have been more than max-retries failures in the last retry-count-window-ms
> milliseconds. If there have, then the job should fail, if not, then the job should go forward.
> Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent,
> so that the RmAppImpl can check the time of the failure.
> Thoughts?
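
The window check proposed in the quoted description can be sketched roughly as follows. This is one possible reading (a sliding window over failure timestamps), not the code in any of the attached patches; FailureWindow and its method are hypothetical names, and the caller is assumed to pass in the failure time in milliseconds.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: counts AM failures inside a sliding time window. */
final class FailureWindow {
    private final int maxRetries;
    private final long windowMs;
    // Failure timestamps in milliseconds, oldest first.
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureWindow(int maxRetries, long windowMs) {
        this.maxRetries = maxRetries;
        this.windowMs = windowMs;
    }

    /**
     * Records a failure at nowMs and reports whether the job should fail,
     * meaning more than maxRetries failures fall within the last windowMs.
     */
    boolean recordFailureAndShouldFail(long nowMs) {
        // Forget failures older than the window; they no longer count.
        while (!failureTimes.isEmpty() && nowMs - failureTimes.peekFirst() > windowMs) {
            failureTimes.removeFirst();
        }
        failureTimes.addLast(nowMs);
        return failureTimes.size() > maxRetries;
    }
}
```

With maxRetries set to 2 and a 1000 ms window, failures at 0, 100, and 200 ms fail the job on the third call, while failures at 0, 100, and 2000 ms do not, because the first two have aged out by the time the third arrives.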

This message was sent by Atlassian JIRA
