hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
Date Mon, 01 Sep 2014 06:28:22 GMT

    [ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117073#comment-14117073 ]
Zhijie Shen commented on YARN-611:

[~xgong], thanks for working on this issue. I have a couple of comments on the latest patch:
1. *API Change*: I'm not sure it is really necessary to have completely standalone
proto messages for ApplicationRetryPolicy's implementations. It sounds like overkill to me.
In fact, MaxApplicationRetriesPolicy seems to be a special case of WindowedApplicationRetriesPolicy
where the window size is infinitely large, such that the number of failures is never
reset. Therefore, why not simply add one more field (i.e., resetTimeWindow) to ApplicationSubmissionContext?
When resetTimeWindow = 0 or -1, the window size is unbounded, and the failure count
is never reset. On the other hand, when resetTimeWindow is set to > 0, failures
that happen outside the window are not taken into account.

Moreover, a minor issue here is that ApplicationRetryPolicy is actually not a real abstraction:
it carries the flags of both implementations' contexts.
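As a sketch of that interpretation (the helper class below is hypothetical; only the resetTimeWindow semantics come from the suggestion above), the single-field approach could look like:

```java
// Hypothetical sketch: how one resetTimeWindow field on
// ApplicationSubmissionContext could subsume both policies.
// A non-positive window means "never reset" (the max-retries-only case).
public class RetryWindowSemantics {

    /** Returns true if a failure at failureTimeMs still counts at nowMs. */
    public static boolean countsTowardRetries(long resetTimeWindowMs,
                                              long failureTimeMs,
                                              long nowMs) {
        if (resetTimeWindowMs <= 0) {
            // Unbounded window: every past failure counts forever,
            // which is exactly the MaxApplicationRetriesPolicy behavior.
            return true;
        }
        // Bounded window: only failures inside the trailing window count.
        return nowMs - failureTimeMs < resetTimeWindowMs;
    }
}
```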

2. *Failure Window*: If I understand correctly, WindowedApplicationRetriesPolicy uses a jumping
window instead of a *moving* window, which may be problematic. Here's an example. Say
the window size is 2H, and maxAttempts is 100. From 0:00 to 1:00, 1 failure happens.
From 1:00 to 2:00, 98 failures happen. At 2:00 the reset logic is triggered, so
all 99 failures are no longer taken into account. From 2:00 to 3:00,
2 failures happen. The total failure count at this time is 2, because the previous 99 failures have been
reset. However, from the point of view at 3:00, looking back over the 2H window, 100 failures
(98 + 2) have happened. In fact, the job should have run out of retry quota at this point.

IMHO, the reasonable way is to use a suitable data structure (e.g., a fixed-size FIFO queue)
to keep track of the number of failures that happened within the configured time window,
and to update the data structure whenever a failure happens.
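A minimal sketch of such a structure (class and method names here are illustrative, not taken from the patch): record each failure's timestamp, evict timestamps older than the window, and the queue size is then the failure count over the trailing window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a moving failure window. No reset thread is
// needed; eviction happens on each recordFailure/failuresInWindow call.
public class FailureWindow {
    private final long windowMs;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    public FailureWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Record a failure at the given time and evict expired entries. */
    public void recordFailure(long nowMs) {
        failureTimes.addLast(nowMs);
        evictExpired(nowMs);
    }

    /** Number of failures in the trailing window ending at nowMs. */
    public int failuresInWindow(long nowMs) {
        evictExpired(nowMs);
        return failureTimes.size();
    }

    private void evictExpired(long nowMs) {
        // Timestamps are appended in order, so only the head can expire.
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() >= windowMs) {
            failureTimes.removeFirst();
        }
    }
}
```

With this structure, the jumping-window anomaly above disappears: the 98 failures from 1:00 to 2:00 still count at 3:00 because they are inside the trailing 2H window.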

3. *Multi-threading*: I'm not sure it is going to work for a big cluster with hundreds
or even thousands of concurrent applications to have an individual thread per app to reset the failure
count. Though WindowedApplicationRetriesPolicy is particularly designed for long-running
services, I don't think we have restricted normal applications from using it, and it's not
reasonable to make this restriction. Therefore, the RM is likely to end up with that many threads
if all apps choose to use this policy. However, AFAIK, the number of threads in a process
is limited. More importantly, the reset logic is not computation-intensive, so it wastes
thread resources to have one thread for each app.

Maybe we can use a thread pool, or even a single thread (e.g., a service of the RM),
to take care of all the apps' reset windows. Moreover, IMHO, if the aforementioned data structure
is defined properly, we may not need a separate thread to do the reset work at all, because the failure
count over the configured window is updated every time a failure happens.
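If a timer-based reset were still preferred, one shared scheduler could serve all apps instead of one thread per app. A rough sketch (class names, pool size, and reset semantics here are illustrative assumptions, not from the patch):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a single shared scheduler services the reset
// windows of all apps. The reset work is cheap, so a small pool can
// serve thousands of apps without one dedicated thread per app.
public class SharedResetService {
    private final ScheduledExecutorService scheduler =
        Executors.newScheduledThreadPool(2);
    private final Map<String, AtomicInteger> failureCounts =
        new ConcurrentHashMap<>();

    /** Register an app; its failure count is zeroed every windowMs. */
    public void register(String appId, long windowMs) {
        failureCounts.put(appId, new AtomicInteger(0));
        scheduler.scheduleAtFixedRate(
            () -> failureCounts.get(appId).set(0),
            windowMs, windowMs, TimeUnit.MILLISECONDS);
    }

    /** Record a failure and return the current in-window count. */
    public int recordFailure(String appId) {
        return failureCounts.get(appId).incrementAndGet();
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```

Note this sketch still implements the jumping window criticized in point 2; it only addresses the thread-count concern. The queue-based structure above avoids both problems.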

4. *Affecting RMStateStore*: I'm not sure why it is necessary to persist the "end time" into
RMStateStore; it doesn't seem to be really used for resetting the window. One thing I can imagine
about RM restarting is how to store the failure count over the configured window,
if we want to make sure that after restarting, the RM is still able to trace back over
the whole past time window for the failure count. But I think we can do that separately.
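If we did want restart-safe windows later, one option (purely a sketch; the class and encoding are hypothetical, not part of any patch) is to persist the in-window failure timestamps themselves rather than a single end time, so a restarted RM can rebuild the trailing window exactly:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: encode/decode the list of in-window failure
// timestamps as a simple comma-separated string for the state store.
public class WindowStateCodec {

    /** Serialize failure timestamps for persistence. */
    public static String encode(List<Long> failureTimes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < failureTimes.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(failureTimes.get(i));
        }
        return sb.toString();
    }

    /** Rebuild the timestamp list after an RM restart. */
    public static List<Long> decode(String encoded) {
        List<Long> times = new ArrayList<>();
        if (encoded.isEmpty()) return times;
        for (String part : encoded.split(",")) {
            times.add(Long.parseLong(part));
        }
        return times;
    }
}
```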

> Add an AM retry count reset window to YARN RM
> ---------------------------------------------
>                 Key: YARN-611
>                 URL: https://issues.apache.org/jira/browse/YARN-611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.3-alpha
>            Reporter: Chris Riccomini
>            Assignee: Xuan Gong
>         Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch,
> YARN-611.4.rebase.patch, YARN-611.5.patch
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM before failing
> the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the
> NM will timeout, which counts as a failure for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) YARN jobs,
> since the machine (or NM) that the AM is running on will eventually need to be restarted (or
> the machine/NM will fail). In such an event, the AM has not done anything wrong, but this
> is counted as a "failure" by the RM. Since the retry count for the AM is never reset, eventually,
> at some point, the number of machine/NM failures will result in the AM failure count going
> above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the
> RM will mark the job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM is "well
> behaved", and it's safe to reset its failure count back to zero. Every time an AM fails the
> RmAppImpl would check the last time that the AM failed. If the last failure was less than
> retry-count-window-ms ago, and the new failure count is > max-retries, then the job should
> fail. If the AM has never failed, the retry count is < max-retries, or if the last failure
> was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if
> the last failure was outside the retry-count-window-ms, then the failure count should be set
> back to 0.
> This would give developers a way to have well-behaved AMs run forever, while still failing
> mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look at app.attempts,
> and see if there have been more than max-retries failures in the last retry-count-window-ms
> milliseconds. If there have, then the job should fail, if not, then the job should go forward.
> Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent,
> so that the RmAppImpl can check the time of the failure.
> Thoughts?

This message was sent by Atlassian JIRA
