hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
Date Tue, 05 May 2015 03:34:08 GMT

    [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527847#comment-14527847 ]

Jun Gong commented on YARN-3480:

[~jianhe], sorry for not specifying our scenario: RM HA is enabled, ZK is used to store apps'
info, and most apps running in the cluster are long-running (service) apps. yarn.resourcemanager.am.max-attempts
is set to 10000 because we have not patched YARN-611 and we want apps to retry more times.
There are 10K apps with 1~10000 attempts stored in ZK. It takes about 6 minutes to recover
those apps when the RM fails over.

> 1. How often do you see an app fail with a large number of attempts? If it's limited to
> a few apps, I wouldn't worry so much.
> 2. How much slower is it in reality in your case? We've done some benchmarks: recovering 10k apps
> (with 1 attempt each) on ZK is pretty fast, within 20 seconds or so.

Please see above. I think it will be OK for map-reduce jobs, but it might not be OK for service
apps that have been running for several months.

> 3. Limiting the attempts to be recorded means we are losing history. It's a trade-off.

Yes, I agree.

> Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
> ----------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch
> When RM HA is enabled and running containers are kept across attempts, apps are more
> likely to finish successfully with more retries (attempts), so it is better to set
> 'yarn.resourcemanager.am.max-attempts' larger. However, this makes RMStateStore (FileSystem/HDFS/ZK)
> store more attempts and makes the RM recovery process much slower. It might be better to limit
> the maximum number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small
> value, the number of retried attempts might be very large. So we need to delete some attempts
> stored in RMAppImpl and RMStateStore.
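
For illustration, the idea of capping the number of stored attempts could be sketched roughly as follows. This is a minimal, self-contained sketch and not the attached patch or the actual RMStateStore API: the class BoundedAttemptHistory, its methods, and the maxStoredAttempts parameter are hypothetical names introduced here. It assumes attempts are identified by integer IDs and that the oldest attempts are the ones to drop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical illustration only; this is NOT the RMStateStore API.
 * Keeps at most maxStoredAttempts attempt IDs for one application and
 * reports which older attempts should be deleted from the backing store.
 */
class BoundedAttemptHistory {
    // Assumed configurable limit on attempts kept per application.
    private final int maxStoredAttempts;
    // Attempt IDs in insertion order, oldest first.
    private final Deque<Integer> storedAttemptIds = new ArrayDeque<>();

    BoundedAttemptHistory(int maxStoredAttempts) {
        if (maxStoredAttempts < 1) {
            throw new IllegalArgumentException("maxStoredAttempts must be >= 1");
        }
        this.maxStoredAttempts = maxStoredAttempts;
    }

    /** Records a new attempt and returns the IDs evicted to stay within the limit. */
    List<Integer> addAttempt(int attemptId) {
        storedAttemptIds.addLast(attemptId);
        List<Integer> evicted = new ArrayList<>();
        while (storedAttemptIds.size() > maxStoredAttempts) {
            // Oldest attempt goes first; a real store would delete its record here.
            evicted.add(storedAttemptIds.removeFirst());
        }
        return evicted;
    }

    /** Returns a copy of the attempt IDs currently retained, oldest first. */
    List<Integer> storedAttempts() {
        return new ArrayList<>(storedAttemptIds);
    }
}
```

With a limit like this, the number of attempt records to read back on recovery is bounded by (number of apps) x (limit) rather than by the total number of retries, at the cost of losing the history of evicted attempts.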
