hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
Date Tue, 22 Dec 2015 03:00:51 GMT

    [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067480#comment-15067480 ]

Jun Gong commented on YARN-3480:

Thanks for explaining.

These cases make removing attempts complex. We remove attempts asynchronously. If RMStateStore
does not transition to 'FENCED' on failed operations, we might fail to remove some attempts
while succeeding in removing others. Suppose there are 4 attempts: attempt01, attempt02,
attempt03 and attempt04. We want to remove 2 of them (attempt01 and attempt02), but removing
attempt01 fails. The remaining attempts are then attempt01, attempt03 and attempt04, which are
not consistent. When recovering these attempts after an RM restart, recovery will fail
because attempt02 cannot be recovered.
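
The scenario above can be replayed with a small, self-contained Python sketch. It is only an
illustration, not RMStateStore code: the set of integers, remove_attempts, find_gaps and the
'failing' argument are all hypothetical, and the gap check merely stands in for whatever
consistency the real recovery path expects.

```python
def remove_attempts(store, to_remove, failing):
    """Try to remove each attempt independently; return the ids whose removal failed.

    `failing` is a hypothetical set of attempt ids whose removal fails,
    standing in for a failed state-store operation.
    """
    failed = []
    for attempt_id in to_remove:
        if attempt_id in failing:
            failed.append(attempt_id)   # removal failed, entry stays in the store
        else:
            store.discard(attempt_id)   # removal succeeded
    return failed


def find_gaps(store):
    """Return attempt ids missing between the lowest and highest stored id."""
    if not store:
        return []
    return [i for i in range(min(store), max(store) + 1) if i not in store]


store = {1, 2, 3, 4}                                  # attempt01 .. attempt04
failed = remove_attempts(store, [1, 2], failing={1})  # removing attempt01 fails

print(sorted(store))     # [1, 3, 4]
print(find_gaps(store))  # [2]
```

Because each removal succeeds or fails on its own, one failure leaves attempt01 in place while
attempt02 is already gone, which is the inconsistent state described above.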

To make things simple, how about just removing attempts if HA is enabled (or 'RMFailFast' is

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch, YARN-3480.04.patch,
>                      YARN-3480.05.patch, YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch,
>                      YARN-3480.09.patch, YARN-3480.10.patch, YARN-3480.11.patch
>
> When RM HA is enabled and running containers are kept across attempts, apps are more
> likely to finish successfully with more retries (attempts), so it is better to set
> 'yarn.resourcemanager.am.max-attempts' larger. However, that makes RMStateStore
> (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower.
> It might be better to set a maximum number of attempts to be stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small
> value, the number of retried attempts might be very large. So we need to delete some
> attempts stored in RMStateStore.

