hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts
Date Mon, 18 May 2015 02:50:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547473#comment-14547473 ]

Vinod Kumar Vavilapalli commented on YARN-3480:

bq. we might need to keep failed attempts that are in the validity window, so that is the minimum
number of attempts that we should keep. So when apps specify how much they want the platform
to remember, we need to consider it as another minimum number of attempts that we should keep.

What I proposed is a global limit on attempts-to-remember that can be overridden to a lower
value by individual apps if needed. So, yes, as you say, this global limit should
usually be large enough that the RM can remember _at least_ the attempts that can happen within one failure-validity-interval, for all apps.
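
A minimal sketch of how the two ideas above could combine: a global limit that an app may only lower, plus a floor that keeps failed attempts still inside the failure-validity window. Everything here is hypothetical and for illustration only: the class `AttemptRetention`, the `Attempt` type, and the method names are not YARN APIs, and treating in-window failed attempts as always retained is an assumption drawn from the quoted suggestion, not from any patch on this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are real YARN classes or settings.
final class AttemptRetention {

    /** Minimal stand-in for one stored app attempt. */
    static final class Attempt {
        final int id;
        final long finishTimeMs;
        final boolean failed;

        Attempt(int id, long finishTimeMs, boolean failed) {
            this.id = id;
            this.finishTimeMs = finishTimeMs;
            this.failed = failed;
        }
    }

    /** The global limit applies unless the app asks for a lower (positive) value. */
    static int effectiveLimit(int globalLimit, Integer appLimit) {
        if (appLimit == null || appLimit <= 0) {
            return globalLimit;
        }
        return Math.min(globalLimit, appLimit);
    }

    /**
     * Returns the attempts to keep, oldest first. An attempt is kept if it is
     * among the newest {@code limit} attempts, or if it failed within the
     * validity window (assumed floor, so failure counting stays correct).
     */
    static List<Attempt> retain(List<Attempt> attemptsOldestFirst, int limit,
                                long validityIntervalMs, long nowMs) {
        List<Attempt> kept = new ArrayList<>();
        int size = attemptsOldestFirst.size();
        for (int i = 0; i < size; i++) {
            Attempt a = attemptsOldestFirst.get(i);
            boolean amongNewest = i >= size - limit;
            boolean failedInWindow = validityIntervalMs > 0
                    && a.failed
                    && nowMs - a.finishTimeMs <= validityIntervalMs;
            if (amongNewest || failedInWindow) {
                kept.add(a);
            }
        }
        return kept;
    }
}
```

With this shape, a global limit that is too small for a given validity interval does not drop in-window failures; it only means more than `limit` attempts are kept for that app.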

bq. It makes recovery faster and does not lose any attempts' history. However, it will
make the recovery process a little more complicated. The former method (removing attempts) is
more concise and works just like logrotate; if we can accept the absence of some attempts' history
information, I would prefer it.

Without doing this, we will unnecessarily be forcing apps to lose history simply because the
platform cannot recover quickly enough.

Thinking more, how about we only have (limits + asynchronous recovery) for services, once
YARN-1039 goes in? Non-service apps are not expected to have a lot of app-attempts anyway.
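
One possible reading of "asynchronous recovery", sketched under stated assumptions: the most recent attempt is loaded on the recovery path, and the older attempt history is loaded in the background so that no history is lost. The class `AsyncAttemptRecovery`, the `Recovered` holder, and the `loader` function are hypothetical names for illustration; this is not how the RM or RMStateStore is actually structured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: not a YARN API. "loader" stands in for a state-store read.
final class AsyncAttemptRecovery {

    /** Latest attempt is available immediately; older history arrives later. */
    static final class Recovered<T> {
        final T latest; // null when the app has no stored attempts
        final CompletableFuture<List<T>> olderHistory;

        Recovered(T latest, CompletableFuture<List<T>> olderHistory) {
            this.latest = latest;
            this.olderHistory = olderHistory;
        }
    }

    static <T> Recovered<T> recover(List<Integer> attemptIdsOldestFirst,
                                    Function<Integer, T> loader) {
        if (attemptIdsOldestFirst.isEmpty()) {
            List<T> none = new ArrayList<>();
            return new Recovered<T>(null, CompletableFuture.completedFuture(none));
        }
        int last = attemptIdsOldestFirst.size() - 1;

        // Synchronous part: only the newest attempt blocks recovery.
        T latest = loader.apply(attemptIdsOldestFirst.get(last));

        // Asynchronous part: older attempts load on a background thread.
        List<Integer> olderIds = new ArrayList<>(attemptIdsOldestFirst.subList(0, last));
        CompletableFuture<List<T>> older = CompletableFuture.supplyAsync(() -> {
            List<T> out = new ArrayList<>(olderIds.size());
            for (Integer id : olderIds) {
                out.add(loader.apply(id));
            }
            return out;
        });
        return new Recovered<T>(latest, older);
    }
}
```

The trade-off mentioned in the thread shows up here: callers that need the full history must wait on `olderHistory`, which is the extra complexity that plain attempt removal avoids.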

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch, YARN-3480.04.patch
> When RM HA is enabled and running containers are kept across attempts, apps are more
> likely to finish successfully with more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts'
> larger. However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes
> the RM recovery process much slower. It might be better to set a maximum number of attempts to be stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small
> value, the number of retried attempts might be very large. So we need to delete some attempts stored in
> RMStateStore and RMStateStore.

This message was sent by Atlassian JIRA