hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2019) Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
Date Tue, 21 Jul 2015 10:58:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634951#comment-14634951 ]

Junping Du commented on YARN-2019:

+1 on the general idea of YARN-3607. However, users may actually have three options when
facing a ZKRMStateStore error:
1. aggressive: fail the RM daemon;
2. conservative: only log the errors, without failing the RM daemon or any applications;
3. relatively conservative: do not fail the RM, but fail the application in some cases (like RM get
These choices suggest that we may not want to force the handling policy for all failures into
a single configuration, although I agree we should consolidate as many of them as possible,
as proposed by YARN-3607.
In this particular case, I would prefer to add a separate configuration (maybe something
like a boolean value for "yarn.resourcemanager.state-store.exit-on-error" or an enum value
for "yarn.resourcemanager.state-store.policy-on-error"?) to let the user choose how to handle
RM state store failures. This leaves users free to choose different options for other failure cases.
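
To make the enum variant of this proposal concrete, here is a minimal sketch in Java. It is
only an illustration, not code from any patch: the PolicyOnError enum, its value names, the
readPolicy and handleStoreError helpers, and the choice of FAIL_RM as the default are all
assumptions, and the property name is the one suggested above, not an existing YARN key.
A plain java.util.Properties stands in for the Hadoop configuration object.

```java
import java.util.Locale;
import java.util.Properties;

public class StateStoreErrorPolicyDemo {

    /** Hypothetical policy values mirroring the three options listed in the comment. */
    enum PolicyOnError {
        FAIL_RM,          // option 1: aggressive, bring the RM daemon down
        LOG_ONLY,         // option 2: conservative, log and keep everything running
        FAIL_APPLICATION  // option 3: keep the RM up, fail only the affected application
    }

    // Property name as suggested in the comment; an assumption, not a shipped key.
    static final String POLICY_KEY = "yarn.resourcemanager.state-store.policy-on-error";

    /** Reads the policy; falls back to FAIL_RM (assumed default) when the key is unset. */
    static PolicyOnError readPolicy(Properties conf) {
        String raw = conf.getProperty(POLICY_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return PolicyOnError.FAIL_RM;
        }
        // Accept values such as "log-only" or "LOG_ONLY".
        String normalized = raw.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        return PolicyOnError.valueOf(normalized);
    }

    /** Returns a description of the action taken, so the sketch has no side effects. */
    static String handleStoreError(PolicyOnError policy, String appId, Exception cause) {
        switch (policy) {
            case FAIL_RM:
                return "exit RM: " + cause.getMessage();
            case FAIL_APPLICATION:
                return "fail " + appId + ": " + cause.getMessage();
            case LOG_ONLY:
            default:
                return "log only: " + cause.getMessage();
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(POLICY_KEY, "fail-application");
        PolicyOnError policy = readPolicy(conf);
        // The application id below is a made-up placeholder.
        System.out.println(handleStoreError(policy, "app_0001",
                new Exception("state store write failed")));
    }
}
```

A boolean "exit-on-error" flag would cover only options 1 and 2; the enum form leaves room
for option 3 and for later additions without introducing another key.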

> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Jian He
>            Priority: Critical
>              Labels: ha
>         Attachments: YARN-2019.1-wip.patch
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal exception
that crashes the RM. As shown in YARN-1924, the cause could be an internal RM HA bug rather than
a truly fatal condition. We should revisit this decision, since the HA feature is designed to protect
a key component, not to disrupt it.

This message was sent by Atlassian JIRA
