hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4087) Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs
Date Tue, 08 Sep 2015 01:19:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734102#comment-14734102 ]

Bikas Saha commented on YARN-4087:

Repeating my comments from YARN-2019 here:
There are 2 kinds of state store operations - reads and writes. Writes may in turn be of
2 kinds - critical and non-critical. E.g. saving an application submission is critical. Saving
node information is perhaps not critical. It would affect system correctness if critical write
operation errors are allowed to be ignored. We end up with YARN-4118 and other such potential
issues. Essentially we are turning a write-ahead log into something that is optional. I don't
see how the system can make stable reliability guarantees if write-ahead log failures are non-fatal.
On the other hand, read errors or non-critical write errors should not block RM progress but
do need to be potentially retried. That also does not seem to be addressed in the patch.
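
To make the distinction above concrete, here is a minimal sketch of an error policy that treats critical write failures as fatal and queues everything else for retry. All names here (StoreErrorPolicySketch, OpKind, Action, onStoreError) are hypothetical and are not taken from the YARN code base or the attached patches.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch; none of these names exist in YARN.
final class StoreErrorPolicySketch {
    enum OpKind { READ, CRITICAL_WRITE, NON_CRITICAL_WRITE }
    enum Action { FAIL_FATALLY, RETRY_LATER }

    // Names of failed operations that should be attempted again later.
    private final Queue<String> retryQueue = new ArrayDeque<>();

    /** Decide what to do when a state-store operation of the given kind fails. */
    Action onStoreError(OpKind kind, String opName) {
        if (kind == OpKind.CRITICAL_WRITE) {
            // A lost write-ahead record (e.g. an application submission)
            // cannot be ignored without breaking recovery guarantees.
            return Action.FAIL_FATALLY;
        }
        // Reads and non-critical writes do not block progress,
        // but they are remembered so they can be retried.
        retryQueue.add(opName);
        return Action.RETRY_LATER;
    }

    /** Number of operations currently waiting for a retry. */
    int pendingRetries() {
        return retryQueue.size();
    }
}
```

Which operations count as critical is a design decision for the store; the sketch only shows that the decision is made per operation kind rather than by a single global switch.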

> Followup fixes after YARN-2019 regarding RM behavior when state-store error occurs
> ----------------------------------------------------------------------------------
>                 Key: YARN-4087
>                 URL: https://issues.apache.org/jira/browse/YARN-4087
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Jian He
>             Fix For: 2.7.2, 2.6.2
>         Attachments: YARN-4087-branch-2.6.patch, YARN-4087.1.patch, YARN-4087.2.patch,
>                      YARN-4087.3.patch, YARN-4087.5.patch, YARN-4087.6.patch, YARN-4087.7.patch
> Several fixes:
> 1. Set YARN_FAIL_FAST to be false by default, since this makes more sense in production
> 2. If HA is enabled and if there's any state-store error, after the retry operation failed,
> we always transition RM to standby state.  Otherwise, we may see two active RMs running.
> YARN-4107 is one example.
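
The two fixes listed in the issue description can be read as a small decision function, sketched below. The class, enum, and method names are hypothetical. The third branch (log and continue when neither HA nor fail-fast is enabled) is an assumption inferred from the discussion, not something the issue text states.

```java
// Hypothetical sketch of the behavior described in the issue; not the actual patch.
final class StoreFailureHandlerSketch {
    enum Outcome { TRANSITION_TO_STANDBY, SHUT_DOWN, LOG_AND_CONTINUE }

    /** Called once the state-store retry attempts have been exhausted. */
    static Outcome afterRetriesExhausted(boolean haEnabled, boolean failFast) {
        if (haEnabled) {
            // With HA, always step down so that two RMs are never active at once.
            return Outcome.TRANSITION_TO_STANDBY;
        }
        if (failFast) {
            // Fail-fast is off by default after this change; when on, treat the error as fatal.
            return Outcome.SHUT_DOWN;
        }
        // Assumption: otherwise the error is logged and the RM keeps running.
        return Outcome.LOG_AND_CONTINUE;
    }
}
```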

This message was sent by Atlassian JIRA