hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1602) All failed RMStateStore operations should not be RMFatalEvents
Date Thu, 16 Jan 2014 17:58:19 GMT

    [ https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873670#comment-13873670 ]

Bikas Saha commented on YARN-1602:
----------------------------------

Not all store events are app related. Some of them store secret keys, and those cannot be ignored.
What errors are we seeing in the store? If they are non-transient errors, then the RM should
probably stop. If they are transient errors, I remember discussing this offline with [~vinodkv] and
[~jianhe]. The summary is that the state store client (e.g. the HDFS client)
should retry enough times to cover transient errors in the store.
Now that the RM has HA states, we should ideally not kill the RM but just transitionToStandby().
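
As a rough illustration of the policy described above (not code from the YARN source): retry transient store errors a bounded number of times, stop retrying on a non-transient error, and on failure prefer a transition to standby over killing the RM when HA is enabled. The names StoreRetrySketch, TransientStoreException, StoreOp, Outcome and runWithRetries are all hypothetical and exist only for this sketch. Treating "retries exhausted" the same way as a non-transient error is also an assumption.

```java
import java.io.IOException;

public class StoreRetrySketch {
    /** Hypothetical marker for errors worth retrying; not a Hadoop class. */
    static class TransientStoreException extends IOException {
        TransientStoreException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for a single state store operation. */
    interface StoreOp {
        void run() throws IOException;
    }

    /** What the caller should do once the operation has been attempted. */
    enum Outcome { SUCCEEDED, TRANSITION_TO_STANDBY, FATAL }

    static Outcome runWithRetries(StoreOp op, int maxAttempts, boolean haEnabled) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return Outcome.SUCCEEDED;
            } catch (TransientStoreException e) {
                // Transient: fall through and try again until attempts run out.
            } catch (IOException e) {
                // Non-transient: retrying will not help, so stop immediately.
                break;
            }
        }
        // Assumption: with HA enabled, step down instead of failing the RM.
        return haEnabled ? Outcome.TRANSITION_TO_STANDBY : Outcome.FATAL;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Outcome outcome = runWithRetries(() -> {
            // Fail twice with a transient error, then succeed.
            if (++calls[0] < 3) {
                throw new TransientStoreException("store briefly unavailable");
            }
        }, 5, true);
        // prints: SUCCEEDED after 3 attempts
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

In a real deployment the retry loop would more likely live inside the store client itself, as the comment suggests, with the RM only seeing the final success or failure.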

> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
>                 Key: YARN-1602
>                 URL: https://issues.apache.org/jira/browse/YARN-1602
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Currently, if a state store operation fails, depending on the exception, either an RMFatalEvent.STATE_STORE_FENCED
> or an RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in the RM failing.
> Instead, we should probably kill the application corresponding to the store operation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
