hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1602) All failed RMStateStore operations should not be RMFatalEvents
Date Mon, 20 Jan 2014 20:19:24 GMT

    [ https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876826#comment-13876826 ]

Karthik Kambatla commented on YARN-1602:

The above error happens only after running a number of Oozie jobs on the RM for a while,
so I don't think it is due to bad configuration. Transitioning both RMs to Standby would
therefore only result in the two RMs alternating as the Active until the application gets
killed for exceeding max-attempts. The only downside I see is that other applications
might also be killed in the process.

bq. The RMs will stop touching the store and the admin can fix it.
The admin might be able to fix it by explicitly deleting some znodes from the store, but that
would require understanding the store layout. 

Let me investigate more and see what the underlying cause of this issue is. Maybe that
would simplify what we should do in such cases.

> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>                 Key: YARN-1602
>                 URL: https://issues.apache.org/jira/browse/YARN-1602
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
> Currently, if a state store operation fails, depending on the exception, either an RMFatalEvent.STATE_STORE_FENCED
or an RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in the RM failing.
Instead, we should probably kill the application corresponding to the store operation.
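
A minimal sketch of the difference between the current and the proposed handling, as
described above. It is a self-contained model, not the actual ResourceManager code:
StoreFailurePolicy, StoreFencedException, Outcome and both methods are hypothetical names
made up for illustration. Falling back to STATE_STORE_OP_FAILED when a failed operation
is not tied to any application is also an assumption; the description does not cover
that case.

```java
import java.io.IOException;

// Hypothetical model of the policy discussed in this issue.
// None of these types are the real YARN classes.
final class StoreFailurePolicy {

    /** Stand-in for the exception a store raises when it has been fenced. */
    static final class StoreFencedException extends IOException {
        StoreFencedException(String message) {
            super(message);
        }
    }

    /** Possible reactions to a failed state store operation. */
    enum Outcome {
        STATE_STORE_FENCED,    // fatal event: the store was fenced
        STATE_STORE_OP_FAILED, // fatal event: results in the RM failing
        KILL_APPLICATION       // proposed: kill only the affected application
    }

    /** Current behavior: every non-fencing failure becomes a fatal event. */
    static Outcome currentBehavior(Exception failure) {
        return (failure instanceof StoreFencedException)
                ? Outcome.STATE_STORE_FENCED
                : Outcome.STATE_STORE_OP_FAILED;
    }

    /**
     * Proposed behavior: a non-fencing failure kills the application that the
     * store operation belongs to. appId is null when the operation is not tied
     * to an application; keeping the fatal event in that case is an assumption.
     */
    static Outcome proposedBehavior(Exception failure, String appId) {
        if (failure instanceof StoreFencedException) {
            return Outcome.STATE_STORE_FENCED;
        }
        return (appId != null)
                ? Outcome.KILL_APPLICATION
                : Outcome.STATE_STORE_OP_FAILED;
    }
}
```

With this model, a plain IOException on a store operation for some application maps to
STATE_STORE_OP_FAILED under currentBehavior and to KILL_APPLICATION under
proposedBehavior, while a fencing exception maps to STATE_STORE_FENCED in both.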
