hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
Date Thu, 12 Sep 2013 18:08:56 GMT

     [ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-1185:
-----------------------------

    Summary: FileSystemRMStateStore can leave partial files that prevent subsequent recovery
 (was: FileSystemRMStateStore doesn't use temporary files when writing data)

bq. The RM will not start if there is anything wrong with the stored state. So if some write
is partial/empty it will not start.

My concern with that approach is that it requires manual intervention from ops when there
is a problem, and the current scheme can lead to that situation because the RM can
crash at arbitrary points.  I think the RM should try to prevent that situation from occurring
and/or be able to recover automatically if it does occur.  The
RM could skip the corrupted info and continue if the info is deemed not critical to the overall
recovery process.  Then we're only involving ops if the corruption is very serious.
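
A minimal sketch of the skip-and-continue idea, assuming each stored record carries a
length and checksum header so that a partial or empty file can be detected.  The record
format and the names TolerantRecovery, encode, decodeOrNull and loadAll are hypothetical
and are not part of FileSystemRMStateStore; the sketch uses the local filesystem through
java.nio.file rather than the Hadoop FileSystem API, and it treats every record as
non-critical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class TolerantRecovery {
    // Hypothetical header: 4-byte payload length followed by 8-byte CRC32 value.
    private static final int HEADER_BYTES = 12;

    // Wraps a payload in the hypothetical length + checksum header.
    public static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return ByteBuffer.allocate(HEADER_BYTES + payload.length)
                .putInt(payload.length)
                .putLong(crc.getValue())
                .put(payload)
                .array();
    }

    // Returns the payload, or null if the record is empty, truncated or corrupted.
    public static byte[] decodeOrNull(byte[] raw) {
        if (raw.length < HEADER_BYTES) {
            return null; // empty file or header cut short
        }
        ByteBuffer buf = ByteBuffer.wrap(raw);
        int length = buf.getInt();
        long expectedCrc = buf.getLong();
        if (length != buf.remaining()) {
            return null; // payload shorter or longer than the header claims
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue() == expectedCrc ? payload : null;
    }

    // Loads every readable record in dir; names of bad records are added to skipped
    // instead of aborting the whole recovery.
    public static Map<String, byte[]> loadAll(Path dir, List<String> skipped)
            throws IOException {
        Map<String, byte[]> loaded = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                byte[] payload = decodeOrNull(Files.readAllBytes(file));
                if (payload == null) {
                    skipped.add(name);
                } else {
                    loaded.put(name, payload);
                }
            }
        }
        return loaded;
    }
}
```

A caller could log the skipped names and refuse to start only when a record it considers
critical (for example the security state) appears in that list.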

{quote}
So we could do the following.
Storing app data may continue to be optimistic and, since that's the main workload, we continue
to do what we do today.
Storing global data (mainly the security stuff) can change to be more atomic.
{quote}

That sounds reasonable, especially if the RM is more robust during recovery.  I understand
it's a tradeoff between reliability and performance, especially with the RPC overhead when
talking to HDFS and the potentially high rate of state churn.

Thanks for the informative discussion, [~bikassaha]!  Updating the summary to better reflect
the problem and not a particular solution.
                
> FileSystemRMStateStore can leave partial files that prevent subsequent recovery
> -------------------------------------------------------------------------------
>
>                 Key: YARN-1185
>                 URL: https://issues.apache.org/jira/browse/YARN-1185
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>
> FileSystemRMStateStore writes directly to the destination file when storing state. However,
if the RM were to crash in the middle of the write, the recovery method could encounter a
partially-written file and either crash outright during recovery or silently load incomplete
state.
> To avoid this, the data should be written to a temporary file and renamed to the destination
file afterwards.
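
The write-then-rename approach described above can be sketched as follows.  This is an
illustration only: it uses java.nio.file on a local filesystem instead of the Hadoop
FileSystem API, and the class name AtomicStateWriter, the method writeAtomically and the
".tmp" suffix are hypothetical choices, not taken from the YARN code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicStateWriter {
    // Writes data to a sibling temporary file, then renames it onto dest.
    // A crash before the rename leaves only the ".tmp" file behind, so a
    // reader of dest never sees a partially written record.
    public static void writeAtomically(Path dest, byte[] data) throws IOException {
        Path tmp = dest.resolveSibling(dest.getFileName() + ".tmp");
        Files.write(tmp, data); // creates the temp file or truncates a stale one
        // ATOMIC_MOVE asks for an atomic rename; whether an existing dest is
        // replaced is implementation specific according to the Javadoc.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstate");
        Path dest = dir.resolve("app_0001");
        writeAtomically(dest, "example-state".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(dest), StandardCharsets.UTF_8));
    }
}
```

Recovery code using this scheme would also need to ignore or clean up leftover ".tmp"
files, since those are the only place a partial write can now appear.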

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
