hadoop-hdfs-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HDFS-3955) QJM: Make acceptRecovery() atomic
Date Wed, 19 Sep 2012 18:58:09 GMT

     [ https://issues.apache.org/jira/browse/HDFS-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-3955.

       Resolution: Fixed
    Fix Version/s: QuorumJournalManager (HDFS-3077)
     Hadoop Flags: Reviewed
> QJM: Make acceptRecovery() atomic
> ---------------------------------
>                 Key: HDFS-3955
>                 URL: https://issues.apache.org/jira/browse/HDFS-3955
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3955.txt
> Per one of the TODOs in Journal.java, the {{acceptRecovery()}} code path currently
> lacks atomicity. In particular, the following actions are executed non-atomically:
> - Download a new edits_inprogress_N from some other node.
> - Persist the Paxos recovery file to disk.
> If the JN crashes between these two steps, the edits_inprogress file may be left with
> different data than the Paxos data remaining on disk from a previous recovery attempt.
> This causes the next {{prepareRecovery()}} to fail with an AssertionError.
> I discovered this by randomly injecting a fault between the two steps, and then running
> the randomized fault test on a cluster. This resulted in some AssertionErrors in the test
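
For readers unfamiliar with this kind of bug, here is a minimal Java sketch of one common way to close such a crash window: stage the downloaded segment in a temporary file, persist the recovery record, and only then rename the staged file into place, with a startup step that rolls forward or discards a leftover staged file. This is an illustration under stated assumptions, not the contents of hdfs-3955.txt. The class AtomicAcceptSketch, the Fault hook, the file names other than edits_inprogress_N, and the "length:hash" record format are all hypothetical, and the sketch omits fsync calls that real durable code would need.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

/** Hypothetical sketch of a crash-safe accept step; this is NOT the real Journal.java. */
public class AtomicAcceptSketch {
    /** Hypothetical fault-injection points simulating a crash between steps. */
    public enum Fault { NONE, AFTER_DOWNLOAD, AFTER_PAXOS }

    private final Path edits, editsTmp, paxos, paxosTmp;

    public AtomicAcceptSketch(Path dir, long segmentTxId) {
        edits = dir.resolve("edits_inprogress_" + segmentTxId);
        editsTmp = dir.resolve("edits_inprogress_" + segmentTxId + ".tmp"); // assumed name
        paxos = dir.resolve("paxos_" + segmentTxId);                        // assumed name
        paxosTmp = dir.resolve("paxos_" + segmentTxId + ".tmp");            // assumed name
    }

    /** Returns false if the simulated crash fired before all steps completed. */
    public boolean acceptRecovery(byte[] downloaded, Fault fault) throws IOException {
        // Step 1: stage the downloaded segment; the live edits file is untouched.
        Files.write(editsTmp, downloaded);
        if (fault == Fault.AFTER_DOWNLOAD) return false;

        // Step 2: persist the recovery record via write-then-rename.
        // ATOMIC_MOVE replaces an existing target on POSIX file systems; behavior
        // when the target exists is implementation specific elsewhere.
        Files.write(paxosTmp, describe(downloaded).getBytes(StandardCharsets.UTF_8));
        Files.move(paxosTmp, paxos, StandardCopyOption.ATOMIC_MOVE);
        if (fault == Fault.AFTER_PAXOS) return false;

        // Step 3: publish the staged segment under its final name.
        Files.move(editsTmp, edits, StandardCopyOption.ATOMIC_MOVE);
        return true;
    }

    /** On restart: roll forward a staged segment the record vouches for, else discard it. */
    public void recoverOnStartup() throws IOException {
        Files.deleteIfExists(paxosTmp);
        if (!Files.exists(editsTmp)) return;
        boolean recordMatches = Files.exists(paxos)
                && describe(Files.readAllBytes(editsTmp)).equals(read(paxos));
        if (recordMatches) {
            Files.move(editsTmp, edits, StandardCopyOption.ATOMIC_MOVE);
        } else {
            Files.delete(editsTmp);
        }
    }

    /** True when the live edits file agrees with the persisted recovery record. */
    public boolean isConsistent() throws IOException {
        if (!Files.exists(paxos)) return true; // no recovery has been accepted yet
        return Files.exists(edits)
                && describe(Files.readAllBytes(edits)).equals(read(paxos));
    }

    public byte[] liveEdits() throws IOException {
        return Files.readAllBytes(edits);
    }

    // Stand-in for a real checksum: length plus a weak hash, for illustration only.
    private static String describe(byte[] data) {
        return data.length + ":" + Arrays.hashCode(data);
    }

    private static String read(Path p) throws IOException {
        return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("jn-sketch");
        AtomicAcceptSketch jn = new AtomicAcceptSketch(dir, 1L);
        jn.acceptRecovery("old edits".getBytes(StandardCharsets.UTF_8), Fault.NONE);
        // Simulate a crash after the record is written but before the rename.
        jn.acceptRecovery("new edits".getBytes(StandardCharsets.UTF_8), Fault.AFTER_PAXOS);
        jn.recoverOnStartup();
        System.out.println("consistent after restart: " + jn.isConsistent());
    }
}
```

In this sketch a crash after the download leaves the old segment and old record in agreement, and a crash after the record is persisted is repaired at startup because the staged file matches the record. Whether the actual patch takes this approach is not stated in the message.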

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
