hadoop-hdfs-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-3955) QJM: Make acceptRecovery() atomic
Date Wed, 19 Sep 2012 02:49:07 GMT
Todd Lipcon created HDFS-3955:

             Summary: QJM: Make acceptRecovery() atomic
                 Key: HDFS-3955
                 URL: https://issues.apache.org/jira/browse/HDFS-3955
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: ha
    Affects Versions: QuorumJournalManager (HDFS-3077)
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon

Per one of the TODOs in Journal.java, the {{acceptRecovery()}} code path is currently not atomic.
In particular, the following two actions are executed non-atomically:
- Download a new edits_inprogress_N from some other node.
- Persist the Paxos recovery file to disk.

If the JournalNode (JN) crashes between these two steps, the edits_inprogress file may be left
containing different data than the Paxos data left on disk by a previous recovery attempt.
This causes the next {{prepareRecovery()}} to fail with an AssertionError.
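
The crash window described above can be reproduced in a toy model. The sketch below is not the
code in Journal.java and is not necessarily how this issue should be fixed: the class, the method
names, the file names (including the ".tmp" suffix), and the rule that the segment file and the
Paxos file must hold identical contents are all assumptions made for illustration. It contrasts
the ordering from the report with one possible staged ordering, in which the segment is downloaded
under a temporary name, the Paxos data is persisted, and the temporary file is then renamed into
place, with a startup step that either completes or discards a half-done accept. Atomic writing of
the Paxos file itself is glossed over.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Toy model of a JournalNode storage directory; NOT the real Journal.java. */
public class AcceptRecoverySketch {
    // Hypothetical names standing in for edits_inprogress_N and the Paxos recovery file.
    static final String EDITS = "edits_inprogress_1";
    static final String PAXOS = "paxos_1";
    static final String TMP = "edits_inprogress_1.tmp";

    static void write(Path dir, String name, String data) throws IOException {
        Files.write(dir.resolve(name), data.getBytes(StandardCharsets.UTF_8));
    }

    static String read(Path dir, String name) throws IOException {
        return new String(Files.readAllBytes(dir.resolve(name)), StandardCharsets.UTF_8);
    }

    /** Assumed invariant: the segment on disk matches what the Paxos file recorded. */
    static boolean consistent(Path dir) throws IOException {
        return read(dir, EDITS).equals(read(dir, PAXOS));
    }

    /** Ordering from the report; crashAfter == 1 simulates a crash between the two steps. */
    static void acceptNonAtomic(Path dir, String segment, int crashAfter) throws IOException {
        write(dir, EDITS, segment);   // step 1: "download" the new segment in place
        if (crashAfter == 1) return;  // simulated crash: remaining steps never run
        write(dir, PAXOS, segment);   // step 2: persist the Paxos recovery data
    }

    /** Staged ordering: the live segment only ever changes through the final rename. */
    static void acceptStaged(Path dir, String segment, int crashAfter) throws IOException {
        write(dir, TMP, segment);     // step 1: download under a temporary name
        if (crashAfter == 1) return;
        write(dir, PAXOS, segment);   // step 2: persist the Paxos recovery data
        if (crashAfter == 2) return;
        promote(dir);                 // step 3: rename into place
    }

    static void promote(Path dir) throws IOException {
        // Per the Javadoc, whether ATOMIC_MOVE replaces an existing target is
        // implementation specific; on POSIX file systems it maps to rename().
        Files.move(dir.resolve(TMP), dir.resolve(EDITS), StandardCopyOption.ATOMIC_MOVE);
    }

    /** Run at startup: roll a half-done accept forward, or discard a stray download. */
    static void recover(Path dir) throws IOException {
        if (!Files.exists(dir.resolve(TMP))) return;
        if (read(dir, TMP).equals(read(dir, PAXOS))) {
            promote(dir);                    // Paxos data was persisted: finish the rename
        } else {
            Files.delete(dir.resolve(TMP));  // crashed before persisting: drop the download
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("jn-sketch");
        write(dir, EDITS, "txns 1-3");
        write(dir, PAXOS, "txns 1-3");       // left over from a previous recovery attempt
        acceptNonAtomic(dir, "txns 1-5", 1);
        System.out.println("non-atomic, crash between steps: consistent=" + consistent(dir));
    }
}
```

In this model, a crash at either point of acceptStaged() followed by recover() leaves the two files
in agreement: the accept is either rolled forward or has no visible effect on the live segment.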

I discovered this by randomly injecting a fault between the two steps and then running the
randomized fault test on a cluster, which produced AssertionErrors in the test logs.

