hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3845) Fixes for edge cases in QJM recovery protocol
Date Thu, 23 Aug 2012 02:24:42 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440006#comment-13440006

Todd Lipcon commented on HDFS-3845:

The edge cases discovered center around the treatment of a particularly nasty failure case
that was partially described in the design doc, where a minority of nodes receive the first
transaction of a new segment, and then some very specific faults occur during a subsequent
recovery of the same segment. One example, which is covered by the new targeted test cases:

*Initial writer*
- Writing to 3 JNs: JN0, JN1, JN2: 
- A log segment with txnid 1 through 100 succeeds.
- The first transaction in the next segment only goes to JN0 before the writer crashes (eg
it is partitioned)
*Recovery by another writer*
- The new NN starts recovery and talks to all three. Thus, it sees that the newest log segment
which needs recovery is 101. 
- It sends the prepareRecovery(101) call, and decides that the recovery length for 101 is
only the 1 transaction.
- It sends acceptRecovery(101-101) to only JN0, before crashing

This yields the following state:
- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101
- JN1: 1-100 finalized, 101_inprogress.empty
- JN2: 1-100 finalized, 101_inprogress.empty
 (the .empty files got moved aside during recovery)

If a new writer now comes along and writes to only JN1 and JN2, and crashes during that segment,
we are left with the following:

- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101
- JN1: 1-100 finalized, 101_inprogress with txns 101-150
- JN2: 1-100 finalized, 101_inprogress with txns 101-150

A recovery at this point, with the old protocol, would incorrectly truncate the log to transaction
ID 101.

The solution is to modify the protocol by cribbing an idea from ZAB. In addition to having
the "accepted epoch" below which no RPCs will be responded to, we also have a "writer epoch",
which is set only when a new writer has successfully performed recovery and starts to write
to a node. In the case above, the "writer epoch" gets incremented by the new writer writing
transactions 101 through 150, so that the augmented state looks like:

- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101, lastWriterEpoch=1
- JN1: 1-100 finalized, 101_inprogress with txns 101-150, lastWriterEpoch=3
- JN2: 1-100 finalized, 101_inprogress with txns 101-150, lastWriterEpoch=3

The recovery protocol then uses the fact that JN1 and JN2 have a higher "writer epoch" to
distinguish JN0's replica as stale, and perform the correct recovery.

This idea was briefly mentioned in the design doc, but rejected because I thought we could
get away with a shortcut hack. But as described above, the hack doesn't work when you have
this particular sequence of faults. So, fixing it in a more principled way makes sense at
this point.

The other improvement in the recovery protocol is that, previously, the JournalNode response
to the prepareRecovery RPC without correct information as to the in-progress/finalized state
of the segment. This was because it used the proposed recovery SegmentInfo, instead of the
state on disk, when there was a previously accepted proposal. The new protocol changes this
behavior to now assert that, if there was a previous accepted recovery, the state on disk
matches (in length) what was accepted, and then to send that back to the client. This allows
the client to distinguish a recovery that was already finalized from one that did not yet
reach the finalization step, allowing for better sanity checks. With the writerEpoch change
above, I don't this change was necessary for correctness, but it is an extra layer of assurance
that we don't accidentally change the length of a segment if any replica has been finalized.

With these improvements in place, I've been able to run several machine-weeks of the randomized
fault tests without discovering any more issues in the protocol (I hooked it up to a Streaming
job to run in parallel on a 100-node test cluster).
> Fixes for edge cases in QJM recovery protocol
> ---------------------------------------------
>                 Key: HDFS-3845
>                 URL: https://issues.apache.org/jira/browse/HDFS-3845
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
> After running the randomized tests from HDFS-3800 for several machine-weeks, we identified
a few edge cases in the recovery protocol that weren't properly designed for. This JIRA is
to modify the protocol slightly to address these edge cases.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message