hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Tue, 02 Oct 2012 02:33:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467420#comment-13467420

Todd Lipcon commented on HDFS-3077:

bq. What I dislike is that in the example in 2.10.6, recovery completed successfully, the new writer
starts writing normally and then fails - you delete a segment that was successfully finalized
at a quorum of JNs - this is the only example of a successful transaction being deleted. In this
situation, won't the normal processing select 151 to be the common transaction and hence result
in segment edits_151_151? Or is it that you can't distinguish between this case and the
case where the quorum failed?

I think maybe the description in 2.10.6 is unclear/confusing. It isn't deleting any segment
which has been finalized on a quorum of JNs. Here's the full sequence of events:

1. NN1 is writing, and successfully calls sendEdits(150) to write txid 150. All nodes ACK.
2. NN1 sends finalizeSegment(1-150). All nodes ACK.
3. NN1 sends startLogSegment(151). All nodes ACK and create files called edits_inprogress_151,
which at this point are actually empty.
4. NN1 sends sendEdits(151-153) with the first batch of edits for the new log file. It reaches
only JN1 before NN1 crashes. Therefore transactions 151-153 were not "committed", and may
or may not be recovered.
5. NN2 starts recovery. JN2 and JN3 at this point have the empty log file edits_inprogress_151.
Because it's empty, they delete it. This is the same behavior that a NameNode takes at startup
today in branch-2 -- if there is an entirely empty edit log file, it is removed at startup.
6. NN2 for some reason does not talk to JN1 here (most likely because JN1 was located on the
same node as NN1 which just crashed). So, it sees that there were no valid transactions after
txid 150, and does not need to perform any recovery.
7. NN2 starts writing again at txid=151. It's able to successfully write because it can speak
to JN2 and JN3 still. Imagine that it writes just one txn (151) and then crashes.
8. At this point, we are in the state referred to by the second table in Section 2.10.6: all
three nodes have edits_inprogress_151, and JN1 has more transactions than JN2 and JN3. Yet
we should use JN2 or JN3 as the recovery source, since their segment is from a newer writer (i.e. the _committed_
txns are on those nodes, not on JN1).
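The selection rule the steps above imply can be sketched roughly as follows. This is not the actual QuorumJournalManager code; the class and field names here are hypothetical, and it assumes each JN's recovery response carries the epoch of the writer that produced its in-progress segment. The key point is that the writer epoch is compared first, and segment length only breaks ties within the same epoch:

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical summary of one JN's in-progress segment during recovery. */
class SegmentState {
    final String journalNode;
    final long writerEpoch; // epoch of the writer that produced the segment
    final long lastTxId;    // highest txid present in the segment

    SegmentState(String journalNode, long writerEpoch, long lastTxId) {
        this.journalNode = journalNode;
        this.writerEpoch = writerEpoch;
        this.lastTxId = lastTxId;
    }
}

public class RecoveryExample {
    /** Prefer the segment from the newest writer; only then compare lengths. */
    static SegmentState chooseRecoverySource(List<SegmentState> responses) {
        return responses.stream()
            .max(Comparator.comparingLong((SegmentState s) -> s.writerEpoch)
                           .thenComparingLong(s -> s.lastTxId))
            .orElseThrow();
    }

    public static void main(String[] args) {
        // The state from step 8: JN1 still holds txns 151-153 from the crashed
        // first writer (epoch 1), while JN2/JN3 hold only txid 151 written by
        // the newer writer NN2 (epoch 2).
        List<SegmentState> responses = List.of(
            new SegmentState("JN1", 1, 153),
            new SegmentState("JN2", 2, 151),
            new SegmentState("JN3", 2, 151));
        SegmentState source = chooseRecoverySource(responses);
        // JN1's longer log loses to the newer epoch on JN2/JN3.
        System.out.println(source.journalNode + " epoch=" + source.writerEpoch);
    }
}
```

If length were compared first, recovery would pick JN1 and resurrect the uncommitted txns 151-153, overwriting the committed txn 151 from NN2 -- exactly the loss the epoch-first ordering prevents.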

Is this explanation a little clearer? If so, I will amend the doc. If I'm misunderstanding
your question, can you point out the situation where you think it might lose a committed transaction?
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, hdfs-3077.txt,
hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
qjournal-design.tex, qjournal-design.tex
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
