hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Tue, 03 Apr 2012 20:24:22 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245691#comment-13245691

Todd Lipcon commented on HDFS-3077:

bq. Terminology - JournalDaemon or JournalNode. I prefer JournalDaemon because my plan was
to run them in the same process space as the namenode. A JournalDeamon could also be stand-alone

I prefer JournalNode because every other daemon we have is a *Node. If you're running it inside
another process, I think we would just call it a "JournalService" -- or an "embedded JournalNode".
I think of a daemon as a standalone process.

bq. I like the idea of quorum writes and maintaining the queue. 3092 design currently uses
timeout to declare a JD slow and fail it. We were planning to punting on it until we had first

OK. This part I have done in the patch attached here and works pretty well, so far. If you
want, I'm happy to separate out the quorum completion code to commit it ASAP so we can share
code here.

bq. newEpoch() is called fence() in HDFS-3092. My preference is to use the name fence(). I
was using version # which is called epoch. I think the name epoch sounds better. The key difference
is that version # is generated from znode in HDFS-3092.

As I had commented earlier on this ticket, I originally was planning to do something similar
to you, bootstrapping off of ZK to generate epoch numbers. But then, when I got into coding,
I realized that this algorithm is actually not so hard to implement, and adding a dependency
on ZK actually adds to the combinatorics of things to think about. I think the "standalone"
nature of the approach outweighs what benefit we might get by reusing ZK.

bq. So two namenodes cannot use the same epoch number. I think there is a bug with the approach
you have described, stemming from the fact that two namenodes can use the same epoch and step
3 in 2.4 can be completed independent of quorum. This is shown in Hari's example.

How can step 3 in section 2.4 be completed independent of quorum? Step 4 indicates that it
requires a quorum of nodes to respond successfully to the {{newEpoch}} message. Here's an

Initial state:

1. Two NNs (NN1 and NN2) enter step 1 concurrently. They both receive responses indicating
{{lastPromisedEpoch==1}} from all of the JNs.
2. They both propose {{newEpoch(2)}}. The behavior of the JN ensures that it will only respond
success to either NN1 or NN2, but not both (since it will fail if the proposedEpoch <=
So, either NN1 or NN2 gets success from a majority. The other node will only get success from
a minority, and thus will abort.

Note that with message losses or failures, it's possible for _neither_ of the nodes to get
a quorum in the case of a race. That's OK, since we expect that an external leader election
framework will eventually assist such that only one NN is trying to become active, and then
that NN will win.

Note that the epoch algorithm is cribbed from ZAB, see page 7 of Yahoo tech report YL-2010-0007.
The mapping from ZAB terminology is:
||ZAB term||QJournal term||
|CEPOCH(e)|Response to getLastPromisedEpoch()|
|ACK-E(...)|success response to newEpoch()|

> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3077-partial.txt, qjournal-design.pdf
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message