hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Wed, 03 Oct 2012 20:46:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468816#comment-13468816

Todd Lipcon commented on HDFS-3077:

bq. You list 4 steps and have comments that steps 2 and 4 correspond with the the two phases
of paxos. However you add step 3 in the middle which is not part of paxos

Not sure I follow. Here's the mapping, using "Paxos Made Simple" as a reference (http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf)

Step 2's request (PrepareRecovery) corresponds to Prepare (Phase 1a)
Step 2's response corresponds to "Promise" (Phase 1b)
Step 3's request (AcceptRecovery) corresponds to "Accept!" (Phase 2a)
Step 3's response corresponds to "Accepted" (Phase 2b)

After this point, Paxos itself is actually complete - the final consensus value is learned
once a quorum of nodes have completed Phase 2 for a given proposal.

Step 4's request (Finalize) corresponds to "Learned" or "Commit" which are generally sent
to "Learners" after the consensus phase completes. This is described in Section 2.3 of the
above-referenced paper:
More generally, the acceptors could respond with their acceptances to
some set of distinguished learners, each of which can then inform all the
learners when a value has been chosen.
In our case, the client doubles as the "distinguished learner", and the JNs all double as
the other Learners.

bq. Another concern is that this design has more steps than paxos which is generally considered
complicated to get right

Per above, I don't think there are any "extra" steps. It is a fairly faithful implementation
of Paxos except that we use a side-channel (HTTP from node-to-node) instead of the main message
passing channel to actually communicate the value to be decided.

I am also confident that the implementation is correct, after many CPU-years of randomized
fault testing. If you can find any counter-example I would be really glad to hear about it,
and will immediately drop everything to investigate.

bq. > This is the same behavior that a NameNode takes at startup today in branch-2 –
if there is an entirely empty edit log file, it is removed at startup.
bq. Why did we do this in normal local disk journal? Doesn't the no-op transaction handle
the "empty" case. We had added the non-op transaction to deal with repeated restarts and also
repeated rolls.

The no-op transaction isn't added atomically when the file is created. The file is created
empty, then the header is appended, then the noop (START_LOG_SEGMENT) txn is appended. The
NN could crash between any of these steps. If we open a log and find that it has no transactions
(ie not even the "noop"), then we know we crashed right after opening the log but before writing
anything (maybe not even the header). So, we can safely remove it and pretend the log was
never started.

> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, hdfs-3077.txt,
hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
qjournal-design.tex, qjournal-design.tex
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message