hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Fri, 05 Oct 2012 19:44:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470593#comment-13470593 ]

Todd Lipcon commented on HDFS-3077:

bq. You raised the objection that this breaks the Journal abstraction. Think of this as an
"info-field" of the special no-op transaction where the journal impl specific information
is stored; 

This would be problematic for several reasons:
1) "rollEdits" is not a JournalManager operation. The JournalManager treats edits as opaque
things written by the higher level FSEditLog code. Thus it cannot inject/modify the operations.
2) If the JournalManager is meant to modify the transaction content, this implies that two
different JournalManagers would produce different values for the same transaction. Thus, the
locally-stored edit log segment would differ in contents from a remotely stored edit log segment.
This makes me really nervous: we should see multiple copies of a log as identical replicas
of the same information, not adulterated with any storage-specific info.
3) In order to address the above issues, we'd have to add QJM-specific code into the NameNode,
and introduce the concept of epochs into the generic interfaces. This "bleed" of QJM concepts
into the main source code is something we are explicitly trying to avoid by introducing the
JournalManager API.
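
A minimal sketch of the property argued for in points 1-3, assuming a simplified journal interface. All names here (OpaqueJournalSketch, Journal, LocalJournal, QuorumJournal, logEdit) are hypothetical and are not the real Hadoop classes. The upper layer serializes each edit once and hands the same bytes to every journal; the quorum-style journal keeps its epoch as its own metadata rather than writing it into the transaction, so the replicas stay byte-identical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: every type and method name here is hypothetical,
// not the real Hadoop JournalManager or FSEditLog API.
class OpaqueJournalSketch {

    /** A journal stores the bytes it is handed and never rewrites them. */
    interface Journal {
        void write(byte[] serializedEdit);
        byte[] contents();
    }

    /** Stand-in for a locally stored edit log segment. */
    static final class LocalJournal implements Journal {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        @Override public void write(byte[] e) { out.write(e, 0, e.length); }
        @Override public byte[] contents() { return out.toByteArray(); }
    }

    /** Stand-in for a quorum journal: the epoch lives beside the log, not inside it. */
    static final class QuorumJournal implements Journal {
        private final long epoch; // storage-specific metadata, never mixed into edits
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        QuorumJournal(long epoch) { this.epoch = epoch; }
        long epoch() { return epoch; }
        @Override public void write(byte[] e) { out.write(e, 0, e.length); }
        @Override public byte[] contents() { return out.toByteArray(); }
    }

    /** The upper layer serializes once and fans the same bytes out to all journals. */
    static void logEdit(String op, Journal... journals) {
        byte[] bytes = op.getBytes(StandardCharsets.UTF_8);
        for (Journal j : journals) {
            j.write(bytes);
        }
    }

    public static void main(String[] args) {
        LocalJournal local = new LocalJournal();
        QuorumJournal quorum = new QuorumJournal(7L);
        logEdit("START_LOG_SEGMENT txid=1", local, quorum);
        logEdit("MKDIR /tmp txid=2", local, quorum);
        System.out.println("replicas identical: "
                + Arrays.equals(local.contents(), quorum.contents()));
    }
}
```

If the journal were instead allowed to fill in an "info-field" of the transaction, the two contents() arrays in this sketch would diverge, which is the concern raised in point 2.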

I am also thinking back to our discussion last summer during the HDFS-1073 work (particularly
HDFS-2018 and HDFS-1580), where you had argued that segments themselves should be considered
an implementation detail of the JournalManager. So, adding information which is required for
correctness into the START_LOG_SEGMENT written by the NameNode layer takes us farther away
from that goal instead of closer to it.

bq. Suresh and I have been looking at the design and compared it to Paxos and Zab in detail
and have concluded that the design is closer to ZAB than Paxos...

Sure, it's very close to ZAB as well, which I mentioned above in the discussion. I honestly
see ZAB and Paxos as basically the same thing -- ZAB (and QJM) use something very close to
Paxos when they switch epochs. The main difference between QJM and ZAB is that ZAB actually
maintains full histories at each of the nodes, because it needs to implement a state machine
(the database state). In contrast, QJM allows a journal node to get kicked out for one segment,
then join again in the next segment even if it's missing some txns in between. This is OK
because it is not trying to maintain state, just act as storage, and IMO it makes things simpler.
This difference is enough that I don't think we should explicitly say that this is an implementation
of ZAB.
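
To illustrate the difference described above, here is a rough sketch of a journal node that misses one segment and rejoins at the next segment boundary. It is a toy model under stated assumptions, not the QJM implementation: the names (SegmentQuorumSketch, JournalNode, writeSegment, hasQuorum) are hypothetical, and epochs, fencing, and segment recovery are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model with hypothetical names; epochs, fencing, and recovery are omitted.
class SegmentQuorumSketch {

    /** A node is pure storage: it keeps whichever segments it took part in. */
    static final class JournalNode {
        final Map<Long, List<Long>> segments = new TreeMap<>(); // first txid -> txids
        boolean reachable = true;
        private boolean inCurrentSegment = false;

        /** A node may join at any segment boundary, regardless of earlier gaps. */
        boolean startSegment(long firstTxId) {
            inCurrentSegment = reachable;
            if (inCurrentSegment) {
                segments.put(firstTxId, new ArrayList<>());
            }
            return inCurrentSegment;
        }

        /** A node that fails mid-segment stays out until the next segment starts. */
        boolean journal(long segmentFirstTxId, long txId) {
            if (!reachable) {
                inCurrentSegment = false;
            }
            if (!inCurrentSegment) {
                return false;
            }
            segments.get(segmentFirstTxId).add(txId);
            return true;
        }
    }

    static boolean hasQuorum(int acks, int totalNodes) {
        return acks > totalNodes / 2;
    }

    /** Writes one segment; succeeds only if every step is acked by a majority. */
    static boolean writeSegment(List<JournalNode> nodes, long firstTxId, long lastTxId) {
        int acks = 0;
        for (JournalNode n : nodes) {
            if (n.startSegment(firstTxId)) acks++;
        }
        if (!hasQuorum(acks, nodes.size())) return false;
        for (long tx = firstTxId; tx <= lastTxId; tx++) {
            acks = 0;
            for (JournalNode n : nodes) {
                if (n.journal(firstTxId, tx)) acks++;
            }
            if (!hasQuorum(acks, nodes.size())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        JournalNode a = new JournalNode(), b = new JournalNode(), c = new JournalNode();
        List<JournalNode> nodes = Arrays.asList(a, b, c);
        c.reachable = false;                       // c is out for the first segment
        System.out.println(writeSegment(nodes, 1, 3));
        c.reachable = true;                        // c rejoins at the next boundary
        System.out.println(writeSegment(nodes, 4, 6));
        System.out.println("a has " + a.segments.keySet() + ", c has " + c.segments.keySet());
    }
}
```

In this sketch node c ends up holding only the second segment, with a gap where the first one would be. That is acceptable for a store of log segments, whereas a replicated state machine would need the full history on every node.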

To be perfectly frank, I'm not interested in changing the design substantially at this point
without a good reason. I've put several weeks into testing this design, and unless you can
find a counter-example or a bug, I am against changing it. If you want to do the work and
produce a patch which makes the code simpler, and it can pass 20,000 runs of the randomized
fault test, I'd be happy to review your patch. Or if you can point out a flaw in the current
design that's addressed by your proposed change, I'll do the work to address it. But as is,
I am confident that the design is correct and don't have more time to allocate to shifting
things around unless there's a bug or another real problem which would negatively affect its
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
>
> Currently, one of the weak points of the HA design is that it relies on shared storage
> such as an NFS filer for the shared edit log. One alternative that has been proposed is to
> depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
> edit log on commodity hardware. This JIRA is to implement another alternative, based on a
> quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
> by HDFS's needs rather than more generic use cases. More details to follow.
