hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Wed, 05 Sep 2012 20:37:09 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449101#comment-13449101 ]

Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. Hey Todd, I have not looked at the work in this branch in a while. One thing I wanted
to ask you about is, why are we using journal daemons to decide on an epoch? Could zookeeper
be used for doing the same? What are the advantages of using journal daemons instead of zk?
Adding this information to the document might also be useful.

Certainly you could use ZK to generate an increasing sequence ID to decide on an epoch. But,
given we already have the journal daemons, it's trivial to generate unique increasing sequence
IDs without using an external dependency. The protocol is very simple:
- ask each of the JNs for their highest epoch seen
- set local epoch to one higher than the highest seen from any JN
- ask JNs to promise you that epoch

If you succeed on a quorum, then no one else can successfully achieve a quorum on the same
epoch number. If you don't succeed, that means you raced with some other writer. At that point
you could either retry or just fail.
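
The three steps above can be sketched as follows. This is a minimal in-memory illustration, not the QuorumJournalManager code: the class JournalNode, its methods, and negotiate_epoch are hypothetical names, and the sketch assumes every JN is reachable and answers synchronously.

```python
class JournalNode:
    """Toy in-memory stand-in for a JN (hypothetical, not the HDFS class)."""

    def __init__(self):
        self.promised_epoch = 0  # highest epoch this JN has promised so far

    def get_highest_epoch(self):
        return self.promised_epoch

    def promise(self, epoch):
        # Grant the promise only if the epoch is strictly newer than any
        # epoch already promised, so one JN never promises an epoch twice.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False


def negotiate_epoch(journal_nodes):
    """Return the new epoch if a quorum promised it, otherwise None."""
    quorum = len(journal_nodes) // 2 + 1
    # Step 1: ask each JN for the highest epoch it has seen.
    highest = max(jn.get_highest_epoch() for jn in journal_nodes)
    # Step 2: set the local epoch to one higher than the highest seen.
    proposed = highest + 1
    # Step 3: ask the JNs to promise that epoch and count the grants.
    promises = sum(1 for jn in journal_nodes if jn.promise(proposed))
    return proposed if promises >= quorum else None
```

With three fresh JNs, two successive calls to negotiate_epoch return 1 and then 2. A writer that raced and proposes an epoch the JNs have already promised gets no grants from them, so it cannot reach a quorum on that number and has to retry or fail.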

There is a stress test that verifies that this protocol works correctly - please see TestEpochsAreUnique.

As for the advantages of not depending on ZooKeeper, my experience working with ZK in the
context of the HBase Master has convinced me that it's not a panacea for situations like this.
One of the biggest issues we've had in the HBase Master design is loss of synchronization
between what is the truth in ZooKeeper vs what the individual participants think is the truth.
ZooKeeper's consistency semantics are that different clients, when connected to different
nodes in the quorum, may be arbitrarily "behind" in their view of the data. This means that,
even if we update an epoch number in ZooKeeper, for example, one of the JNs may not receive
the update for some number of seconds, and can continue to accept writes from previous writers.
So, we still have to deal with fencing and all of these quorum protocols on our own, and I
don't think ZK provides much for us.
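
To make the fencing point concrete, here is a rough sketch of a JN that enforces the epoch locally on every write, so it does not matter how stale any external view of the epoch is. The class FencedJournalNode and its methods are hypothetical, and accepting a write from a newer epoch than the one promised is an assumption of this sketch rather than a statement about the actual implementation.

```python
class FencedJournalNode:
    """Toy JN that fences old writers by epoch (hypothetical sketch)."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def promise(self, epoch):
        # Same rule as in epoch negotiation: only strictly newer epochs win.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def write(self, writer_epoch, edit):
        # Reject any writer whose epoch is older than the promised one.
        # The check uses only local state, so no external lookup can lag.
        if writer_epoch < self.promised_epoch:
            return False
        # Assumption of this sketch: a newer epoch also raises the promise.
        self.promised_epoch = writer_epoch
        self.edits.append(edit)
        return True
```

Once the JN has promised epoch 2, a write stamped with epoch 1 is refused immediately, even if the old writer has not yet learned that it was superseded.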

The other advantage of building this as a self-contained system is that it's easier for us
to test and debug. For example, the randomized test cases have been set up so that the entire
system runs single-threaded and, given a random seed, can reproduce a given set of dropped
messages. This would be very hard to implement on top of ZooKeeper where all of the messaging
is opaque to our purposes.
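
The idea of reproducing dropped messages from a seed can be illustrated with a small sketch. It is not the actual test harness: the function simulate and its parameters are hypothetical, and the only point is that a single-threaded run driven by a privately seeded random generator makes the same drop decisions every time.

```python
import random


def simulate(seed, num_messages=20, drop_probability=0.3):
    """Return the indices of messages dropped in one single-threaded run."""
    # A private generator means the run depends only on the seed,
    # not on any other code that touches the global random state.
    rng = random.Random(seed)
    dropped = []
    for i in range(num_messages):
        # Decide, message by message, whether this one is lost.
        if rng.random() < drop_probability:
            dropped.append(i)
    return dropped
```

Calling simulate twice with the same seed yields the same list of dropped messages, so a failure found under one seed can be replayed exactly by rerunning with that seed.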

The third thing I'll mention is what I informally call the "two things" problem: when you
have some data in ZK, and some data on the JNs, it's possible that the two could get out of
sync. For example, if an administrator accidentally reformats ZooKeeper, our fencing guarantees
will no longer hold. So, we have to guard against this, add code to reformat ZK safely,
etc. Another example situation is to consider what happens when a NN is partitioned from the
majority of the ZooKeeper nodes but not partitioned from a majority of the JournalNodes. Should
it stop writing? If the other NN can reach a quorum of ZK but not a quorum of JN, should it
begin writing? Or should the whole system stop in its tracks? If the whole system stops, then
we have introduced an availability dependency on ZooKeeper such that no edits may be made
while ZK is down. That would leave us worse off than we are today, when we can continue
operating while ZK is down (though we can't process a new failover).

So, to summarize, while I think ZK can reduce complexity for a lot of applications, in this
case I prefer the control from "doing it ourselves". We already have to build all of the quorum
counting infrastructure, etc., and I don't see what there is to gain from the extra dependency.
Hope all of the above makes sense!
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
>                      hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
>                      qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on shared storage
> such as an NFS filer for the shared edit log. One alternative that has been proposed is to
> depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
> edit log on commodity hardware. This JIRA is to implement another alternative, based on a
> quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
> by HDFS's needs rather than more generic use cases. More details to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
