hadoop-hdfs-issues mailing list archives

From "Ivan Kelly (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Wed, 14 Mar 2012 18:56:40 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229500#comment-13229500 ]

Ivan Kelly commented on HDFS-3077:

>  *  Re-uses existing Hadoop subsystems like IPC, security, and the file-based edit logging
> code. This means that it will be easier to maintain for the Hadoop development community,
> and easier to deploy for Hadoop operations.
>  *  Doesn't introduce a new dependency on an external project. If there is a bug discovered
> in this code, we can fix it with a new Hadoop release without having to wait on a new release
> of ZooKeeper. Since ZK and HDFS may be managed by different ops teams, this also simplifies

These arguments seem very much to be a case of NIH.

>  *  BookKeeper is a general system, whereas this is a specific system. Since BK tries to be
> quite general, it has extra complexity that we don't need. For example, it handles the interleaving
> of up to thousands of distinct edit logs into a single on-disk layout. These complexities
> are useful for a general "write-ahead log as a service" project, but not for our use case
> where even very large clusters have only a handful of distinct logs.

So the plan is to step around this complexity by implementing ZAB?

>  *  BookKeeper's commit protocol waits for all replicas to commit. This means that, should
> one of the bookies fail, one must wait for a rather lengthy timeout before continuing. Additionally,
> the latency of a commit is the maximum of the latency of the bookies, meaning that it's much
> less feasible to collocate bookies with other machines under load like DataNodes. A quorum
> commit protocol instead has a latency equal to the median of its replicas' latencies, allowing
> it to ride over transient slowness on the part of one of its replicas.

It would actually be very simple to change this within BookKeeper if needed. Instead of sending
to a quorum, you could send to the whole ensemble and wait for responses from a quorum. None of the
guarantees of BookKeeper would be broken, though throughput would obviously drop. Currently, with BookKeeper,
we're able to get higher throughput than when using a filer or a local file[1].
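
To make the latency comparison above concrete, here is a minimal sketch in Python. The function names and the sample latencies are made up for illustration and are not taken from BookKeeper or HDFS. It assumes each replica acknowledges after a fixed latency, and compares waiting for every replica with waiting for a simple majority.

```python
def all_replica_commit_latency(latencies):
    """Commit completes only when every replica has acknowledged."""
    return max(latencies)


def quorum_commit_latency(latencies, quorum=None):
    """Commit completes once `quorum` replicas have acknowledged.

    Defaults to a simple majority of the replicas (an assumption of this sketch).
    """
    n = len(latencies)
    if quorum is None:
        quorum = n // 2 + 1
    if not 1 <= quorum <= n:
        raise ValueError("quorum must be between 1 and the number of replicas")
    # The commit is acknowledged when the quorum-th fastest reply arrives.
    return sorted(latencies)[quorum - 1]


# Hypothetical per-replica latencies in milliseconds; one replica is slow.
latencies_ms = [2.0, 3.0, 250.0]
print(all_replica_commit_latency(latencies_ms))  # 250.0
print(quorum_commit_latency(latencies_ms))       # 3.0
```

With an odd number of replicas and a majority quorum, the quorum-th fastest reply is the median, which is why one slow replica does not hold up the commit.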

Also, I don't think ZAB is the right tool for this in any case. You have a single writer,
which can therefore act as a sequencer on the entries. You just need to broadcast to an ensemble,
and wait for quorum responses, as I outlined above for BookKeeper.
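
As a rough illustration of the single-writer approach described above, the following Python sketch has one writer assign sequence numbers, broadcast each entry to every replica, and return once a quorum has acknowledged. `InMemoryReplica` and `SingleWriter` are hypothetical names for this sketch only, not BookKeeper or HDFS APIs, and recovery, fencing, and per-replica ordering are deliberately left out.

```python
import concurrent.futures
import time


class InMemoryReplica:
    """Hypothetical stand-in for a journal replica (not a real API)."""

    def __init__(self, delay_s=0.0):
        self.delay_s = delay_s
        self.entries = {}

    def append(self, seq, data):
        time.sleep(self.delay_s)  # simulate network and disk latency
        self.entries[seq] = data
        return True


class SingleWriter:
    """The only writer, so it can act as the sequencer for all entries."""

    def __init__(self, replicas, quorum):
        self.replicas = replicas
        self.quorum = quorum
        self.next_seq = 0
        self.pool = concurrent.futures.ThreadPoolExecutor(
            max_workers=len(replicas))

    def write(self, data):
        seq = self.next_seq
        self.next_seq += 1
        # Broadcast to the whole ensemble in parallel.
        futures = [self.pool.submit(r.append, seq, data)
                   for r in self.replicas]
        acks = 0
        for fut in concurrent.futures.as_completed(futures):
            try:
                if fut.result():
                    acks += 1
            except Exception:
                pass  # a failed replica simply does not count as an ack
            if acks >= self.quorum:
                return seq  # do not wait for the stragglers
        raise RuntimeError("quorum not reached for entry %d" % seq)


# Two fast replicas and one slow one; the write should not wait for the slow one.
replicas = [InMemoryReplica(), InMemoryReplica(), InMemoryReplica(0.2)]
writer = SingleWriter(replicas, quorum=2)
start = time.monotonic()
seq = writer.write(b"edit-0")
print(seq, time.monotonic() - start)
```

The slow replica still receives the entry in the background; the writer just does not block on it.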

[1] http://people.apache.org/~ivank/tpt_mar14.pdf
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
> Currently, one of the weak points of the HA design is that it relies on shared storage
> such as an NFS filer for the shared edit log. One alternative that has been proposed is to
> depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
> edit log on commodity hardware. This JIRA is to implement another alternative, based on a
> quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
> by HDFS's needs rather than more generic use cases. More details to follow.


