hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Thu, 15 Mar 2012 22:14:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230629#comment-13230629 ]

Suresh Srinivas commented on HDFS-3077:

bq. but like Einstein said, no simpler!
It's all relative :-)

BTW, it would be good to write a design document for this. That avoids lengthy comments and keeps
the summary of what is proposed in one place, instead of scattering it across multiple comments.

bq. This is mostly great – so long as you have an external fencing strategy which prevents
the old active from attempting to continue to write after the new active is trying to read.

External fencing is not needed, given that active daemons have the ability to fence.

bq. it gets the loggers to promise not to accept edits from the old active
A daemon can stop accepting writes when it realizes that the active lock is no longer held
by the writer. This is a clear advantage of an active daemon over passive storage.
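
To illustrate the point about daemon-side fencing, here is a minimal sketch, not part of the proposal itself. The class name `JournalDaemon`, the `write` method, and the `lock_holder_lookup` callable (standing in for however the daemon learns who holds the active lock) are all hypothetical.

```python
class JournalDaemon:
    """Hypothetical journal daemon that fences writers by itself."""

    def __init__(self, lock_holder_lookup):
        # lock_holder_lookup is an assumed callable that returns the id of
        # the writer currently holding the active lock.
        self._lock_holder = lock_holder_lookup
        self.edits = []

    def write(self, writer_id, txid, record):
        # Refuse the edit if the writer no longer holds the active lock.
        if self._lock_holder() != writer_id:
            return False
        self.edits.append((txid, record))
        return True


# Usage: the old active is rejected as soon as the lock moves.
holder = {"id": "nn1"}
daemon = JournalDaemon(lambda: holder["id"])
daemon.write("nn1", 1, "mkdir /a")   # accepted
holder["id"] = "nn2"                 # failover: nn2 takes the lock
daemon.write("nn1", 2, "mkdir /b")   # rejected, nn1 is fenced
```

Passive storage such as an NFS filer cannot make this check on its own, which is why it needs an external fencing step.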

bq. But, we still have one more problem: given some txid N, we might have multiple actives
that have tried to write the same transaction ID. Example scenario:
The case of writes making it through only some daemons can also be solved. The write that has
made it through W daemons wins. The other daemons are marked not in sync and need to sync up.
Explanation to follow.
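
As a rough sketch of the "W daemons win" rule, under assumptions that the comment does not spell out: the function name `resolve_txid` and its input shape are hypothetical, and W is assumed to be a majority of the daemons, so at most one writer can reach it for a given txid.

```python
from collections import Counter


def resolve_txid(copies, w):
    """Pick the winning writer for one txid.

    copies maps a daemon name to the id of the writer whose record that
    daemon holds for the txid. Returns (winner, out_of_sync_daemons);
    winner is None when no writer reached w daemons.
    """
    counts = Counter(copies.values())
    winner = next((writer for writer, n in counts.items() if n >= w), None)
    if winner is None:
        return None, []
    # Daemons holding a different writer's record must sync up.
    out_of_sync = sorted(d for d, writer in copies.items() if writer != winner)
    return winner, out_of_sync


# Usage: nn1 reached 2 of 3 daemons, so jd3 is marked not in sync.
print(resolve_txid({"jd1": "nn1", "jd2": "nn1", "jd3": "nn2"}, w=2))
```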

The solution we are building is specific to namenode editlogs. There is only one active writer
(as Ivan brought up earlier). Here is the outline I am thinking of.

Let's start with a steady state with K of N journal daemons. When a journal daemon fails, we
roll the edits. When a journal daemon joins, we roll the edits. A new journal daemon could start
syncing the other finalized edits while keeping track of the edits in progress. We also keep track
of the list of active daemons in ZooKeeper. Rolling gives a logical point for a newly joined
daemon to sync up (sort of like a generation stamp).
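
The roll-on-membership-change idea can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: `JournalSet` and its methods are hypothetical names, and the in-memory `members` set merely stands in for the list of active daemons kept in ZooKeeper.

```python
class JournalSet:
    """Hypothetical writer-side view of the journal daemons."""

    def __init__(self, members):
        self.members = set(members)   # stand-in for the list kept in ZooKeeper
        self.segment_start = 1        # first txid of the in-progress segment
        self.next_txid = 1
        self.finalized = []           # finalized segments as (first, last)

    def log_edit(self):
        self.next_txid += 1

    def roll(self):
        # Finalize the in-progress segment, if it holds any edits.
        if self.next_txid > self.segment_start:
            self.finalized.append((self.segment_start, self.next_txid - 1))
            self.segment_start = self.next_txid

    def join(self, daemon):
        # Rolling gives the new daemon a clean point to sync up from.
        self.roll()
        self.members.add(daemon)

    def fail(self, daemon):
        self.roll()
        self.members.discard(daemon)
```

A daemon that joins after the roll only needs to copy the finalized segments and can take part in the new in-progress segment from its first txid.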
During failover, the new active gets, from the actively written journals, the point up to which
it has to sync. It then also rolls the edits to that point. Rolling also gives you a
way to discard, during failover, extra journal records that made it to fewer than W daemons. When
there are overlapping records, say e1-105 and e100-200, you read 100-105 from the second editlog
and discard them from the first editlog.
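
The overlap rule in the e1-105 / e100-200 example can be sketched like this. The function name `trim_overlaps` and the representation of an editlog as a `(first_txid, last_txid)` pair are assumptions made for illustration only.

```python
def trim_overlaps(segments):
    """Return the txid range to read from each edit log segment.

    segments is a list of (first_txid, last_txid) pairs ordered from
    oldest to newest. Where two segments overlap, the newer one wins and
    the overlapping tail of the older one is discarded.
    """
    result = []
    for i, (first, last) in enumerate(segments):
        if i + 1 < len(segments):
            next_first = segments[i + 1][0]
            # Cut the older segment just before the newer one begins.
            last = min(last, next_first - 1)
        if first <= last:
            result.append((first, last))
    return result


# Usage: txids 100-105 are read from the second editlog only.
print(trim_overlaps([(1, 105), (100, 200)]))  # [(1, 99), (100, 200)]
```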

Again there are scenarios that are missing here. I plan to post more details in a design on
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.
