hadoop-hdfs-issues mailing list archives

From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3092) Enable journal protocol based editlog streaming for standby namenode
Date Thu, 12 Apr 2012 18:55:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252733#comment-13252733 ]

Bikas Saha commented on HDFS-3092:

The combination of this and HDFS-3077 sounds very much like ZAB, with one difference: the use
of 2-phase commit for the write broadcast.

Let's say there is 1 active NN writing to a quorum set of journal daemons. This is the same
as ZAB: the active NN writes edits just as the ZAB leader writes new states.

ZAB uses a 2-phase commit (without abort) for each write, while our design gets away without
it. I am wondering why we can get away with it.

My guess is that each follower in ZAB can also serve reads from clients. Hence, it cannot
serve an update until it is guaranteed that a quorum of followers has agreed on that update.
That is what the 2-phase commit gives.
In our case, the active NN is the only server for client reads. Hence, updates are not served
to clients until a quorum acks back.
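To make that concrete, here is a minimal toy sketch (all names invented for illustration; this is not HDFS or ZAB code) of why a plain quorum ack suffices when the writer is also the only server of reads: an edit becomes visible to readers only after a majority of journals has durably stored it, so no second commit phase is needed.

```python
# Hypothetical sketch (not actual HDFS code): the single writer acknowledges
# an edit, and makes it readable, only after a majority of journals ack it.

class Journal:
    def __init__(self):
        self.edits = []          # edits durably stored at this journal

    def write(self, edit):
        self.edits.append(edit)  # durable append; returns an ack
        return True

class ActiveNN:
    """The single writer, and the only server of client reads."""
    def __init__(self, journals):
        self.journals = journals
        self.committed = []      # edits visible to client reads

    def apply(self, edit):
        acks = sum(1 for j in self.journals if j.write(edit))
        # Quorum ack: the edit becomes readable only once a majority has it.
        if acks * 2 > len(self.journals):
            self.committed.append(edit)
            return True
        return False

jns = [Journal() for _ in range(3)]
nn = ActiveNN(jns)
nn.apply("edit-1")               # 3/3 acks, a majority
print(nn.committed)
```

Since readers never contact the journals directly here, the journals themselves never need to know the commit point, which is exactly what the 2-phase commit in ZAB exists to tell the followers.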

However, the above would break for us if the standby NN uses any single journal daemon to
refresh its state. Ideally, a journal daemon should not inform the standby about an update
until it knows that the update has been accepted by a quorum of journal daemons, and that
would require a 2-phase commit.
E.g. standby NN3 reads the last edit written to JN1 by old active NN1, before NN1 realizes
that it has lost the quorum to NN2 (by failing to write to JN2 and JN3).
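A toy illustration of that hazard (hypothetical names and values, not HDFS code): the old active's last edit reaches only JN1, yet a standby tailing JN1 directly would still see it, while the quorum that elects the new active never does.

```python
# Hypothetical sketch of the stale-read hazard. Old active NN1 writes its
# last edit to JN1 only; writes to JN2 and JN3 fail, so quorum is lost.
jn1, jn2, jn3 = [], [], []

jn1.append("edit-7")             # reached JN1 ...
# ... but never reached JN2 or JN3: "edit-7" was never quorum-committed.

# A standby tailing JN1 directly sees the uncommitted edit anyway:
seen_by_standby = jn1[-1]        # unsafe read

# The new active, elected via the JN2/JN3 quorum, never saw it:
quorum_view = set(jn2) | set(jn3)
print(seen_by_standby in quorum_view)   # the standby and the quorum disagree
```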

Perhaps we can get away with this by making some assumptions about timeouts, or by placing
additional constraints on the standby, e.g. that it only syncs with finalized edit segments.

If we say that the standby syncs with only the finalized log segments in order to be safe from
the above, then IMO the tailing of the edits should not be done by the standby directly but
via a journal daemon API for the standby. This JD API would ensure that only valid edits are
sent to the standby (edits from finalized segments, or edits known to be safely committed to
a quorum of journal daemons). This way the correctness of the journal protocol would remain
inside it, instead of leaking into the standby by having the standby code remember the rules
for tailing edits.
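One way to picture such an API (a sketch with invented names, not a proposed implementation): the journal daemon tracks the quorum commit point and simply refuses to hand the standby anything beyond it, so the tailing rules live entirely inside the journal layer.

```python
# Hypothetical sketch: a journal daemon API that serves the standby only
# edits known to be quorum-committed. All names are invented for illustration.

class JournalDaemon:
    def __init__(self):
        self.edits = []              # (txid, edit) pairs stored locally
        self.committed_txid = 0      # highest txid known acked by a quorum

    def write(self, txid, edit):
        self.edits.append((txid, edit))

    def advance_commit(self, txid):
        # The active NN notifies the JN once a quorum has acked up to txid.
        self.committed_txid = max(self.committed_txid, txid)

    def get_edits_for_standby(self, from_txid):
        # Expose only edits at or below the commit point: the standby never
        # needs to know the rules for deciding which edits are safe.
        return [(t, e) for t, e in self.edits
                if from_txid <= t <= self.committed_txid]

jd = JournalDaemon()
jd.write(1, "mkdir /a")
jd.write(2, "mkdir /b")              # stored locally, not yet quorum-committed
jd.advance_commit(1)
print(jd.get_edits_for_standby(1))   # txid 2 is withheld from the standby
```

The same interface covers the finalized-segment variant: finalization would just advance the commit point to the end of the finalized segment.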

> Enable journal protocol based editlog streaming for standby namenode
> --------------------------------------------------------------------
>                 Key: HDFS-3092
>                 URL: https://issues.apache.org/jira/browse/HDFS-3092
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, name-node
>    Affects Versions: 0.24.0, 0.23.3
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>         Attachments: MultipleSharedJournals.pdf, MultipleSharedJournals.pdf, MultipleSharedJournals.pdf
> Currently the standby namenode relies on reading the shared editlogs to stay current with
the active namenode for namespace changes. BackupNode used streaming of edits from the active
namenode for the same purpose. This jira is to explore using journal protocol based editlog
streams for the standby namenode. A daemon in the standby will get the editlogs from the
active and write them to local edits. To begin with, the existing standby mechanism of reading
from a file will continue to be used, but reading from the local edits instead of from the
shared edits.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

