hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Tue, 18 Sep 2012 21:45:11 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458208#comment-13458208
] 

Todd Lipcon commented on HDFS-3077:
-----------------------------------

>> given we already have the journal daemons, it's trivial to generate unique increasing
sequence IDs
> But may still be unnecessary. May be during the code review I might find indeed it is
trivial.

This code has been committed on the branch for about 2 months, and the relevant patch was
first on this JIRA on April 2nd. I think it's a bit late to consider this fundamental of a
re-structure now.

bq. In this case, you have leader/active(to loosely to put it) elected at zk and then active
has to establish epoch at znodes to become primary. Both of this needs to be complete before
an active becomes functional. Given the "two things" that needs to happen, is a situation
possible when one NN is active at zk while not the primary at the journal nodes and the other
NN is not active at zk while is a primary at journal nodes

No, this is not possible, since NNs don't try to "re-acquire writer status" (i.e start a new
epoch) once they've lost it. So, even if a node thinks it is active, if another node is _actually_
active, the first node will fail the next time it tries to write. This will cause it to abort,
regardless of whether ZK has told it to be active or not.

Since I think it's clearer to explain with a couple examples:

Example 1: manual failover (simplest case, doesn't depend on ZK at all)

1. NN1 is active. NN2 is standby.
2. Admin issues a "failover" command, but for some reason the admin is partitioned from NN1.
So, NN1 remains in Active mode, while NN2 also enters active mode.
3. NN2, upon entering active mode, starts a new epoch on the JournalNodes.
4. NN1, upon the next time it tries to perform a write, gets back an exception from a quorum
of nodes that its epoch is too old. Since it could not logSync() and the shared edits dir
is marked "required", it aborts.

Example 2: automatic failover with ZK and network partitions

1. NN1 is active. NN2 is standby.
2. NN1 becomes partitioned from ZooKeeper. Thus, it receives a ZooKeeper "Disconnected" event.
Because "Disconnected" is not the same as "Expired", NN1 does not immediately transition to
standby. Instead, it stays in its current state (active). Because it can still reach the JNs,
it can continue writing.
3. NN2 is still connected to ZK, and thus sees that NN1's ephemeral node has disappeared (after
the ZK session timeout elapses). It then transitions itself to active.
4. NN2, upon becoming active, starts a new epoch at the JournalNodes. As soon as this happens,
NN1 may no longer write, and aborts.

Note that in both cases, even though NN1 can still reach a quorum of JNs, it doesn't try to
start a new epoch after it has been fenced.

Does that address the concern?
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message