Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Wed, 19 Sep 2012 08:45:11 +1100 (NCT)
From: "Todd Lipcon (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <1571212749.94679.1348004711977.JavaMail.jiratomcat@arcas>
In-Reply-To: <4335590.3899.1331577879305.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading
 and writing edit logs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458208#comment-13458208 ] 

Todd Lipcon commented on HDFS-3077:
-----------------------------------

>> given we already have the journal daemons, it's trivial to generate unique increasing sequence IDs
> But may still be unnecessary. May be during the code review I might find indeed it is trivial.

This code has been committed on the branch for about 2 months, and the relevant patch was first on this JIRA on April 2nd. I think it's a bit late to consider this fundamental of a re-structure now.

bq. In this case, you have leader/active(to loosely to put it) elected at zk and then active has to establish epoch at znodes to become primary. Both of this needs to be complete before an active becomes functional. Given the "two things" that needs to happen, is a situation possible when one NN is active at zk while not the primary at the journal nodes and the other NN is not active at zk while is a primary at journal nodes

No, this is not possible, since NNs don't try to "re-acquire writer status" (i.e start a new epoch) once they've lost it. So, even if a node thinks it is active, if another node is _actually_ active, the first node will fail the next time it tries to write. This will cause it to abort, regardless of whether ZK has told it to be active or not.

Since I think it's clearer to explain with a couple examples:

Example 1: manual failover (simplest case, doesn't depend on ZK at all)

1. NN1 is active. NN2 is standby.
2. Admin issues a "failover" command, but for some reason the admin is partitioned from NN1. So, NN1 remains in Active mode, while NN2 also enters active mode.
3. NN2, upon entering active mode, starts a new epoch on the JournalNodes.
4. NN1, upon the next time it tries to perform a write, gets back an exception from a quorum of nodes that its epoch is too old. Since it could not logSync() and the shared edits dir is marked "required", it aborts.

Example 2: automatic failover with ZK and network partitions

1. NN1 is active. NN2 is standby.
2. NN1 becomes partitioned from ZooKeeper. Thus, it receives a ZooKeeper "Disconnected" event. Because "Disconnected" is not the same as "Expired", NN1 does not immediately transition to standby. Instead, it stays in its current state (active). Because it can still reach the JNs, it can continue writing.
3. NN2 is still connected to ZK, and thus sees that NN1's ephemeral node has disappeared (after the ZK session timeout elapses). It then transitions itself to active.
4. NN2, upon becoming active, starts a new epoch at the JournalNodes. As soon as this happens, NN1 may no longer write, and aborts.

Note that in both cases, even though NN1 can still reach a quorum of JNs, it doesn't try to start a new epoch after it has been fenced.

Does that address the concern?
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on shared storage such as an NFS filer for the shared edit log. One alternative that has been proposed is to depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated edit log on commodity hardware. This JIRA is to implement another alternative, based on a quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only by HDFS's needs rather than more generic use cases. More details to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira