Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Mon, 6 Feb 2012 21:06:59 +0000 (UTC)
From: "Todd Lipcon (Commented) (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: 
 <228613664.3647.1328562419463.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <951899290.32159.1326324110086.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-2782) HA: Support multiple shared edits
 dirs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201583#comment-13201583 ] 

Todd Lipcon commented on HDFS-2782:
-----------------------------------

Been thinking about this a bit... I don't think it's so trivial as to just use our existing JournalSet implementation across multiple remote directories. The issue is the following case:
- HA cluster configured with two NNs: NN1 and NN2, and two shared storage directories (SD1 and SD2)
- NN1 is active and writing to SD1 and SD2 when a network issue occurs which partitions NN1 from SD2 and NN2 from SD1 (this isn't that unlikely, for example if NN1 and SD1 share a rack while NN2 and SD2 share a rack).
- If a failover occurs at this point, NN2 could take over without reading the most recent edits from SD1, resulting in a divergent namespace.

I have two possible solutions in mind:

*Solution 1: use a traditional quorum system*

When a NN writes, require that all writes must be synced to _W_ shared storage directories. When the SBN reads, require that it read edits from _R_ shared storage directories. The quorum requirement is that _R + W > N_ where _N_ is the total number of shared dirs.

In the error case described above, we could set W = 1 and R = 2. Thus the active NN could continue to operate even if one of the SDs is down -- but no failovers could occur during that window of time. Once the directory is restored, the system would be fully operational again and ready for failover.

The usual quorum requirement that W > N/2 would not be necessary here, since we already ensure a designed single writer by fencing operations.

*Solution 2: use an external source of record to agree on storage directory state.*

In this solution, an external system (likely zookeeper) is used to agree upon the state of the storage directories. ZK would contain a znode which lists the active shared directories. When the active NN writes, it writes to all of these active directories. If any of the writes fail, it must update the znode to mark the failed directory as out-of-date before acking the write.

When a directory is restored, it will be re-added to the znode listing active directories.

When the SBN processes a failover, it considers only those directories listed in the znode as active.


Any other solutions I'm not thinking of here?
                
> HA: Support multiple shared edits dirs
> --------------------------------------
>
>                 Key: HDFS-2782
>                 URL: https://issues.apache.org/jira/browse/HDFS-2782
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>
> Supporting multiple shared dirs will improve availability (eg see HDFS-2769). You may want to use multiple shared dirs on a single filer (eg for better fault isolation) or because you want to use multiple filers/mounts. Per HDFS-2752 (and HDFS-2735) we need to do things like use the JournalSet in EditLogTailer and add tests.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira