Date: Mon, 6 Feb 2012 21:06:59 +0000 (UTC)
From: "Todd Lipcon (Commented) (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-2782) HA: Support multiple shared edits dirs

[ 
https://issues.apache.org/jira/browse/HDFS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201583#comment-13201583 ]

Todd Lipcon commented on HDFS-2782:
-----------------------------------

Been thinking about this a bit... I don't think it's as trivial as just using our existing JournalSet implementation across multiple remote directories. The issue is the following case:

- An HA cluster is configured with two NNs (NN1 and NN2) and two shared storage directories (SD1 and SD2).
- NN1 is active and writing to SD1 and SD2 when a network issue occurs which partitions NN1 from SD2 and NN2 from SD1. (This isn't that unlikely, for example if NN1 and SD1 share a rack while NN2 and SD2 share a rack.)
- If a failover occurs at this point, NN2 could take over without reading the most recent edits from SD1, resulting in a divergent namespace.

I have two possible solutions in mind:

*Solution 1: use a traditional quorum system*

When a NN writes, require that every write be synced to _W_ shared storage directories. When the SBN reads, require that it read edits from _R_ shared storage directories. The quorum requirement is that _R + W > N_, where _N_ is the total number of shared dirs. In the error case described above, we could set _W_ = 1 and _R_ = 2. Thus the active NN could continue to operate even if one of the SDs is down, but no failovers could occur during that window of time. Once the directory is restored, the system would be fully operational again and ready for failover. The usual quorum requirement that _W > N/2_ would not be necessary here, since fencing already ensures a single designated writer.

*Solution 2: use an external source of record to agree on storage directory state.*

In this solution, an external system (likely ZooKeeper) is used to agree upon the state of the storage directories. ZK would contain a znode which lists the active shared directories.
When the active NN writes, it writes to all of these active directories. If any of the writes fail, it must update the znode to mark the failed directory as out-of-date before acking the write. When a directory is restored, it is re-added to the znode listing the active directories. When the SBN processes a failover, it considers only those directories listed in the znode as active.

Any other solutions I'm not thinking of here?

> HA: Support multiple shared edits dirs
> --------------------------------------
>
> Key: HDFS-2782
> URL: https://issues.apache.org/jira/browse/HDFS-2782
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ha
> Affects Versions: HA branch (HDFS-1623)
> Reporter: Eli Collins
> Assignee: Eli Collins
>
> Supporting multiple shared dirs will improve availability (e.g. see HDFS-2769). You may want to use multiple shared dirs on a single filer (e.g. for better fault isolation) or because you want to use multiple filers/mounts. Per HDFS-2752 (and HDFS-2735) we need to do things like use the JournalSet in EditLogTailer and add tests.
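
To make the quorum rule from Solution 1 concrete, here is a rough Python sketch of the _R + W > N_ check applied to the partition scenario in the comment. It is only an illustration: the function names and the sets of reachable directories are hypothetical and are not part of HDFS or JournalSet.

```python
# Illustrative sketch only; none of these names exist in HDFS.

def quorum_overlaps(n, w, r):
    """True when any r directories read must include one of the w written."""
    return r + w > n


def can_ack_write(reachable_dirs, w):
    """The active NN may ack an edit once it is synced to at least w dirs."""
    return len(reachable_dirs) >= w


def can_fail_over(reachable_dirs, r):
    """The standby may take over only if it can read from at least r dirs."""
    return len(reachable_dirs) >= r


# Scenario from the comment: N = 2 shared dirs, W = 1, R = 2.
N, W, R = 2, 1, 2

# During the partition, NN1 reaches only SD1 and NN2 reaches only SD2.
nn1_reachable = {"SD1"}
nn2_reachable = {"SD2"}

print(quorum_overlaps(N, W, R))         # R + W = 3 > 2
print(can_ack_write(nn1_reachable, W))  # NN1 keeps writing to SD1
print(can_fail_over(nn2_reachable, R))  # NN2 cannot read 2 dirs, so no failover
```

With W = 1 and R = 2 the active NN stays available during the partition, while NN2 is blocked from taking over until it can read both directories again, which is the trade-off the comment describes.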