From: Milind Bhandarkar <mbhandarkar@gopivotal.com>
Date: Sun, 6 Oct 2013 12:54:15 -0700
Subject: RE: [Proposal] Pluggable Namespace
To: hdfs-dev@hadoop.apache.org

Vinod,

Block pool management separation makes this effort easier. Even with that
separation, the namespace implementation is still embedded within the
namenode, federated or otherwise. This effort is much less ambitious: all
it attempts to do is allow different namespace implementations.

- milind

-----Original Message-----
From: Vinod Kumar Vavilapalli [mailto:vinodkv@hortonworks.com]
Sent: Sunday, October 06, 2013 12:21 PM
To: hdfs-dev@hadoop.apache.org
Subject: Re: [Proposal] Pluggable Namespace

In order to make federation happen, the block pool management was already
separated. Isn't that the same as this effort?
Thanks,
+Vinod

On Oct 6, 2013, at 9:35 AM, Milind Bhandarkar wrote:

> Federation is orthogonal to Pluggable Namespaces. That is, one can
> use Federation if needed, even while a distributed K-V store is used
> on the backend.
>
> Limitations of the federated namenode for scaling the namespace are
> well-documented in several places, including the Giraffa presentation.
>
> HBase is only one of several possible namespace implementations.
> Thus, if an HBase-based namespace implementation does not fit your
> performance needs, you have the choice of using something else.
>
> - milind
>
> -----Original Message-----
> From: Azuryy Yu [mailto:azuryyyu@gmail.com]
> Sent: Saturday, October 05, 2013 6:41 PM
> To: hdfs-dev@hadoop.apache.org
> Subject: Re: [Proposal] Pluggable Namespace
>
> Hi Milind,
>
> HDFS federation can solve the NN bottleneck and memory limit problems.
>
> The AbstractNameSystem design sounds good, but distributed metadata
> storage using HBase may bring performance degradation.
>
> On Oct 4, 2013 3:18 AM, "Milind Bhandarkar" wrote:
>
>> Hi All,
>>
>> Exec Summary: For the last couple of months, we at Pivotal, along
>> with a couple of folks in the community, have been working on making
>> the namespace implementation in the namenode pluggable. We have
>> demonstrated that it can be done without major surgery on the
>> namenode, and that it does not have a noticeable performance impact.
>> We would like to contribute it back to Apache if there is sufficient
>> interest. Please let us know if you are interested, and we will
>> create a Jira and update the patch for in-progress work.
>>
>> Rationale:
>>
>> In a Hadoop cluster, the namenode roughly has the following main
>> responsibilities:
>> - Catering to RPC calls from clients.
>> - Managing the HDFS namespace tree.
>> - Managing block reports, heartbeats, and other communication from
>>   data nodes.
>>
>> For Hadoop clusters with a large number of files and a large number
>> of nodes, the namenode becomes a bottleneck, mainly because:
>> - All the information is kept in the namenode's main memory.
>> - The namenode has to cater to all the requests from clients and
>>   data nodes.
>> - It also has to perform some operations for the backup and
>>   checkpointing nodes.
>>
>> A possible solution is to add more main memory, but there are
>> certain issues with this approach:
>> - The namenode being a Java application, garbage collection cycles
>>   execute periodically to reclaim unreferenced heap space. When the
>>   heap grows very large, regardless of the GC policy chosen, the
>>   application stalls during GC activity. This creates a bunch of
>>   issues, since DNs and clients may perceive this stall as an NN
>>   crash.
>> - There will always be a practical limit on how much physical memory
>>   a single machine can accommodate.
>>
>> Proposed Solution:
>>
>> Out of the three responsibilities listed above, we can refactor
>> namespace management out of the namenode codebase in such a way that
>> there is a provision to implement and plug in name systems other
>> than the existing in-process, memory-based name system. In
>> particular, a name system backed by a distributed key-value store
>> would significantly reduce namenode memory requirements. To achieve
>> this, a new generic interface will be introduced [let's call it
>> AbstractNameSystem] which defines the set of operations through
>> which we perform namespace management. Namenode code that used to
>> manipulate Java objects maintained in the namenode's heap will now
>> operate on this interface.
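A minimal sketch of what such an AbstractNameSystem contract could look
like, in Java. The method set and signatures here are illustrative
assumptions, not the interface from the actual patch, which would have to
cover the full namespace API (permissions, leases, block allocation, and
so on):

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    // Hypothetical sketch: method names and signatures are illustrative
    // assumptions, not the interface from the patch described above.
    public abstract class AbstractNameSystem {

      // Create a file entry at the given path; false if it already exists.
      public abstract boolean createFile(String path, short replication,
                                         long blockSize) throws IOException;

      // Create a directory, including any missing parents.
      public abstract boolean mkdirs(String path) throws IOException;

      // Rename a file or directory.
      public abstract boolean rename(String src, String dst)
          throws IOException;

      // Delete a file, or a directory subtree when recursive is set.
      public abstract boolean delete(String path, boolean recursive)
          throws IOException;

      // List the names of a directory's children; null if path is absent.
      public abstract List<String> listChildren(String path)
          throws IOException;

      // Persistence hooks for upgrade/rollback: load and save the
      // namespace in the existing fsimage format.
      public abstract void loadImage(File fsimage) throws IOException;
      public abstract void saveImage(File fsimage) throws IOException;
    }

Under this reading, the in-memory FSNamesystem would implement these calls
against its heap-resident tree, while an HBase-backed implementation would
translate each call into get/put operations on a metadata table.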
>> There will be a provision for others to extend this interface and
>> plug in their own NameSystem implementation.
>>
>> To get started, we have implemented the same memory-based namespace
>> implementation in a remote process, outside of the namenode JVM. In
>> addition, work is underway to implement the namesystem using HBase.
>>
>> Details of Changes:
>>
>> Created a new class called AbstractNameSystem; the existing
>> FSNamesystem is a subclass of this class, and some code from
>> FSNamesystem has been moved to its parent. Created a factory class
>> to create objects of the namespace management class; the factory
>> refers to newly added config properties to support a pluggable
>> namespace management class, and unit tests have been added for the
>> factory. Replaced constructors with factory calls, because
>> namesystem instances should now be created based on configuration.
>> Added new config properties to support the pluggable namespace
>> management class; these properties decide which namesystem class
>> will be instantiated by the factory. This change is also reflected
>> in some DFS-related webapps [JSP files] where the namesystem
>> instance is used to obtain DFS health and other stats.
>>
>> These changes aim to make the namesystem pluggable without changing
>> high-level interfaces. This is particularly tricky, since the
>> memory-based name system functionality is currently baked into
>> these interfaces, and the ultimate goal is to make the high-level
>> interfaces free of the memory-based name system.
>>
>> Consideration for Upgrade and Rollback:
>>
>> The current memory-based implementation already has code to read
>> from and write to the fsimage; we will have to make it publicly
>> accessible, which will enable us to upgrade an existing cluster
>> from FSNamesystem to a newly added name system in a future version.
>>
>> a. Upgrades: By making use of the existing Loader class for reading
>> the fsimage, we can write some code to load this image into the
>> future name system implementation.
>>
>> b. Rollback: This is even simpler; we can preserve the old fsimage
>> and start the cluster with that image by configuring the cluster to
>> use the current file-system-based name system.
>>
>> Future work:
>>
>> The current HDFS design is such that FSNamesystem is baked into
>> even high-level interfaces; this is a major hurdle in cleanly
>> implementing pluggable name systems. We aim to propose a change to
>> the interfaces into which FSNamesystem is tightly coupled.
>>
>> - Milind
>>
>> ---
>> Milind Bhandarkar
>> Chief Scientist
>> Pivotal
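The factory-plus-config mechanism described under "Details of Changes"
matches a common Hadoop pattern; a minimal sketch of it is below. The
config key, class names, and default are illustrative assumptions, not
the ones from the actual patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ReflectionUtils;

    // Hypothetical factory sketch; the patch's actual property and class
    // names may differ.
    public class NameSystemFactory {

      // Hypothetical property naming the AbstractNameSystem subclass to use.
      public static final String NAMESYSTEM_CLASS_KEY =
          "dfs.namenode.namesystem.class";

      public static AbstractNameSystem create(Configuration conf) {
        // Default to the existing in-memory FSNamesystem (which, per the
        // proposal, subclasses AbstractNameSystem) so that clusters without
        // the new property behave exactly as before.
        Class<? extends AbstractNameSystem> clazz = conf.getClass(
            NAMESYSTEM_CLASS_KEY,
            FSNamesystem.class,
            AbstractNameSystem.class);
        // ReflectionUtils injects the Configuration when the class is
        // Configurable, the usual idiom for pluggable Hadoop components.
        return ReflectionUtils.newInstance(clazz, conf);
      }
    }

A deployment would then select an implementation (the remote-process one,
or the HBase-backed one once it exists) by setting that property in
hdfs-site.xml.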