Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm
Reply-To: common-dev@hadoop.apache.org
Date: Mon, 01 Mar 2010 10:44:42 -0800
Subject: Re: Namespace partitioning using Locality Sensitive Hashing
From: Allen Wittenauer
In-Reply-To: <68432d881003010848m754a03c8vd279e7c9a90890c5@mail.gmail.com>

On 3/1/10 8:48 AM, "Ketan Dixit" wrote:

> How LSH is better than normal hashing? Because still, a client or a fixed
> namenode has to take decision of which namenode to contact in whatever
> hashing ? It looks to me that requests to files under same subtree are
> directed to the same namenode then the performance will be faster as the
> requests to the same namenode are clustered around the a part of namespace
> subtree
> (For example a part of on which client is doing some operation.)
> Is this assumption correct? Can I have more insight in this regard.

IIRC, the thought process was that this was a scalability feature, not something being done for performance. The HDFS devs are generally reluctant to keep only the hot file metadata structures in memory. So, to prevent the JVM's heap size from spiraling out of control, using separate namespaces allows you to divide and conquer.

With symlinks, this feature is essentially a solved problem. The 'who is the decision maker' issue is now the client's to resolve. As an added bonus, because it is URI based, the client may get pushed off to a completely different service. [This is definitely a feature--just think, you can store really hot files on the local file system, completely bypassing the overhead that HDFS incurs.]
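
To make the client-side part concrete, here is a rough sketch of how a client might follow URI-based symlinks to decide which service to contact. This is only an illustration, not the actual HDFS client code: the link table, the host names (nn1.example.com, nn2.example.com), the port, and the resolve() helper are all made up for the example, and it assumes links live only in the default namespace.

```python
from urllib.parse import urlsplit

# Hypothetical default namespace and symlink table (not real HDFS config).
DEFAULT_FS = "hdfs://nn1.example.com:8020"
SYMLINKS = {
    "/user": "hdfs://nn2.example.com:8020/user",  # subtree on another namenode
    "/data/hot": "file:///local/hot",             # hot files on local disk
}


def resolve(path, max_hops=8):
    """Return (scheme, authority, path) the client should contact for `path`."""
    uri = DEFAULT_FS + path
    for _ in range(max_hops):
        parts = urlsplit(uri)
        # Assumption: only the default namespace holds symlinks, so once a
        # link points elsewhere the client simply goes to that service.
        if f"{parts.scheme}://{parts.netloc}" != DEFAULT_FS:
            return parts.scheme, parts.netloc, parts.path
        # Longest matching link prefix wins; match whole path components only.
        match = max(
            (p for p in SYMLINKS
             if parts.path == p or parts.path.startswith(p + "/")),
            key=len,
            default=None,
        )
        if match is None:
            return parts.scheme, parts.netloc, parts.path
        # Rewrite the URI through the link and try again.
        uri = SYMLINKS[match] + parts.path[len(match):]
    raise RuntimeError("too many levels of symbolic links")
```

With this table, resolve("/user/alice/f.txt") sends the client to the second namenode, while resolve("/data/hot/part-0") comes back with a file: URI, so the client never talks to HDFS for that path at all.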