Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-dev@hadoop.apache.org
User-Agent: Microsoft-Entourage/12.10.0.080409
Date: Mon, 01 Mar 2010 10:44:42 -0800
Subject: Re: Namespace partitioning using Locality Sensitive Hashing
From: Allen Wittenauer <awittenauer@linkedin.com>
To: <common-dev@hadoop.apache.org>
Message-ID: <C7B14B1A.3557%awittenauer@linkedin.com>
Thread-Topic: Namespace partitioning using Locality Sensitive Hashing
Thread-Index: Acq5b0E1SPeKEAgJHk6YPYJb/vBEgA==
In-Reply-To: <68432d881003010848m754a03c8vd279e7c9a90890c5@mail.gmail.com>
Mime-version: 1.0
Content-type: text/plain;
	charset="US-ASCII"
Content-transfer-encoding: 7bit


On 3/1/10 8:48 AM, "Ketan Dixit" <ketan.dixit@gmail.com> wrote:
> How is LSH better than normal hashing? A client or a fixed namenode still
> has to decide which namenode to contact, whatever the hashing scheme. It
> looks to me that if requests to files under the same subtree are directed
> to the same namenode, then performance will be better, because the
> requests to that namenode are clustered around one part of the namespace
> subtree (for example, the part on which a client is doing some operation).
> Is this assumption correct? Can I have more insight in this regard?

IIRC, the thought process was that this was a scalability feature, not
something being done for performance.  There is a general reluctance among
the HDFS devs to store only hot file metadata structures in memory.  So, to
prevent the JVM's heap size from spiraling out of control, using separate
namespaces allows you to divide and conquer.

With symlinks, this feature is essentially a solved problem.  The 'who is
the decision maker' issue is now the client's to resolve.  As an added
bonus, because it is URI based, the client may get pushed off to a
completely different service.  [This is definitely a feature--just think,
you can store really hot files on the local file system, completely
bypassing the overhead that HDFS incurs.]
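
To make the client-side resolution concrete, here is a rough sketch in
Python.  It is not how the HDFS client actually does it.  The link table,
the host names, and the resolve() helper are all made up for illustration.
The idea is only that the client follows links until it holds a fully
qualified URI, and the scheme of that URI decides which service it talks to.

```python
from urllib.parse import urlparse

# Hypothetical link table: link path -> target (a URI or another path).
# In a real system the file system would hand the link target back to the
# client; this dict only stands in for that lookup.
SYMLINKS = {
    "/user": "hdfs://nn1.example.com/user",
    "/logs": "hdfs://nn2.example.com/logs",
    "/hot": "file:///local/cache/hot",
    "/scratch": "/hot/scratch",  # path-only target, resolved again
}

MAX_HOPS = 8  # guard against symlink loops


def longest_link_prefix(path, links):
    """Return the longest link that is a whole-component prefix of path."""
    best = None
    for prefix in links:
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return best


def resolve(path, links=SYMLINKS, max_hops=MAX_HOPS):
    """Follow links on the client; return (scheme, authority, path).

    An empty scheme means no link applied and the path stays on the
    client's default file system.
    """
    current = path
    for _ in range(max_hops):
        parsed = urlparse(current)
        if parsed.scheme:
            # Fully qualified URI: the client now knows whom to contact.
            return parsed.scheme, parsed.netloc, parsed.path
        prefix = longest_link_prefix(current, links)
        if prefix is None:
            return "", "", current
        # Splice the link target in place of the matched prefix.
        current = links[prefix] + current[len(prefix):]
    raise RuntimeError("too many levels of symbolic links: " + path)
```

With this table, resolve("/user/ketan/data.txt") sends the client to
nn1.example.com, while resolve("/hot/part-00000") comes back with a file
scheme, so the read never touches a namenode at all.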