hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: Namespace partitioning using Locality Sensitive Hashing
Date Mon, 01 Mar 2010 20:26:58 GMT
Hi Ketan,

AFAIU, hashing is used to map files and directories into different name-nodes.
Suppose you use a simple hash function on a file path h(path), and that files
with the same hash value (or within a hash range) are mapped to the same name-node.
Then files with the same parent will be randomly mapped into different
name-nodes: Pr(h(/dir/file1) = h(/dir/file2)) - is small.

The ides with LSH is to add some locality factor in the hash function in order
to increase probability of placing files from the same directory (or a subtree)
into the same name-node.

Example 1.
Suppose that you apply MD5 only to the path to the parent rather
then to the entire file path: h(/root/dir/file) = MD5(/root/dir)
Then all files of the same directory will have the same hash value and therefore
will be mapped into the same name-node.

Example 2.
If a path consists of path components pi, where i = 1,..,n
Lets define the following hash function:
h(/p1/.../pn) = 0, if n < 7
h(/p1/.../pn) = MD5(/p1/.../p7), if n >= 7
With this hash function each subtree rooted at level 7 of the namepsace hierarchy
will be entirely mapped to the same name-node.

There could be more elaborate examples.

Symlinks do provide a way to partition the namespace, as Allen points out,
although this is a static partitioning. Static partitions as opposed to dynamic
ones do not guarantee that the partitions will be "equal" in size, where "size"
may have different meanings (like number of files, or space occupied by the
files, or number of blocks).
A good hash function need to conform to some equal partitioning requirement.
Function from Example 2 would be considered bad in this sense, while Example 1
defines a good function.

This is my take on the problem. Hope it makes sense,

On 3/1/2010 8:48 AM, Ketan Dixit wrote:
> Hi,
> I am a graduate student in Computer Science department at SUNY Stony Brook.
>   I am thinking of doing a project on Hadoop for my course "Cloud Computing"
> conducted by Prof. Radu Sion.
> While going through the links of the "Yahoo open source projects for
> students"  page I found the idea
> "Research on new hashing schemes for filesystem namespace partitioning"
> interesting. It looks to me the idea is
> to assign subtree of the whole namespace to one namenode and another subtree
> to another namenode.
> How  LSH is better than normal hashing?  Because still, a client or a fixed
> namenode has to take decision of which namenode to contact in whatever
> hashing ? It looks to me that requests to files under same subtree are
> directed to the same namenode then the performance will be faster as the
> requests to the same namenode are clustered around the a part of namespace
> subtree
> (For example a part of on which client is doing some operation.) Is this
> assumption correct? Can I have more insight in this regard.
> Thanks,
> Ketan

View raw message