hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "springring" <springr...@126.com>
Subject Re: Namespace partitioning using Locality Sensitive Hashing
Date Tue, 02 Mar 2010 14:03:27 GMT
I have a question.

Now that hadoop have "maper" and "reducer" , how about the solution like Map<Tree(nodepair
set), root(node) > and Reduce<root(node), Tree(nodepair set)>
or that directly Reduce<root(node), nodepair> here nodepair can be look as a branch...

----- Original Message ----- 
From: "Konstantin Shvachko" <shv@yahoo-inc.com>
To: <common-dev@hadoop.apache.org>
Sent: Tuesday, March 02, 2010 10:21 AM
Subject: Re: Namespace partitioning using Locality Sensitive Hashing

> Symlinks is a brand new feature in HDFS.
> You can read about it in
> https://issues.apache.org/jira/browse/HDFS-245
> Documentation is here:
> https://issues.apache.org/jira/secure/attachment/12434745/design-doc-v4.txt
> Symbolic links in HDFS can point to a directory in a different file system,
> particularly on a different HDSF cluster. They are like mount points in this case.
> So you can create a symlink on cluster C1 pointing to the root of cluster C2.
> This makes C2 a sub-namespace of C1.
> --Konstantin
> On 3/1/2010 5:42 PM, Ketan Dixit wrote:
>> Hello,
>> Thank you Konstantin and  Allen for your reply. The information
>> provided really helped to improve my understanding.
>> However I still have few questions.
>> How Symlinks/ soft links are used to solve the probem of partitioning.
>> (Where do the symlinks point to? All the mapping is
>> stored in memory but symlinks point to file objects? This is little
>> confusing to me)
>> Can you please provide insight into this?
>> Thanks,
>> Ketan
>> On Mon, Mar 1, 2010 at 3:26 PM, Konstantin Shvachko<shv@yahoo-inc.com>  wrote:
>>> Hi Ketan,
>>> AFAIU, hashing is used to map files and directories into different name-nodes.
>>> Suppose you use a simple hash function on a file path h(path), and that files
>>> with the same hash value (or within a hash range) are mapped to the same name-node.
>>> Then files with the same parent will be randomly mapped into different
>>> name-nodes: Pr(h(/dir/file1) = h(/dir/file2)) - is small.
>>> The ides with LSH is to add some locality factor in the hash function in order
>>> to increase probability of placing files from the same directory (or a subtree)
>>> into the same name-node.
>>> Example 1.
>>> Suppose that you apply MD5 only to the path to the parent rather
>>> then to the entire file path: h(/root/dir/file) = MD5(/root/dir)
>>> Then all files of the same directory will have the same hash value and therefore
>>> will be mapped into the same name-node.
>>> Example 2.
>>> If a path consists of path components pi, where i = 1,..,n
>>> Lets define the following hash function:
>>> h(/p1/.../pn) = 0, if n<  7
>>> h(/p1/.../pn) = MD5(/p1/.../p7), if n>= 7
>>> With this hash function each subtree rooted at level 7 of the namepsace hierarchy
>>> will be entirely mapped to the same name-node.
>>> There could be more elaborate examples.
>>> Symlinks do provide a way to partition the namespace, as Allen points out,
>>> although this is a static partitioning. Static partitions as opposed to dynamic
>>> ones do not guarantee that the partitions will be "equal" in size, where "size"
>>> may have different meanings (like number of files, or space occupied by the
>>> files, or number of blocks).
>>> A good hash function need to conform to some equal partitioning requirement.
>>> Function from Example 2 would be considered bad in this sense, while Example
>>> defines a good function.
>>> This is my take on the problem. Hope it makes sense,
>>> --Konstantin
>>> On 3/1/2010 8:48 AM, Ketan Dixit wrote:
>>>> Hi,
>>>> I am a graduate student in Computer Science department at SUNY Stony Brook.
>>>>   I am thinking of doing a project on Hadoop for my course "Cloud Computing"
>>>> conducted by Prof. Radu Sion.
>>>> While going through the links of the "Yahoo open source projects for
>>>> students"  page I found the idea
>>>> "Research on new hashing schemes for filesystem namespace partitioning"
>>>> interesting. It looks to me the idea is
>>>> to assign subtree of the whole namespace to one namenode and another subtree
>>>> to another namenode.
>>>> How  LSH is better than normal hashing?  Because still, a client or a fixed
>>>> namenode has to take decision of which namenode to contact in whatever
>>>> hashing ? It looks to me that requests to files under same subtree are
>>>> directed to the same namenode then the performance will be faster as the
>>>> requests to the same namenode are clustered around the a part of namespace
>>>> subtree
>>>> (For example a part of on which client is doing some operation.) Is this
>>>> assumption correct? Can I have more insight in this regard.
>>>> Thanks,
>>>> Ketan
View raw message