hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: Hashing two relations
Date Sun, 04 Jul 2010 21:04:03 GMT
I am not sure what you want.

If you want to do the join on the reduce side, the MapReduce framework enables this by
grouping all the matching tuples together. Why bother building a hash table to buffer the
entire partition in memory? That will probably give you an out-of-memory error. The default
reduce-side join should be your choice in this case.
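
To illustrate the point about grouping, here is a minimal sketch in plain Python (not Hadoop code) of a reduce-side join. The function names, the "L"/"R" tags, and the in-memory shuffle() are all assumptions made for illustration; in a real job the framework performs the shuffle and calls the reducer once per key.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag):
    """Emit (join_key, (tag, value)) so the reducer can tell the sources apart."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: collect all tagged values per key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Join the tuples of one key: pair every left value with every right value."""
    left = [v for t, v in tagged_values if t == "L"]
    right = [v for t, v in tagged_values if t == "R"]
    for l, r in product(left, right):
        yield key, l, r


def reduce_side_join(left_records, right_records):
    """Run map, shuffle, and reduce in sequence over two small in-memory inputs."""
    pairs = list(map_phase(left_records, "L")) + list(map_phase(right_records, "R"))
    out = []
    for key, values in sorted(shuffle(pairs).items()):
        out.extend(reduce_join(key, values))
    return out
```

Only the values of a single key are held at once in reduce_join(), which is why this approach needs far less memory than buffering a whole partition.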


----- Original Message ----
From: abc xyz <fabc_xyz111@yahoo.com>
To: common-user@hadoop.apache.org
Sent: 2010/7/3 (Sat) 2:10:14 AM
Subject: Hashing two relations

Hey Folks,

I have to mess around with hashing. I want to take two input sources, partition 
them using a hash function, then build an in-memory hash table for each partition 
of one source, and compare the hash of each record in the same partition of the 
other source against it to join the two. 

I know that a map-side join does this (on pre-partitioned data), but I want to do 
it on the reduce side. Using job chaining, I can output (hash(key), value) from two 
map tasks on the two input files, but when it comes to the reduce stage, I have 
to take the same partition from both hash tables. I am not sure how I can 
accomplish this. Any guidance in this regard would be appreciated.
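
For reference, the scheme described above (partition both inputs with the same hash function, build a hash table on one side of each partition, probe it with the other side) can be sketched in plain Python as follows. This is not Hadoop code; the function names and the default of 4 partitions are assumptions for illustration. The key requirement is that both inputs use the same hash function and the same number of partitions, so that matching keys always land in the partition with the same index.

```python
from collections import defaultdict


def partition(records, num_partitions):
    """Split (key, value) records into buckets by hash(key) % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def hash_join_partition(build_side, probe_side):
    """Build an in-memory hash table on one side, then probe it with the other."""
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)
    for key, value in probe_side:
        for match in table.get(key, ()):
            yield key, match, value


def partitioned_hash_join(build, probe, num_partitions=4):
    """Join partition i of the build input only against partition i of the probe input."""
    build_parts = partition(build, num_partitions)
    probe_parts = partition(probe, num_partitions)
    out = []
    for b, p in zip(build_parts, probe_parts):
        out.extend(hash_join_partition(b, p))
    return out
```

The hash table for each partition holds the entire build side of that partition in memory, so the partition count has to be large enough for the biggest partition to fit.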


