hadoop-common-user mailing list archives

From abc xyz <fabc_xyz...@yahoo.com>
Subject Re: Hashing two relations
Date Mon, 05 Jul 2010 07:55:22 GMT
The default reduce-side join is a sort-merge join. I want a hash join on the 
reduce side for some experiments. I want to take the matching partition from 
each relation, build an in-memory hash table for one, and probe it with the 
partition from the other (like the Grace hash join algorithm). Any suggestions 
would be highly appreciated.
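
For illustration, the build-and-probe step described above can be sketched outside Hadoop roughly as follows. This is only a minimal sketch in plain Python, assuming both partitions are already available as lists of (key, value) pairs and that the build side fits in memory; the function name hash_join_partition and the sample data are hypothetical and not part of any Hadoop API.

```python
from collections import defaultdict


def hash_join_partition(build_rows, probe_rows):
    """Join one partition pair: build a hash table on one side, probe with the other.

    Both arguments are iterables of (key, value) pairs that are assumed to
    belong to the same hash partition.
    """
    # Build phase: index every build-side value under its join key.
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)

    # Probe phase: look up each probe-side key and emit one row per match.
    joined = []
    for key, value in probe_rows:
        for build_value in table.get(key, ()):
            joined.append((key, build_value, value))
    return joined


# Hypothetical sample partitions of two relations R and S.
r_part = [(1, "r1"), (2, "r2"), (2, "r2b")]
s_part = [(2, "s2"), (3, "s3"), (1, "s1")]
result = hash_join_partition(r_part, s_part)
```

Only the build side is held in memory here; the probe side is streamed, which is why the smaller relation is usually chosen as the build side.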

From: Gang Luo <lgpublic@yahoo.com.cn>
To: common-user@hadoop.apache.org
Sent: Sun, July 4, 2010 10:04:03 PM
Subject: Re: Hashing two relations

Not sure what you want.

If you want to do the join on the reduce side, the MapReduce framework enables 
this by grouping all the matching tuples together. Why bother building a hash 
table to buffer the entire partition in memory? This will probably cause an 
out-of-memory error. The default reduce join should be your choice in this case. 
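
As a rough illustration of the grouping mentioned above, the sort-and-group behaviour can be simulated in plain Python. This is a sketch under stated assumptions, not Hadoop code: the shuffle is imitated with sorted() and itertools.groupby, and the names map_tag, reduce_join, run_join and the tags "R" and "S" are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_tag(rows, tag):
    """Map step: tag each (key, value) pair with the relation it came from."""
    for key, value in rows:
        yield key, (tag, value)


def reduce_join(key, tagged_values):
    """Reduce step: all values for one key arrive together; pair R with S."""
    left, right = [], []
    for tag, value in tagged_values:
        (left if tag == "R" else right).append(value)
    return [(key, l, r) for l in left for r in right]


def run_join(r_rows, s_rows):
    """Imitate the shuffle by sorting on the key, then reduce each key group."""
    shuffled = sorted(
        list(map_tag(r_rows, "R")) + list(map_tag(s_rows, "S")),
        key=itemgetter(0),
    )
    output = []
    for key, group in groupby(shuffled, key=itemgetter(0)):
        output.extend(reduce_join(key, (tagged for _, tagged in group)))
    return output


joined = run_join([(2, "r2"), (1, "r1")], [(1, "s1"), (2, "s2"), (3, "s3")])
```

Note that this reducer buffers only the values of a single key at a time, not a whole partition.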


----- Original Message ----
From: abc xyz <fabc_xyz111@yahoo.com>
To: common-user@hadoop.apache.org
Sent: 2010/7/3 (Sat) 2:10:14 AM
Subject: Hashing two relations

Hey Folks,

I have to mess around with hashing. I want to take two input sources, partition 
them using a hash function, then build an in-memory hash table for each partition 
of one source, and compare the hash of each record in the same partition of the 
other source against it to join the two. 

I know that a map-side join does this (on pre-partitioned data), but I want to do 
it on the reduce side. Using job chaining, I can output (hash(key), value) from two 
map tasks on the two input files, but when it comes to the reduce stage, I have 
to take the same partition from both hash tables. I am not sure how I can 
accomplish this. Any guidance in this regard would be appreciated.
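
One way to picture getting "the same partition from both" inputs: if records from both sources are routed with the same partition function on the join key, equal keys land in the same partition, and a source tag keeps the two sides apart. The sketch below is plain Python, not Hadoop code. It assumes zlib.crc32 as a stand-in hash (Hadoop's default HashPartitioner instead uses (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks), and the names partition_of and shuffle and the tags "R" and "S" are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (crc32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(r_rows, s_rows, num_partitions):
    """Route both inputs with the same function, keeping the sides separate by tag."""
    partitions = defaultdict(lambda: {"R": [], "S": []})
    for tag, rows in (("R", r_rows), ("S", s_rows)):
        for key, value in rows:
            index = partition_of(key, num_partitions)
            partitions[index][tag].append((key, value))
    return partitions


# Hypothetical sample inputs; key "b" occurs in both sources.
parts = shuffle([("a", 1), ("b", 2)], [("b", 20), ("c", 30)], 4)
b_index = partition_of("b", 4)
```

Each partitions[i] then holds the R side and the S side of partition i, which is the input a per-partition hash join would need.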

