hadoop-mapreduce-user mailing list archives

From Mike Spreitzer <mspre...@us.ibm.com>
Subject Re: What is the right way to do map-side joins in Hadoop 1.0?
Date Sun, 15 Jan 2012 21:51:41 GMT
Yes, I did look at CompositeInputFormat.  That is why I remarked that I 
supposed I should be looking under org.apache.hadoop.mapreduce.*, and 
sent the earlier question about why CompositeInputFormat is not under 
org.apache.hadoop.mapreduce.* in Hadoop 1.0.0.  But I have gotten no 
answers yet.

And yes, I very much want to do the joins as 'early' as possible (i.e., in 
the mapper not the reducer); I do not want to waste work sending copies of 
the large constant dataset around if I do not have to.
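In Hadoop 1.0 the join framework ships only with the old API, under 
org.apache.hadoop.mapred.join, so CompositeInputFormat is configured 
through a JobConf rather than the new mapreduce classes. A minimal 
job-configuration sketch (the driver class and paths are hypothetical; 
both inputs must already be sorted by key and partitioned identically 
for the join to work):

```java
// Sketch only: requires the Hadoop 1.0 jars; paths are hypothetical.
JobConf conf = new JobConf(MyJoinDriver.class);
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/data/constant-dataset"), new Path("/data/prev-iteration")));
// The mapper then receives (key, TupleWritable), where the tuple
// holds one value from each joined source.
```

Besides "inner", the join expression also supports "outer" and 
"override" operations.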

BTW, my outline below was written too hastily; I see no obvious way to get 
the large constant dataset and the previous iteration's output placed 
consistently, which is going to be needed for top performance.


From:   Bejoy Ks <bejoy.hadoop@gmail.com>
To:     mapreduce-user@hadoop.apache.org
Date:   01/15/2012 07:49 AM
Subject:        Re: What is the right way to do map-side joins in Hadoop 1.0?

Hi Mike
           Have a look at CompositeInputFormat. I guess it is what you are 
looking for to achieve map-side joins. If you are fine with a reduce-side 
join, go with MultipleInputs. I have tried the same sort of joins 
using MultipleInputs and have scribbled something on the same. Check 
out whether it'd be useful for you. (A very crude implementation :), you may 
have better ways )

Hope it helps!...
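For the reduce-side alternative mentioned above, the Hadoop 1.0 class is 
org.apache.hadoop.mapred.lib.MultipleInputs. A hedged sketch, with 
hypothetical mapper classes and paths:

```java
// Sketch only: requires the Hadoop 1.0 jars; mappers and paths are
// hypothetical. Each mapper tags its records with their source; the
// reducer sees both tagged values for a key and performs the join.
JobConf conf = new JobConf(JoinDriver.class);
MultipleInputs.addInputPath(conf, new Path("/data/constant-dataset"),
    KeyValueTextInputFormat.class, ConstantTagMapper.class);
MultipleInputs.addInputPath(conf, new Path("/data/prev-iteration"),
    KeyValueTextInputFormat.class, OutputTagMapper.class);
conf.setReducerClass(TaggedJoinReducer.class);
```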


On Sun, Jan 15, 2012 at 4:34 PM, Mike Spreitzer <mspreitz@us.ibm.com> wrote:
BTW, each key appears exactly once in the large constant dataset, and 
exactly once in each MR job's output. 

I am thinking the right approach is to consistently partition the job 
output and the large constant dataset, with the number of partitions being 
the number of reduce tasks; each part goes into its own file.  Make an 
InputFormat whose number of splits equals the number of reduce tasks. 
Reading a split would consist of reading the corresponding pair of files, 
stepping through each in key order.  This seems like something that should 
already be provided by org.apache.hadoop.mapreduce.*. 
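The consistent-partitioning scheme sketched above can be demonstrated 
without Hadoop: apply the same hash partitioner (the logic of Hadoop's 
HashPartitioner) to both datasets with the same partition count, then 
merge-join corresponding partition pairs. A self-contained toy version 
(class and method names are invented for illustration):

```java
import java.util.*;

public class PartitionedJoin {
    // Mirrors Hadoop's HashPartitioner: using the same function and
    // partition count on both datasets guarantees that matching keys
    // land in partitions with the same index.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Split a dataset into n sorted "part files".
    static List<TreeMap<String, String>> partitionAll(Map<String, String> data, int n) {
        List<TreeMap<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new TreeMap<>());
        data.forEach((k, v) -> parts.get(partition(k, n)).put(k, v));
        return parts;
    }

    // Join corresponding partition pairs; per the thread, each key
    // appears at most once on each side.
    static Map<String, String> join(Map<String, String> constant,
                                    Map<String, String> output, int n) {
        List<TreeMap<String, String>> a = partitionAll(constant, n);
        List<TreeMap<String, String>> b = partitionAll(output, n);
        Map<String, String> joined = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            for (Map.Entry<String, String> e : a.get(i).entrySet()) {
                String other = b.get(i).get(e.getKey());
                if (other != null) joined.put(e.getKey(), e.getValue() + "|" + other);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> constant = Map.of("k1", "c1", "k2", "c2", "k3", "c3");
        Map<String, String> output = Map.of("k1", "o1", "k2", "o2", "k3", "o3");
        System.out.println(join(constant, output, 4));
    }
}
```

In a real job, each of the n partition pairs would be one input split, 
read by stepping through the two sorted files in parallel.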

