hadoop-mapreduce-user mailing list archives

From Bejoy Ks <bejoy.had...@gmail.com>
Subject Re: What is the right way to do map-side joins in Hadoop 1.0?
Date Sun, 15 Jan 2012 12:48:54 GMT
Hi Mike
           Have a look at CompositeInputFormat; I think it is what you are
looking for to achieve map-side joins. If you are fine with a reduce-side
join instead, go with MultipleInputFormat. I have tried the same sort of joins
using MultipleInputFormat and have written up some notes on them; check
whether they are useful for you. (It is a very crude implementation :), you may
have better ways.)

Hope it helps!...


On Sun, Jan 15, 2012 at 4:34 PM, Mike Spreitzer <mspreitz@us.ibm.com> wrote:

> BTW, each key appears exactly once in the large constant dataset, and
> exactly once in each MR job's output.
> I am thinking the right approach is to consistently partition the job
> output and the large constant dataset, with the number of partitions being
> the number of reduce tasks; each part goes into its own file.  Make an
> InputFormat whose number of splits equals the number of reduce tasks.
>  Reading a split will consist of reading a corresponding pair of files,
> stepping through each.  This seems like something that should already be
> provided somewhere in org.apache.hadoop.mapreduce.*.
> Thanks,
> Mike
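
The pairwise file scan Mike describes is essentially a sorted merge join: step through two key-sorted partition files, advancing whichever side is behind and emitting a joined record on key match. A minimal sketch in plain Java, using in-memory sorted maps as hypothetical stand-ins for the two files of a partition pair (each key appears at most once per side, as stated above):

```java
import java.util.*;

// Sketch of the per-split work: merge-join two key-sorted streams in
// which each key appears at most once per side. In a real job the two
// sides would be the corresponding partition files; this stand-in uses
// SortedMap iterators instead of Hadoop record readers.
public class MergeJoin {
    // Inner join of two sorted key->value streams; emits
    // {key, leftValue, rightValue} for keys present in both inputs.
    public static List<String[]> join(SortedMap<String, String> left,
                                      SortedMap<String, String> right) {
        List<String[]> out = new ArrayList<>();
        Iterator<Map.Entry<String, String>> li = left.entrySet().iterator();
        Iterator<Map.Entry<String, String>> ri = right.entrySet().iterator();
        Map.Entry<String, String> l = li.hasNext() ? li.next() : null;
        Map.Entry<String, String> r = ri.hasNext() ? ri.next() : null;
        while (l != null && r != null) {
            int cmp = l.getKey().compareTo(r.getKey());
            if (cmp == 0) {               // keys match: emit joined record
                out.add(new String[] { l.getKey(), l.getValue(), r.getValue() });
                l = li.hasNext() ? li.next() : null;
                r = ri.hasNext() ? ri.next() : null;
            } else if (cmp < 0) {         // left side is behind: advance it
                l = li.hasNext() ? li.next() : null;
            } else {                      // right side is behind: advance it
                r = ri.hasNext() ? ri.next() : null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: one partition of the large constant
        // dataset and the matching partition of a job's output.
        SortedMap<String, String> constant = new TreeMap<>(
            Map.of("k1", "c1", "k2", "c2", "k3", "c3"));
        SortedMap<String, String> jobOut = new TreeMap<>(
            Map.of("k2", "j2", "k3", "j3", "k4", "j4"));
        for (String[] rec : join(constant, jobOut))
            System.out.println(rec[0] + "\t" + rec[1] + "\t" + rec[2]);
    }
}
```

CompositeInputFormat performs this kind of stepped merge over equal-partitioned, sorted inputs for you, which is why it requires both datasets to be partitioned the same way with the same number of parts.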
