hadoop-mapreduce-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Newbie: Inner join - reduce side
Date Thu, 12 Nov 2009 15:19:36 GMT
Ok, I missed the org.apache.hadoop.contrib.utils.join package, which
obviously does this exact thing...

Sorry, answering my own question.
Tim


On Thu, Nov 12, 2009 at 4:14 PM, Tim Robertson
<timrobertson100@gmail.com> wrote:
> Hi all,
>
> I have 2 key-value pair (KVP) files of 200 million+ rows each, and
> plan to do a reduce-side join (my first...).
>
> Input 1
> ----------
> KEY  TC_ID
>
> Input 2
> ----------
> KEY  OCC_ID
>
> I aim to produce an output of:
>
> Output
> ----------
> OCC_ID  TC_ID       (if there are any many-to-many matches I would flag an error)
>
>
> My plan was to indicate in the map which source each ID came from
> (e.g. emit tc-123 or occ-234 depending on the input source), and then
> in the reduce pull out the records.
>
> Can someone please sanity check if this approach is sound?  I am
> pretty sure there should be something existing I can use, but I can't
> find it.
>
> Can I determine in the Map which input file the record is coming from
> or do I need multiple jobs?
>
> Many thanks,
> Tim
>
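
The tagging plan in the quoted message can be sketched in plain Java,
without any Hadoop dependency, to check the logic. This is only a minimal
sketch under stated assumptions: records are taken to be tab-separated
"KEY<TAB>ID" lines, a TreeMap stands in for the shuffle, and the class and
method names (ReduceSideJoinSketch, map, reduce, join) are hypothetical.
It is not the org.apache.hadoop.contrib.utils.join implementation. In a
real job the tag would usually be derived from the input path of the
current split rather than passed in by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of a tagged reduce-side inner join.
public class ReduceSideJoinSketch {
    static final String TC = "tc-";   // tag for Input 1 (KEY, TC_ID)
    static final String OCC = "occ-"; // tag for Input 2 (KEY, OCC_ID)

    // "Map" step: split an assumed "KEY<TAB>ID" line and emit (KEY, tag + ID).
    static void map(String line, String tag, Map<String, List<String>> shuffle) {
        String[] parts = line.split("\t", 2);
        if (parts.length < 2) {
            return; // skip malformed records
        }
        shuffle.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(tag + parts[1]);
    }

    // "Reduce" step: separate the values by tag, then pair them up.
    static void reduce(String key, List<String> values, List<String> out) {
        List<String> tcIds = new ArrayList<>();
        List<String> occIds = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith(TC)) {
                tcIds.add(v.substring(TC.length()));
            } else if (v.startsWith(OCC)) {
                occIds.add(v.substring(OCC.length()));
            }
        }
        // Flag many-to-many keys as an error, as the original post proposes.
        if (tcIds.size() > 1 && occIds.size() > 1) {
            throw new IllegalStateException("many-to-many match on key " + key);
        }
        // Inner join: keys present in only one input produce no output.
        for (String occ : occIds) {
            for (String tc : tcIds) {
                out.add(occ + "\t" + tc);
            }
        }
    }

    // Runs the map step over both inputs, groups by key, then reduces each key.
    static List<String> join(List<String> input1, List<String> input2) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String line : input1) {
            map(line, TC, shuffle);
        }
        for (String line : input2) {
            map(line, OCC, shuffle);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            reduce(e.getKey(), e.getValue(), out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input1 = Arrays.asList("k1\t10", "k2\t20");
        List<String> input2 = Arrays.asList("k1\t111", "k1\t112", "k3\t333");
        for (String row : join(input1, input2)) {
            System.out.println(row); // OCC_ID<TAB>TC_ID
        }
    }
}
```

Because both inputs go through one map step and are told apart only by the
tag, a single job is enough in this sketch. Note that the reducer here
buffers every ID for a key in memory, which is acceptable only if no key
has a very large number of matching rows.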
