hadoop-mapreduce-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Newbie: Inner join - reduce side
Date Thu, 12 Nov 2009 15:14:28 GMT
Hi all,

I have two key-value pair (KVP) files of 200 million+ rows each, and
plan to do a reduce-side join (my first...).

Input 1
----------
KEY  TC_ID

Input 2
----------
KEY  OCC_ID

I aim to produce an output of:

Output
----------
OCC_ID  TC_ID       (if there are any many-to-many matches I would flag an error)


My plan is to tag each ID in the map with the source it came from
(e.g. emit tc-123 or occ-234 depending on the input source), and then
in the reduce split the values for each KEY by tag and pair them up.
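
To make the plan concrete, here is a minimal sketch of it in plain Python. It only simulates the map, shuffle and reduce steps in memory and is not Hadoop code. The function names (map_record, reduce_key, run_join) and the source labels "input1" and "input2" are hypothetical, and it assumes "many2many" means a KEY with more than one ID on both sides.

```python
from collections import defaultdict


def map_record(source, key, value):
    """Map step: tag the value with a prefix naming the input it came from."""
    prefix = {"input1": "tc", "input2": "occ"}[source]
    return key, f"{prefix}-{value}"


def reduce_key(key, tagged_values):
    """Reduce step: split values by tag, then pair each OCC_ID with each TC_ID."""
    tc_ids, occ_ids = [], []
    for tagged in tagged_values:
        # Split on the first "-" only, so IDs containing "-" survive intact.
        tag, _, value = tagged.partition("-")
        (tc_ids if tag == "tc" else occ_ids).append(value)
    # Assumption: many-to-many means several IDs on both sides of one key.
    if len(tc_ids) > 1 and len(occ_ids) > 1:
        raise ValueError(f"many-to-many join on key {key!r}")
    # Inner join: a key present in only one input yields no output rows.
    return [(occ, tc) for occ in occ_ids for tc in tc_ids]


def run_join(input1, input2):
    """Simulate map, shuffle (group by key) and reduce over two KVP inputs."""
    groups = defaultdict(list)
    for source, rows in (("input1", input1), ("input2", input2)):
        for key, value in rows:
            out_key, tagged = map_record(source, key, value)
            groups[out_key].append(tagged)
    output = []
    for key in sorted(groups):
        output.extend(reduce_key(key, groups[key]))
    return output
```

Calling run_join with lists of (KEY, TC_ID) and (KEY, OCC_ID) tuples returns a list of (OCC_ID, TC_ID) pairs. In a real job the grouping would be done by the framework's shuffle, and the reducer would see the tagged values for one KEY at a time.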

Can someone please sanity check if this approach is sound?  I am
pretty sure there should be something existing I can use, but I can't
find it.

Can I determine in the Map which input file the record is coming from
or do I need multiple jobs?

Many thanks,
Tim
