hadoop-common-user mailing list archives

From pmg <parmod.me...@gmail.com>
Subject multiple file input
Date Thu, 18 Jun 2009 17:56:13 GMT

I am evaluating Hadoop for a problem that computes a Cartesian product of
the input from one file of 600K records (FileA) with a set of files (FileB1,
FileB2, FileB3) containing 2 million lines in total.

Each line from FileA gets compared with every line from FileB1, FileB2, etc.
FileB1, FileB2, etc. are in a different input directory from FileA.


There are two input directories:

1. input1 directory with a single file of 600K records - FileA
2. input2 directory split into several files with 2 million records in
total - FileB1, FileB2, etc.

How can I have a map that reads a line from FileA in directory input1 and
compares it with each line from the files in input2?
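
To make the intended computation concrete, here is a minimal sketch of the
comparison in plain Python. It does not use the Hadoop API. It assumes FileA
is small enough to hold in memory while the lines of FileB1, FileB2, etc. are
streamed past it. The names cross_compare and compare are hypothetical, and
the equality test in the example is only a placeholder for the real
comparison.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def cross_compare(
    a_lines: List[str],
    b_lines: Iterable[str],
    compare: Callable[[str, str], bool],
) -> Iterator[Tuple[str, str]]:
    """Yield every (a, b) pair for which compare(a, b) is true.

    a_lines: the smaller side (FileA), assumed to fit in memory.
    b_lines: the larger side (lines of FileB1, FileB2, ...), streamed once.
    """
    for b in b_lines:        # stream the large input one line at a time
        for a in a_lines:    # compare it against every line of FileA
            if compare(a, b):
                yield (a, b)


if __name__ == "__main__":
    # Tiny stand-ins for the real files.
    file_a = ["x", "y"]
    file_b = ["y", "z", "x"]
    # Placeholder comparison: exact equality of the two lines.
    for pair in cross_compare(file_a, file_b, lambda a, b: a == b):
        print(pair)
```

In a map task, b_lines would correspond to the records of one split from
input2, and a_lines would be a copy of FileA loaded once per task. With 600K
and 2 million records the full product is about 1.2 trillion comparisons, so
the cost of compare dominates.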

What is the best way forward? I have seen plenty of examples that map each
record from a single input file and reduce the results into an output.

View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24095358.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
