hadoop-common-user mailing list archives

From laser08150815 <la...@laserxyz.de>
Subject Re: multiple file input
Date Tue, 08 Dec 2009 14:19:23 GMT


pmg wrote:
> 
> I am evaluating Hadoop for a problem that does a Cartesian product of input
> from one file of 600K records (FileA) with a set of files (FileB1,
> FileB2, FileB3) containing 2 million lines in total.
> 
> Each line from FileA gets compared with every line from FileB1, FileB2,
> etc. FileB1, FileB2, etc. are in a different input directory.
> 
> So....
> 
> Two input directories 
> 
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2 million records -
> FileB1, FileB2 etc.
> 
> How can I have a map that reads a line from FileA in directory input1
> and compares that line with each line from input2?
> 
> What is the best way forward? I have seen plenty of examples that map
> each record from a single input file and reduce into an output.
> 
> thanks
> 


I had a similar problem and solved it by writing a custom InputFormat (see
attachment). You should improve the methods ACrossBInputSplit.getLength,
ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
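
Since the attachment is not shown here, the following is only a rough
illustration of the general idea, not the attached code. It is a plain-Java
sketch with no Hadoop dependencies: every chunk of A is paired with every
chunk of B to form one "cross split", and a reader walks all (a, b) pairs of
a split while tracking position and progress. The names CrossProductSketch,
CrossSplit, CrossReader and pairCount are hypothetical, and counting pairs as
the split "length" is an assumption (a real InputSplit.getLength reports
bytes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class CrossProductSketch {

    /** Hypothetical stand-in for a split pairing one chunk of A with one chunk of B. */
    static final class CrossSplit {
        final List<String> aLines;
        final List<String> bLines;

        CrossSplit(List<String> aLines, List<String> bLines) {
            this.aLines = aLines;
            this.bLines = bLines;
        }

        /** Number of (a, b) pairs in this split; assumed here as the split's "length". */
        long pairCount() {
            return (long) aLines.size() * bLines.size();
        }
    }

    /** Builds one CrossSplit for every (chunk of A, chunk of B) combination. */
    static List<CrossSplit> crossSplits(List<List<String>> aChunks,
                                        List<List<String>> bChunks) {
        List<CrossSplit> splits = new ArrayList<>();
        for (List<String> a : aChunks) {
            for (List<String> b : bChunks) {
                splits.add(new CrossSplit(a, b));
            }
        }
        return splits;
    }

    /** Walks every pair of a split in row-major order, tracking position and progress. */
    static final class CrossReader implements Iterator<String[]> {
        private final CrossSplit split;
        private long pos = 0; // pairs emitted so far

        CrossReader(CrossSplit split) {
            this.split = split;
        }

        @Override
        public boolean hasNext() {
            return pos < split.pairCount();
        }

        @Override
        public String[] next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            int bSize = split.bLines.size(); // non-zero here, since hasNext() was true
            String a = split.aLines.get((int) (pos / bSize));
            String b = split.bLines.get((int) (pos % bSize));
            pos++;
            return new String[] { a, b };
        }

        long getPos() {
            return pos;
        }

        /** Fraction of pairs emitted so far; an empty split counts as finished. */
        float getProgress() {
            long total = split.pairCount();
            return total == 0 ? 1.0f : (float) pos / total;
        }
    }

    public static void main(String[] args) {
        List<List<String>> aChunks = Arrays.asList(Arrays.asList("a1", "a2"));
        List<List<String>> bChunks =
                Arrays.asList(Arrays.asList("b1"), Arrays.asList("b2", "b3"));
        for (CrossSplit split : crossSplits(aChunks, bChunks)) {
            CrossReader reader = new CrossReader(split);
            while (reader.hasNext()) {
                String[] pair = reader.next();
                System.out.println(pair[0] + " x " + pair[1]
                        + " (progress " + reader.getProgress() + ")");
            }
        }
    }
}
```

In a real job each cross split would go to one map task, so the number of
map tasks grows with (chunks of A) x (chunks of B); choosing the chunk sizes
is the main tuning knob in this kind of design.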
-- 
View this message in context: http://old.nabble.com/multiple-file-input-tp24095358p26694569.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

