hadoop-common-user mailing list archives

From pmg <parmod.me...@gmail.com>
Subject Re: multiple file input
Date Fri, 19 Jun 2009 21:41:34 GMT

To simplify, I have reduced my input to two files: FileA and FileB.

As I said earlier, I want to compare every record of FileA against every
record in FileB. I know this is O(n^2), but that is what the process
requires. I wrote a simple InputFormat and RecordReader, but it seems each
file is read serially, one after another. How can my record reader hold
references to both files at the same time, so that I can create a cross
list of FileA and FileB for the mapper?
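
A minimal sketch of the cross-product idea Owen describes in the quoted
message, written in plain Java without the Hadoop classes so it can be run
on its own. The names SplitPair, crossSplits and crossRecords are
hypothetical, and plain strings stand in for FileSplit objects and lines;
a real job would put this logic in its own InputFormat and RecordReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CrossSplits {
    /** Hypothetical stand-in for a pair of FileSplits, one from each directory. */
    static final class SplitPair {
        final String left;
        final String right;

        SplitPair(String left, String right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return left + " x " + right;
        }
    }

    /** Cross-product of the two split lists: one SplitPair (one map task) per combination. */
    static List<SplitPair> crossSplits(List<String> dir1Splits, List<String> dir2Splits) {
        List<SplitPair> pairs = new ArrayList<>();
        for (String l : dir1Splits) {
            for (String r : dir2Splits) {
                pairs.add(new SplitPair(l, r));
            }
        }
        return pairs;
    }

    /** What a record reader would emit for one SplitPair: each left line crossed with each right line. */
    static List<String[]> crossRecords(List<String> leftLines, List<String> rightLines) {
        List<String[]> records = new ArrayList<>();
        for (String l : leftLines) {
            for (String r : rightLines) {
                records.add(new String[] {l, r});
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed split names, for illustration only.
        List<SplitPair> pairs = crossSplits(
                Arrays.asList("FileA-0"),
                Arrays.asList("FileB1-0", "FileB2-0"));
        System.out.println(pairs);

        for (String[] rec : crossRecords(Arrays.asList("a1", "a2"), Arrays.asList("b1", "b2"))) {
            System.out.println(rec[0] + "\t" + rec[1]);
        }
    }
}
```

The mapper then receives one (left, right) pair per call, so it never needs
to hold a reference to either file itself.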

Basically, the way I see it is to give the mapper one record from FileA and
all records from FileB so that the mapper can do the n^2 comparison and
forward them to


pmg wrote:
> Thanks owen. Are there any examples that I can look at? 
> owen.omalley wrote:
>> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>>> Each line from FileA gets compared with every line from FileB1,
>>> FileB2 etc. FileB1, FileB2 etc. are in a different input directory
>> In the general case, I'd define an InputFormat that takes two  
>> directories, computes the input splits for each directory and  
>> generates a new list of InputSplits that is the cross-product of the  
>> two lists. So instead of FileSplit, it would use a FileSplitPair that  
>> gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
>> reader would return a TextPair with left and right records (ie.  
>> lines). Clearly, you read the first line of split1 and cross it by  
>> each line from split2, then move to the second line of split1 and  
>> process each line from split2, etc.
>> You'll need to ensure that you don't overwhelm the system with
>> too many input splits (ie. maps). Also don't forget that N^2/M grows
>> much faster with the size of the input (N) than the M machines can
>> handle in a fixed amount of time.
>>> Two input directories
>>> 1. input1 directory with a single file of 600K records - FileA
>>> 2. input2 directory segmented into different files with 2 million
>>> records: FileB1, FileB2 etc.
>> In this particular case, it would be right to load all of FileA into  
>> memory and process the chunks of FileB/part-*. Then it would be much  
>> faster than needing to re-read the file over and over again, but  
>> otherwise it would be the same.
>> -- Owen
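
A rough sketch of the in-memory variant Owen suggests above, again in plain
Java rather than against the Hadoop API. FileA is assumed to fit in memory
as a list, and the lines of one FileB chunk are streamed through once. The
names InMemoryCross, compareChunk and totalComparisons are hypothetical, and
the equality test is only a placeholder for the real comparison. With the
sizes from the thread, 600K x 2 million records is 1.2 x 10^12 comparisons
in total, which is the N^2 cost Owen warns about.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiPredicate;

class InMemoryCross {
    /**
     * Compares every in-memory FileA record against each streamed FileB line.
     * FileB is read exactly once; FileA is re-scanned from memory per line.
     */
    static List<String> compareChunk(List<String> fileA,
                                     Iterator<String> fileBLines,
                                     BiPredicate<String, String> matches) {
        List<String> out = new ArrayList<>();
        while (fileBLines.hasNext()) {
            String b = fileBLines.next();
            for (String a : fileA) {
                if (matches.test(a, b)) {
                    out.add(a + "\t" + b);
                }
            }
        }
        return out;
    }

    /** Total pairwise comparisons; long arithmetic, since the product overflows int. */
    static long totalComparisons(long recordsA, long recordsB) {
        return recordsA * recordsB;
    }

    public static void main(String[] args) {
        // Assumed toy data standing in for FileA and one chunk of FileB.
        List<String> fileA = Arrays.asList("apple", "pear");
        BufferedReader chunk = new BufferedReader(new StringReader("pear\nplum\napple\n"));

        System.out.println(compareChunk(fileA, chunk.lines().iterator(), String::equals));
        System.out.println(totalComparisons(600_000L, 2_000_000L));
    }
}
```

In a real job the FileA list would be loaded once per map task, before the
first FileB record is processed, so each chunk of FileB is read only once.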

View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
