Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of lists@nabble.com designates
 216.139.236.158 as permitted sender)
Message-ID: <24105398.post@talk.nabble.com>
Date: Thu, 18 Jun 2009 21:38:26 -0700 (PDT)
From: pmg <parmod.mehta@gmail.com>
To: core-user@hadoop.apache.org
Subject: Re: multiple file input
In-Reply-To: <7760A9B5-1E46-4D1E-A0E6-45C5BB46BB94@apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
References: <24095358.post@talk.nabble.com>
 <7760A9B5-1E46-4D1E-A0E6-45C5BB46BB94@apache.org>


Thanks owen. Are there any examples that I can look at? 


owen.omalley wrote:
> 
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> 
>> Each line from FileA gets compared with every line from FileB1,  
>> FileB2 etc.
>> etc. FileB1, FileB2 etc. are in a different input directory
> 
> In the general case, I'd define an InputFormat that takes two  
> directories, computes the input splits for each directory and  
> generates a new list of InputSplits that is the cross-product of the  
> two lists. So instead of FileSplit, it would use a FileSplitPair that  
> gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
> reader would return a TextPair with left and right records (ie.  
> lines). Clearly, you read the first line of split1 and cross it by  
> each line from split2, then move to the second line of split1 and  
> process each line from split2, etc.
> 
> You'll need to ensure that you don't overwhelm the system with either  
> too many input splits (ie. maps). Also don't forget that N^2/M grows  
> much faster with the size of the input (N) than the M machines can  
> handle in a fixed amount of time.
> 
>> Two input directories
>>
>> 1. input1 directory with a single file of 600K records - FileA
>> 2. input2 directory segmented into different files with 2Million  
>> records -
>> FileB1, FileB2 etc.
> 
> In this particular case, it would be right to load all of FileA into  
> memory and process the chunks of FileB/part-*. Then it would be much  
> faster than needing to re-read the file over and over again, but  
> otherwise it would be the same.
> 
> -- Owen
> 
> 

-- 
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24105398.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.