Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of lists@nabble.com designates
 216.139.236.158 as permitted sender)
Message-ID: <26694569.post@talk.nabble.com>
Date: Tue, 8 Dec 2009 06:19:23 -0800 (PST)
From: laser08150815 <laser@laserxyz.de>
To: core-user@hadoop.apache.org
Subject: Re: multiple file input
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


pmg wrote:
> 
> I am evaluating Hadoop for a problem that computes a Cartesian product of
> the input from one file of 600K records (FileA) with a set of files
> (FileB1, FileB2, FileB3) containing 2 million lines in total.
> 
> Each line from FileA gets compared with every line from FileB1, FileB2,
> etc. FileB1, FileB2, etc. are in a different input directory.
> 
> So....
> 
> Two input directories 
> 
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2 million records -
> FileB1, FileB2 etc.
> 
> How can I have a map that reads a line from FileA in directory input1
> and compares the line with each line from input2? 
> 
> What is the best way forward? I have seen plenty of examples that map
> each record from a single input file and reduce it into an output.
> 
> thanks
> 


I had a similar problem and solved it by writing a custom InputFormat (see
attachment). You should improve the methods ACrossBInputSplit.getLength,
ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress.
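
For readers without the attachment, here is a minimal sketch of the general
idea in plain Java. It is not the attached code and does not use Hadoop's
interfaces. It assumes each split pairs one block of lines from A with one
block of lines from B, so that every (a, b) pair is produced exactly once
across all splits. The names CrossProductSketch, PairSplit, PairReader and
crossSplits are hypothetical; only getLength, getPos and getProgress echo the
method names mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java stand-ins, not Hadoop's InputSplit/RecordReader.
class CrossProductSketch {

    /** One unit of work: a block of lines from A paired with a block from B. */
    static final class PairSplit {
        final List<String> a;
        final List<String> b;

        PairSplit(List<String> a, List<String> b) {
            this.a = a;
            this.b = b;
        }

        /** Number of (a, b) pairs in this split, used here as its "length". */
        long getLength() {
            return (long) a.size() * b.size();
        }
    }

    /** Walks all pairs of one split in row-major order (A outer, B inner). */
    static final class PairReader {
        private final PairSplit split;
        private long pos = 0; // pairs emitted so far

        PairReader(PairSplit split) {
            this.split = split;
        }

        /** Fills pair[0], pair[1] with the next pair; returns false when done. */
        boolean next(String[] pair) {
            if (pos >= split.getLength()) {
                return false;
            }
            int bSize = split.b.size(); // > 0 here because getLength() > pos
            pair[0] = split.a.get((int) (pos / bSize));
            pair[1] = split.b.get((int) (pos % bSize));
            pos++;
            return true;
        }

        long getPos() {
            return pos;
        }

        /** Fraction of this split's pairs already emitted, in [0, 1]. */
        float getProgress() {
            long len = split.getLength();
            return len == 0 ? 1.0f : (float) pos / len;
        }
    }

    /** Cuts a list into consecutive blocks of at most blockSize elements. */
    static List<List<String>> blocks(List<String> lines, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += blockSize) {
            int end = Math.min(i + blockSize, lines.size());
            out.add(new ArrayList<>(lines.subList(i, end)));
        }
        return out;
    }

    /** One split per (block of A, block of B) combination. */
    static List<PairSplit> crossSplits(List<String> a, List<String> b,
                                       int blockA, int blockB) {
        List<PairSplit> splits = new ArrayList<>();
        for (List<String> blockOfA : blocks(a, blockA)) {
            for (List<String> blockOfB : blocks(b, blockB)) {
                splits.add(new PairSplit(blockOfA, blockOfB));
            }
        }
        return splits;
    }
}
```

In a real job the blocks would be byte ranges of the files in input1 and
input2 rather than in-memory lists, and each map task would receive one such
pair of ranges. The block sizes then control how many map tasks are created
and how often each part of A and B is re-read.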