From: laser08150815
To: core-user@hadoop.apache.org
Date: Tue, 8 Dec 2009 06:19:23 -0800 (PST)
Subject: Re: multiple file input
Message-ID: <26694569.post@talk.nabble.com>

pmg wrote:
>
> I am evaluating Hadoop for a problem that computes a Cartesian product of
> the input from one file of 600K records (FileA) with a set of files
> (FileB1, FileB2, FileB3) containing 2 million lines in total.
>
> Each line from FileA gets compared with every line from FileB1, FileB2,
> etc. FileB1, FileB2, etc. are in a different input directory.
>
> So:
>
> Two input directories
>
> 1. input1 directory with a single file of 600K records - FileA
> 2. input2 directory segmented into different files with 2 million records
>    in total - FileB1, FileB2, etc.
>
> How can I have a map that reads a line from FileA in directory input1
> and compares the line with each line from input2?
>
> What is the best way forward? I have seen plenty of examples that map
> each record from a single input file and reduce into an output.
>
> thanks
>

I had a similar problem and solved it by writing a custom InputFormat (see attachment). The methods ACrossBInputSplit.getLength, ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress should still be improved.

-- 
View this message in context: http://old.nabble.com/multiple-file-input-tp24095358p26694569.html
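
The attachment mentioned above is not reproduced here. As a rough illustration of the idea behind a cross-product InputFormat, here is a minimal standalone sketch in plain Java. It does not use the Hadoop API and is not the attached code: the names ACrossBSketch, CrossSplit, CrossReader, getSplits and aChunkSize are all hypothetical. Each split pairs a chunk of FileA lines with one FileB file, and the reader walks every (a, b) pair in that split. Split length, position and progress are counted in pairs, which is one possible way to fill in getLength, getPos and getProgress.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical standalone sketch; not the attached code and not the Hadoop API.
final class ACrossBSketch {

    /** One unit of work: a chunk of A's lines paired with the lines of one B file. */
    static final class CrossSplit {
        final List<String> aLines;
        final List<String> bLines;

        CrossSplit(List<String> aLines, List<String> bLines) {
            this.aLines = aLines;
            this.bLines = bLines;
        }

        /** Number of (a, b) pairs in this split. */
        long getLength() {
            return (long) aLines.size() * bLines.size();
        }
    }

    /** Iterates over every (a, b) pair of one split and tracks how far it has got. */
    static final class CrossReader implements Iterator<String[]> {
        private final CrossSplit split;
        private long pos = 0; // number of pairs already returned

        CrossReader(CrossSplit split) {
            this.split = split;
        }

        @Override
        public boolean hasNext() {
            return pos < split.getLength();
        }

        @Override
        public String[] next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            int nb = split.bLines.size();
            String a = split.aLines.get((int) (pos / nb)); // outer loop over A
            String b = split.bLines.get((int) (pos % nb)); // inner loop over B
            pos++;
            return new String[] {a, b};
        }

        long getPos() {
            return pos;
        }

        /** Fraction of pairs consumed; an empty split counts as finished. */
        float getProgress() {
            long len = split.getLength();
            return len == 0 ? 1.0f : (float) pos / len;
        }
    }

    /** Pairs every chunk of A (at most aChunkSize lines) with every B file. */
    static List<CrossSplit> getSplits(List<String> a, List<List<String>> bFiles, int aChunkSize) {
        if (aChunkSize <= 0) {
            throw new IllegalArgumentException("aChunkSize must be positive");
        }
        List<CrossSplit> splits = new ArrayList<>();
        for (int start = 0; start < a.size(); start += aChunkSize) {
            List<String> chunk = a.subList(start, Math.min(start + aChunkSize, a.size()));
            for (List<String> b : bFiles) {
                splits.add(new CrossSplit(chunk, b));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> fileA = Arrays.asList("a1", "a2", "a3");
        List<List<String>> fileBs = Arrays.asList(
                Arrays.asList("b1", "b2"), Arrays.asList("b3"));
        for (CrossSplit split : getSplits(fileA, fileBs, 2)) {
            CrossReader reader = new CrossReader(split);
            while (reader.hasNext()) {
                String[] pair = reader.next();
                System.out.println(pair[0] + " x " + pair[1]
                        + " (progress " + reader.getProgress() + ")");
            }
        }
    }
}
```

In a real job each split would hold file paths and byte offsets rather than the lines themselves, and the reader would stream FileB from HDFS. The in-memory lists here only keep the sketch self-contained.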