From: pmg
To: core-user@hadoop.apache.org
Subject: Re: multiple file input
Date: Thu, 18 Jun 2009 21:38:26 -0700 (PDT)
Message-ID: <24105398.post@talk.nabble.com>
In-Reply-To: <7760A9B5-1E46-4D1E-A0E6-45C5BB46BB94@apache.org>
References: <24095358.post@talk.nabble.com> <7760A9B5-1E46-4D1E-A0E6-45C5BB46BB94@apache.org>

Thanks, Owen. Are there any examples that I can look at?

owen.omalley wrote:
>
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>
>> Each line from FileA gets compared with every line from FileB1,
>> FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
>
> In the general case, I'd define an InputFormat that takes two
> directories, computes the input splits for each directory, and
> generates a new list of InputSplits that is the cross-product of the
> two lists. So instead of FileSplit, it would use a FileSplitPair that
> gives the FileSplit for dir1 and the FileSplit for dir2, and the record
> reader would return a TextPair with left and right records (i.e.
> lines). Clearly, you read the first line of split1 and cross it with
> each line from split2, then move to the second line of split1 and
> process each line from split2, etc.
>
> You'll need to ensure that you don't overwhelm the system with
> too many input splits (i.e. maps). Also don't forget that N^2/M grows
> much faster with the size of the input (N) than the M machines can
> handle in a fixed amount of time.
>
>> Two input directories
>>
>> 1. input1 directory with a single file of 600K records - FileA
>> 2. input2 directory segmented into different files with 2 million
>>    records - FileB1, FileB2, etc.
>
> In this particular case, it would be right to load all of FileA into
> memory and process the chunks of FileB/part-*. That would be much
> faster than re-reading the file over and over again, but
> otherwise it would be the same.
>
> -- Owen
>
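
For reference, the two approaches described in the quoted reply can be sketched outside Hadoop. This is a plain-Python illustration of the logic only, not Hadoop API code: the names `cross_splits`, `read_pairs`, and `cross_with_small_side` are hypothetical, and each "split" is modeled as a simple list of lines rather than a real FileSplit.

```python
from itertools import product

def cross_splits(splits_a, splits_b):
    """General case: pair every split of input A with every split of input B."""
    return list(product(splits_a, splits_b))

def read_pairs(split_pair):
    """Play the role of the record reader for one split pair:
    cross each line of the left split with each line of the right split."""
    left_split, right_split = split_pair
    for left in left_split:
        for right in right_split:
            yield (left, right)

def cross_with_small_side(small_lines, chunk):
    """Special case: the small input is already held in memory,
    so each chunk of the large input is read only once."""
    for right in chunk:
        for left in small_lines:
            yield (left, right)

# Toy data standing in for the two input directories.
file_a_splits = [["a1", "a2"], ["a3"]]
file_b_splits = [["b1"], ["b2", "b3"]]

split_pairs = cross_splits(file_a_splits, file_b_splits)
all_records = [rec for pair in split_pairs for rec in read_pairs(pair)]

# In-memory variant: flatten FileA once, then stream each FileB chunk.
file_a_lines = [line for split in file_a_splits for line in split]
in_memory_records = [
    rec for chunk in file_b_splits
    for rec in cross_with_small_side(file_a_lines, chunk)
]
```

Both variants produce the same set of (left, right) pairs. The number of split pairs is the product of the two split counts, which is why the quoted reply warns about generating too many maps.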