From: Jason Venner
Date: Mon, 14 Jul 2008 06:47:52 -0700
To: core-user@hadoop.apache.org
Subject: Re: Is it possible to input two different files under same mapper

This sounds like a good task for the Data Join code. If you can arrange
for all of your data to be stored in MapFiles, with the same key type and
the same partitioning (the same partitioner and the same number of
partitions), the join will go very smoothly.

Mori Bellamy wrote:
> Hey Amer,
> It sounds to me like you're going to have to write your own input
> format (or at least modify an existing one). Take a look here:
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html
>
> I'm not sure how you'd go about doing this, but I hope this helps you.
>
> (Also, have you considered preprocessing your input so that any
> arbitrary mapper can know whether or not it's looking at a line from
> the "large file"?)
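
(On that last point: a custom InputFormat may not be needed just to tell
the inputs apart. With the old mapred API, each map task's split is a
FileSplit, so the mapper can look at the path of the file it is reading.
A rough, untested sketch -- the class name, the "bigfile" test, and the
tab-delimited key are placeholders for whatever your data actually looks
like:)

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TaggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Placeholder: substitute however you recognize the large input file.
  private static final String LARGE_FILE_HINT = "bigfile";

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // The stock FileInputFormats hand each map task a FileSplit, so the
    // split's path identifies the file this record came from.  The
    // "map.input.file" job property carries the same information.
    Path source = ((FileSplit) reporter.getInputSplit()).getPath();
    String tag = source.getName().contains(LARGE_FILE_HINT) ? "L" : "S";

    // Tag every record with its origin so a downstream reducer can see the
    // large-file record and the small-file records for a key side by side.
    // Assumes the join key is the first tab-separated field of the line.
    String joinKey = line.toString().split("\t", 2)[0];
    output.collect(new Text(joinKey), new Text(tag + "\t" + line));
  }
}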
> On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:
>
>> Hi,
>> My requirement is to compare the contents of one very large file (GB
>> to TB size) with a bunch of smaller files (100s of MB to GB sizes).
>> Is there a way I can give the mapper the first file independently of
>> the remaining bunch?
>> Amer
>

--
Jason Venner
Attributor - Program the Web
Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
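
P.S. To make the Data Join suggestion at the top a little more concrete:
one way to read it is the map-side join support in
org.apache.hadoop.mapred.join, which is exactly what the MapFile /
identical-partitioning requirements buy you. A rough, untested sketch of
the job setup -- the "inner" join expression and the Text key type are
assumptions about your data, not something taken from it:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MapSideJoinJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinJob.class);
    conf.setJobName("large-vs-small compare");

    // Both inputs must already be sorted on the same key type and written
    // with the same partitioner and the same number of partitions.
    // MapFile directories are readable through SequenceFileInputFormat.
    Path large = new Path(args[0]);   // the very large data set
    Path small = new Path(args[1]);   // the smaller data set

    // "inner" keeps only keys present in both inputs; "outer" keeps all keys.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr",
             CompositeInputFormat.compose("inner",
                 SequenceFileInputFormat.class, large, small));

    // Each map() call then receives the shared key plus a TupleWritable
    // holding the matching values from every input, so the default
    // IdentityMapper (or your own comparison logic) sees both files at once.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);
    conf.setNumReduceTasks(0);        // the join happens on the map side
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}

If the inputs are not already sorted and identically partitioned, the
contrib datajoin classes (DataJoinMapperBase / DataJoinReducerBase) do the
same kind of join on the reduce side instead.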