From: Jason Venner
Date: Mon, 14 Jul 2008 06:47:52 -0700
To: core-user@hadoop.apache.org
Subject: Re: Is it possible to input two different files under same mapper

This sounds like a good task for the Data Join code. If you can arrange
for all of your data to be stored in MapFiles, with the same key type and
the same partitioning (the same partitioner and the same number of
partitions), the join will go very smoothly.

Mori Bellamy wrote:
> Hey Amer,
> It sounds to me like you're going to have to write your own input
> format (or at least modify an existing one). Take a look here:
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html
>
> I'm not sure how you'd go about doing this, but I hope this helps you.
>
> (Also, have you considered preprocessing your input so that any
> arbitrary mapper can know whether or not it's looking at a line from
> the "large file"?)
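
(On that last point: a custom InputFormat may not be needed just to tell
the inputs apart. With the old mapred API, each map task's split is a
FileSplit, so the mapper can look at the path of the file it is reading.
A rough, untested sketch -- the class name, the "bigfile" test, and the
tab-delimited key are placeholders for whatever your data actually looks
like:)

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TaggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Placeholder: substitute however you recognize the large input file.
  private static final String LARGE_FILE_HINT = "bigfile";

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // The stock FileInputFormats hand each map task a FileSplit, so the
    // split's path identifies the file this record came from.  The
    // "map.input.file" job property carries the same information.
    Path source = ((FileSplit) reporter.getInputSplit()).getPath();
    String tag = source.getName().contains(LARGE_FILE_HINT) ? "L" : "S";

    // Tag every record with its origin so a downstream reducer can see the
    // large-file record and the small-file records for a key side by side.
    // Assumes the join key is the first tab-separated field of the line.
    String joinKey = line.toString().split("\t", 2)[0];
    output.collect(new Text(joinKey), new Text(tag + "\t" + line));
  }
}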
> On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:
>
>> Hi,
>> My requirement is to compare the contents of one very large file (GB
>> to TB size) with a bunch of smaller files (100s of MB to GB sizes).
>> Is there a way I can give the mapper the first file independently of
>> the remaining bunch?
>> Amer
>

--
Jason Venner
Attributor - Program the Web
Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
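
P.S. To make the Data Join suggestion at the top a little more concrete:
one way to read it is the map-side join support in
org.apache.hadoop.mapred.join, which is exactly what the MapFile /
identical-partitioning requirements buy you. A rough, untested sketch of
the job setup -- the "inner" join expression and the Text key type are
assumptions about your data, not something taken from it:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MapSideJoinJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinJob.class);
    conf.setJobName("large-vs-small compare");

    // Both inputs must already be sorted on the same key type and written
    // with the same partitioner and the same number of partitions.
    // MapFile directories are readable through SequenceFileInputFormat.
    Path large = new Path(args[0]);   // the very large data set
    Path small = new Path(args[1]);   // the smaller data set

    // "inner" keeps only keys present in both inputs; "outer" keeps all keys.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr",
             CompositeInputFormat.compose("inner",
                 SequenceFileInputFormat.class, large, small));

    // Each map() call then receives the shared key plus a TupleWritable
    // holding the matching values from every input, so the default
    // IdentityMapper (or your own comparison logic) sees both files at once.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);
    conf.setNumReduceTasks(0);        // the join happens on the map side
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}

If the inputs are not already sorted and identically partitioned, the
contrib datajoin classes (DataJoinMapperBase / DataJoinReducerBase) do the
same kind of join on the reduce side instead.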