Subject: Re: joining two large files in hadoop
From: nitesh bhatia <niteshbhatia008@gmail.com>
To: core-user@hadoop.apache.org
Date: Mon, 6 Apr 2009 01:35:29 +0530

Hi,

Pig (a Hadoop subproject) can be a good option for this kind of problem. I
suggest you take a look.

--nitesh

On Sun, Apr 5, 2009 at 11:32 PM, jason hadoop wrote:
> Alpha chapters are available, and chapter 8 should be available in the
> alphas as soon as draft one gets back from technical review.
>
> On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup wrote:
>
>> jason hadoop wrote:
>>
>>> This is discussed in chapter 8 of my book.
>>>
>> What book? Is it out?
>>
>>> In short, if both data sets are:
>>>
>>>   - in the same key order,
>>>   - partitioned with the same partitioner, and
>>>   - read with the same input format (necessary for this simple
>>>     example only),
>>>
>>> then a map-side join will present all the key/value pairs of each
>>> partition to a single map task, in key order.
>>>
>>> Path dir1 == the directory containing the part-XXXXX files for data set 1
>>> Path dir2 == the directory containing the part-XXXXX files for data set 2
>>>
>>> Use CompositeInputFormat.compose to build the join statement, and set
>>> the InputFormat to CompositeInputFormat:
>>>
>>> conf.setInputFormat(CompositeInputFormat.class);
>>> String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
>>> conf.set("mapred.join.expr", joinStatement);
>>>
>>> The value class for your map method will be TupleWritable. In the map
>>> method:
>>>
>>>   - value.has(x) indicates whether the xth ordinal data set has a value
>>>     for this key
>>>   - value.get(x) returns the value from the xth ordinal data set for
>>>     this key
>>>   - value.size() returns the number of data sets in the join
>>>
>>> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>>>
>> The partitioner is normally used for the reduce step, but here it will be
>> used already at the mapper stage?
>>
>> Basically my files look like:
>>
>> id    matrix
>> id2   anothermatrix
>>
>> and
>>
>> id    vector1
>> id    vector2
>> id2   vector3
>>
>> id is just an integer, and there is only one matrix but many vectors tied
>> to the same id. I just want the values from both files that have the same
>> id. Do I need a partitioner in this case? What happens if the file is
>> split into blocks such that two blocks contain entries with the same key?
>>
>> Am I right that, using the example above, the mapper will be called three
>> times with:
>>
>> key=id   tuple=(matrix,vector1)
>> key=id   tuple=(matrix,vector2)
>> key=id2  tuple=(anothermatrix,vector3)
>>
>> cheers,
>> Christian
>>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar, Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
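
[For reference, a minimal, self-contained sketch of the map-side join
described above, using the old org.apache.hadoop.mapred join API this
thread refers to. The class name, paths, the SequenceFile input format,
and the IntWritable/Text key and value types are illustrative assumptions,
not details given in the thread. Note that CompositeInputFormat.compose
also takes the underlying InputFormat class as its second argument, which
the abbreviated snippet quoted above omits.]

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MatrixVectorJoin {

  // Each map call receives one TupleWritable per matching key combination;
  // ordinal 0 is dir1 and ordinal 1 is dir2, matching the compose() order.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<IntWritable, TupleWritable, IntWritable, Text> {
    public void map(IntWritable key, TupleWritable value,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {
      // An inner join should always populate both ordinals, but has()
      // makes the intent explicit (an "outer" join may leave gaps).
      if (value.has(0) && value.has(1)) {
        output.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MatrixVectorJoin.class);
    conf.setJobName("matrix-vector-join");

    Path dir1 = new Path(args[0]); // part-XXXXX files for data set 1
    Path dir2 = new Path(args[1]); // part-XXXXX files for data set 2

    // Build the join expression; SequenceFileInputFormat is an assumed
    // input format for the two (identically partitioned, sorted) data sets.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, dir1, dir2));

    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0); // the join happens map-side; no reduce needed
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}

[With mapred.join.expr set this way, Christian's example would invoke map
three times, once per (matrix, vector) pair sharing an id, exactly as in
the quoted explanation.]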