hadoop-common-user mailing list archives

From nitesh bhatia <niteshbhatia...@gmail.com>
Subject Re: joining two large files in hadoop
Date Sun, 05 Apr 2009 20:05:29 GMT
Hi
Pig (a Hadoop subproject) may be the best option for this kind of
problem. I suggest you take a look.

--nitesh



On Sun, Apr 5, 2009 at 11:32 PM, jason hadoop <jason.hadoop@gmail.com> wrote:
> Alpha chapters are available, and chapter 8 should be available in the alphas as
> soon as draft one gets back from technical review.
>
> On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup <soettrup@nbi.dk> wrote:
>
>> jason hadoop wrote:
>>
>>> This is discussed in chapter 8 of my book.
>>>
>>>
>> What book? Is it out?
>>
>>> In short, if both data sets are:
>>>
>>>   - in the same key order,
>>>   - partitioned with the same partitioner, and
>>>   - stored in the same input format (necessary for this simple example
>>>   only),
>>>
>>> then a map-side join will present all the key/value pairs of each
>>> partition to a single map task, in key order.
>>>
>>> Let Path dir1 be the directory containing the part-XXXXX files for data set 1
>>> and Path dir2 be the directory containing the part-XXXXX files for data set 2,
>>> and use CompositeInputFormat.compose to build the join statement.
>>>
>>> Set the InputFormat to CompositeInputFormat:
>>> conf.setInputFormat(CompositeInputFormat.class);
>>>
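>>> // compose() takes the shared InputFormat class before the paths;
>>> // KeyValueTextInputFormat is an assumption that fits tab-separated text.
>>> String joinStatement = CompositeInputFormat.compose("inner",
>>>     KeyValueTextInputFormat.class, dir1, dir2);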
>>> conf.set("mapred.join.expr", joinStatement);
>>>
>>> The value class for your map method will be TupleWritable.
>>> In the map method:
>>>
>>>   - value.has(x) indicates whether the Xth ordinal data set has a value for
>>>   this key
>>>   - value.get(x) returns the value from the Xth ordinal data set for this
>>>   key
>>>   - value.size() returns the number of data sets in the join
>>>
>>> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>>>
>>>
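The has/get/size accessors described above matter most when a data set may lack a key, as in an "outer" join. The following is a minimal plain-Java sketch of that behaviour; TupleSketch and outerJoin are hypothetical stand-ins written for illustration, not Hadoop's TupleWritable or CompositeInputFormat, and each data set is assumed to hold at most one value per key.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for TupleWritable: one slot per data set,
// with null marking "this data set has no value for the key".
class TupleSketch {
    private final String[] slots;

    TupleSketch(String... slots) { this.slots = slots; }

    boolean has(int i) { return slots[i] != null; } // does data set i have a value?
    String get(int i) { return slots[i]; }          // value from data set i
    int size() { return slots.length; }             // number of data sets in the join

    // Outer join of two key-ordered data sets (ordinal 0 and ordinal 1).
    static Map<String, TupleSketch> outerJoin(TreeMap<String, String> ds0,
                                              TreeMap<String, String> ds1) {
        TreeSet<String> keys = new TreeSet<>(ds0.keySet());
        keys.addAll(ds1.keySet());
        Map<String, TupleSketch> joined = new TreeMap<>();
        for (String key : keys) {
            joined.put(key, new TupleSketch(ds0.get(key), ds1.get(key)));
        }
        return joined;
    }

    public static void main(String[] args) {
        TreeMap<String, String> ds0 = new TreeMap<>();
        ds0.put("id", "matrix");
        ds0.put("id2", "anothermatrix");
        TreeMap<String, String> ds1 = new TreeMap<>();
        ds1.put("id", "vector1"); // no entry for id2 in the second data set

        for (Map.Entry<String, TupleSketch> e : outerJoin(ds0, ds1).entrySet()) {
            TupleSketch t = e.getValue();
            System.out.println(e.getKey() + " size=" + t.size()
                + " has(0)=" + t.has(0) + " has(1)=" + t.has(1));
        }
    }
}
```

With an inner join every slot is filled, so checking has(x) is only needed for join types that can leave a slot empty.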
>> The partitioner is normally used for the reduce step, but here it is
>> already used at the map stage?
>>
>> Basically my files look like:
>> id<tab>matrix
>> id2<tab>anothermatrix
>> and
>> id<tab>vector1
>> id<tab>vector2
>> id2<tab>vector3
>>
>> id is just an integer and there is only one matrix but many vectors tied to
>> the same id.
>> I just want the values from both files that have the same id.
>> Do I need a partitioner in this case? What happens if the file is split
>> into blocks such that two blocks
>> contain entries with the same key?
>>
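On the partitioner question: the partitioner is not run by the join itself. It matters because of how each data set was written beforehand: if both were produced with the same partitioner and the same number of partitions, a given key lands in the same part-XXXXX number in both directories. The sketch below shows the idea using the formula of Hadoop's default HashPartitioner; it hashes a Java String as a stand-in (Hadoop would hash the actual key type, so real partition numbers may differ), and the partition count of 4 is an assumption. Whether part files are further split into blocks at join time is worth checking against the CompositeInputFormat documentation for your Hadoop version.

```java
// Minimal sketch: which partition (part file) a key is assigned to.
class PartitionSketch {
    // Non-negative hash modulo the partition count, as in HashPartitioner.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int parts = 4; // assumed: both data sets were written with 4 partitions
        for (String key : new String[] {"id", "id2"}) {
            // The same key yields the same part number for either data set.
            System.out.println(key + " -> part-"
                + String.format("%05d", partitionFor(key, parts)));
        }
    }
}
```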
>> Am I right that, using the example above, the mapper will be called
>> three times with:
>> key=id   tuple=(matrix,vector1)
>> key=id   tuple=(matrix,vector2)
>> key=id2 tuple=(anothermatrix,vector3)
>>
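The three expected calls listed above correspond to an inner join that pairs every value of a key in one data set with every value of that key in the other. A minimal plain-Java sketch of such a merge over two key-ordered inputs follows; InnerJoinSketch is a hypothetical illustration of the semantics being asked about, not Hadoop code, and whether CompositeInputFormat behaves exactly this way for repeated keys should be confirmed with a small test job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InnerJoinSketch {
    // Merge two key-ordered lists of {key, value} pairs and emit one
    // "key<TAB>(left,right)" line per combination of values sharing a key.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // key only on the left: skipped by an inner join
            } else if (cmp > 0) {
                j++; // key only on the right: skipped by an inner join
            } else {
                String key = left.get(i)[0];
                int iEnd = i, jEnd = j;
                // Find the run of records with this key on each side.
                while (iEnd < left.size() && left.get(iEnd)[0].equals(key)) iEnd++;
                while (jEnd < right.size() && right.get(jEnd)[0].equals(key)) jEnd++;
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(key + "\t(" + left.get(a)[1] + "," + right.get(b)[1] + ")");
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> matrices = Arrays.asList(
            new String[] {"id", "matrix"},
            new String[] {"id2", "anothermatrix"});
        List<String[]> vectors = Arrays.asList(
            new String[] {"id", "vector1"},
            new String[] {"id", "vector2"},
            new String[] {"id2", "vector3"});
        for (String line : innerJoin(matrices, vectors)) {
            System.out.println(line);
        }
    }
}
```

For the sample files in the question, this sketch emits three records: two for id and one for id2.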
>> cheers,
>> Christian
>>
>>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
>



-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
