hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: joining two large files in hadoop
Date Sun, 05 Apr 2009 18:02:55 GMT
Alpha chapters are available, and chapter 8 should be available in the alphas as
soon as draft one gets back from technical review.

On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup <soettrup@nbi.dk> wrote:

> jason hadoop wrote:
>
>> This is discussed in chapter 8 of my book.
>>
>>
> What book? Is it out?
>
>> In short,
>> If both data sets are:
>>
>>   - in same key order
>>   - partitioned with the same partitioner,
>>   - the input format of each data set is the same, (necessary for this
>>   simple example only)
>>
>> A map-side join will present all the key/value pairs of each partition to a
>> single map task, in key order. Given
>> Path dir1 = the directory containing the part-XXXXX files for data set 1
>> Path dir2 = the directory containing the part-XXXXX files for data set 2
>> use CompositeInputFormat.compose to build the join statement.
>>
>> set the InputFormat to CompositeInputFormat,
>> conf.setInputFormat(CompositeInputFormat.class);
>>
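>> // compose also needs the data sets' InputFormat class; KeyValueTextInputFormat is assumed here
>> String joinStatement = CompositeInputFormat.compose("inner",
>>     KeyValueTextInputFormat.class, dir1, dir2);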
>> conf.set("mapred.join.expr", joinStatement);
>>
>> The value class for your map method will be TupleWritable.
>> In the map method,
>>
>>   - value.has(x) indicates if the Xth ordinal data set has a value for
>>   this key
>>   - value.get(x) returns the value from the Xth ordinal data set for this
>>   key
>>   - value.size() returns the number of data sets in the join
>>
>> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>>
>>
> The partitioner is normally used for the reduce step but here it will be
> used already at the mapper stage?
>
> Basically my files look like:
> id<tab>matrix
> id2<tab>anothermatrix
> and
> id<tab>vector1
> id<tab>vector2
> id2<tab>vector3
>
> id is just an integer and there is only one matrix but many vectors tied to
> the same id.
> I just want the values from both files that have the same id.
> Do I need a partitioner in this case? What happens if the file is split
> into blocks such that two blocks
> contain entries with the same key?
>
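
On the partitioner question: the partitioner matters here because of how the two data sets were written, not because the join job runs a reduce. A minimal plain-Java sketch (no Hadoop classes) of the arithmetic used by Hadoop's default HashPartitioner shows why equal keys land in the same part-XXXXX number in both data sets, provided both were produced with the same partitioner and the same number of partitions. The class name, the partition count of 4, and the use of String keys are assumptions for illustration only; a real job would hash its Writable key class, which can give different numbers. To my understanding the join also reads each part file as a whole rather than block by block, so all entries for a key in one file reach the same map task.

```java
class PartitionSketch {
    // Same arithmetic as Hadoop's HashPartitioner:
    // clear the sign bit of the hash, then take it modulo the partition count.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // hypothetical; must be identical for both data sets
        for (String id : new String[] {"1", "2", "17"}) {
            // The same id always maps to the same partition number,
            // whichever data set the record came from.
            System.out.println("id " + id + " -> part-"
                    + String.format("%05d", partitionFor(id, numPartitions)));
        }
    }
}
```
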
> Am I right that, using the example above, the mapper will
> be called three times with:
> key=id   tuple=(matrix,vector1)
> key=id   tuple=(matrix,vector2)
> key=id2 tuple=(anothermatrix,vector3)
>
> cheers,
> Christian
>
>
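
On the last question: as far as I understand the join framework, an inner join emits one tuple for every combination of values that share a key, which matches the three calls listed above. The following is a plain-Java sketch of that behaviour over two key-sorted inputs. It uses no Hadoop classes, and the class name, the row helper, and the string comparison of keys are assumptions made for illustration. In a real job the key order is defined by the key class, so integer ids stored as text would sort differently from numeric keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InnerJoinSketch {
    // Hypothetical helper: one {key, value} record.
    static String[] row(String key, String value) {
        return new String[] {key, value};
    }

    // Merge two inputs that are already sorted by key and emit one
    // "key (left,right)" line for every pairing of values with equal keys.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;            // key only on the left: dropped by an inner join
            } else if (cmp > 0) {
                j++;            // key only on the right: dropped as well
            } else {
                String key = left.get(i)[0];
                int iEnd = i, jEnd = j;
                while (iEnd < left.size() && left.get(iEnd)[0].equals(key)) iEnd++;
                while (jEnd < right.size() && right.get(jEnd)[0].equals(key)) jEnd++;
                // Cross product of all values that share this key.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(key + " (" + left.get(a)[1] + "," + right.get(b)[1] + ")");
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> matrices = Arrays.asList(
                row("id", "matrix"), row("id2", "anothermatrix"));
        List<String[]> vectors = Arrays.asList(
                row("id", "vector1"), row("id", "vector2"), row("id2", "vector3"));
        for (String line : innerJoin(matrices, vectors)) {
            System.out.println(line);
        }
    }
}
```

With the two sample files from the question, the sketch produces three lines: id paired with vector1, id paired with vector2, and id2 paired with vector3.
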


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
