hadoop-common-user mailing list archives

From Christian Ulrik Søttrup <soett...@nbi.dk>
Subject Re: joining two large files in hadoop
Date Sun, 05 Apr 2009 14:43:04 GMT
jason hadoop wrote:
> This is discussed in chapter 8 of my book.
>   
What book? Is it out?

> In short,
> If both data sets are:
>
>    - in same key order
>    - partitioned with the same partitioner,
>    - the input format of each data set is the same, (necessary for this
>    simple example only)
>
> A map-side join will present all the key/value pairs of each partition to a
> single map task, in key order.
> Let Path dir1 be the directory containing the part-XXXXX files for data set 1
> and Path dir2 the directory containing the part-XXXXX files for data set 2,
> and use CompositeInputFormat.compose to build the join statement.
>
> set the InputFormat to CompositeInputFormat,
> conf.setInputFormat(CompositeInputFormat.class);
>
> String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
> conf.set("mapred.join.expr", joinStatement);
>
> The value class for your map method will be TupleWritable.
> In the map method,
>
>    - value.has(x) indicates if the Xth ordinal data set has a value for this
>    key
>    - value.get(x) returns the value from the Xth ordinal data set for this
>    key
>    - value.size() returns the number of data sets in the join
>
> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>   
The partitioner is normally used for the reduce step, but here it is
already used at the map stage?

Basically my files look like:
id<tab>matrix
id2<tab>anothermatrix
and
id<tab>vector1
id<tab>vector2
id2<tab>vector3

id is just an integer, and there is only one matrix per id but many
vectors tied to the same id.
I just want the values from both files that have the same id.
Do I need a partitioner in this case? What happens if a file is split
into blocks such that two blocks contain entries with the same key?
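
As far as I understand the partitioning requirement in the quoted text, both data sets must have been written with the same partition function and the same number of partitions, so that a given id lands in the same part-XXXXX index on both sides. Here is a minimal sketch of that idea in plain Java, assuming integer ids and the masking formula used by Hadoop's default HashPartitioner. The class and method names are hypothetical and this is not Hadoop code:

```java
// Plain-Java illustration; PartitionSketch and partitionFor are hypothetical names.
class PartitionSketch {

    // Maps an integer id to a partition index in [0, numPartitions).
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(int id, int numPartitions) {
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // assumed to be the same for both data sets
        for (int id : new int[] {7, 8, 42}) {
            // The same id always yields the same index, whichever data set it came from.
            System.out.println("id " + id + " -> part index " + partitionFor(id, numPartitions));
        }
    }
}
```

If the two data sets were written with different partition counts, the same id could end up at different part indexes and the per-partition pairing would no longer line up.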

Am I right that, for the example above, the mapper will be called
three times with:
key=id   tuple=(matrix,vector1)
key=id   tuple=(matrix,vector2)
key=id2 tuple=(anothermatrix,vector3)
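
To make the behaviour I am expecting concrete, here is a small plain-Java sketch of an inner join over two inputs that are already sorted by id, emitting one result per matching pair. It only illustrates my understanding of the semantics; it does not use Hadoop, and the names InnerJoinSketch and Rec and the ids 1 and 2 are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of an inner join over two inputs sorted by id.
// Not Hadoop code; InnerJoinSketch and Rec are hypothetical names.
class InnerJoinSketch {

    // One record of a tab-separated file: an integer id and its payload.
    static final class Rec {
        final int id;
        final String value;
        Rec(int id, String value) { this.id = id; this.value = value; }
    }

    // Returns one line per (left, right) pair that shares an id.
    // Both lists must already be sorted by id.
    static List<String> innerJoin(List<Rec> left, List<Rec> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int l = left.get(i).id;
            int r = right.get(j).id;
            if (l < r) {
                i++; // id present only on the left: dropped by an inner join
            } else if (l > r) {
                j++; // id present only on the right: dropped as well
            } else {
                // Find the end of the run of equal ids on each side.
                int iEnd = i;
                while (iEnd < left.size() && left.get(iEnd).id == l) iEnd++;
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd).id == l) jEnd++;
                // Emit every left/right combination for this id.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add("key=" + l + " tuple=(" + left.get(a).value
                                + "," + right.get(b).value + ")");
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> matrices = Arrays.asList(new Rec(1, "matrix"), new Rec(2, "anothermatrix"));
        List<Rec> vectors = Arrays.asList(
                new Rec(1, "vector1"), new Rec(1, "vector2"), new Rec(2, "vector3"));
        for (String line : innerJoin(matrices, vectors)) {
            System.out.println(line);
        }
    }
}
```

With one matrix and two vectors for the first id and one of each for the second, this produces three results, which is the three-call pattern I describe above.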

cheers,
Christian

