hadoop-common-user mailing list archives

From Christian Ulrik Søttrup <soett...@nbi.dk>
Subject Re: joining two large files in hadoop
Date Sun, 05 Apr 2009 14:43:04 GMT
jason hadoop wrote:
> This is discussed in chapter 8 of my book.
What book? Is it out?

> In short,
> If both data sets are:
>    - in same key order
>    - partitioned with the same partitioner,
>    - the input format of each data set is the same, (necessary for this
>    simple example only)
> A map side join will present all the key value pairs of each partition, to a
> single map task, in key order,
> Path dir1 == the directory containing the part-XXXXX files for data set 1
> Path dir2 == The directory containing the part-XXXXX files for data set 2
> and use CompositeInputFormat.compose to build the join statement
> set the InputFormat to CompositeInputFormat,
> conf.setInputFormat(CompositeInputFormat.class);
> String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
> conf.set("mapred.join.expr", joinStatement);
> The value class for your map method will be TupleWritable
> In the map method,
>    - value.has(x) indicates if the Xth ordinal data set has a value for this
>    key
>    - value.get(x) returns the value from the Xth ordinal data set for this
>    key
>    - value.size() returns the number of data sets in the join
> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
The partitioner is normally used for the reduce step, but here it is 
already used at the map stage?

Basically my files look like:

id is just an integer and there is only one matrix but many vectors tied 
to the same id.
I just want the values from both files that have the same id.
Do I need a partitioner in this case? What happens if the file is split 
into blocks such that two blocks
contain entries with the same key?
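
To make the partitioner requirement concrete, here is a minimal sketch in plain Java (no Hadoop classes). It assumes both data sets were written by jobs using the default hash partitioning with the same number of reduce tasks; the formula is the one I believe HashPartitioner uses, and the count of 4 partitions is made up for illustration. Under those assumptions a given id ends up in the same-numbered part file in both directories, which is what lets the join pair part files one to one.

```java
public class PartitionSketch {
    /**
     * Hash-partitioning formula (assumed to match Hadoop's default HashPartitioner):
     * clear the sign bit of the hash, then take it modulo the partition count.
     */
    public static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;          // hypothetical: both jobs ran with 4 reduce tasks
        int[] ids = {7, 42, 1001};      // hypothetical ids
        for (int id : ids) {
            // The same id maps to the same partition index for data set 1 and data set 2.
            int p = partitionFor(Integer.valueOf(id), numPartitions);
            System.out.printf("id=%d -> part-%05d in both data sets%n", id, p);
        }
    }
}
```

If the two data sets were written with different partitioners or different reduce counts, this correspondence would not hold and the part files could not be paired up.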

Am I right that, using the example above, the mapper will be called 
three times with:
key=id   tuple=(matrix,vector1)
key=id   tuple=(matrix,vector2)
key=id2 tuple=(anothermatrix,vector3)
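
The expected behaviour can be simulated without Hadoop. The sketch below is a plain-Java merge join over two lists already sorted by id, and it assumes that an inner join emits one tuple per combination of values sharing a key (my reading of the join package documentation, not something confirmed in this thread). The Rec class, the sample ids 1 and 2, and the string labels are all hypothetical stand-ins for the real matrix and vector records.

```java
import java.util.ArrayList;
import java.util.List;

public class InnerJoinSketch {
    /** One record: an integer id plus a label standing in for a matrix or vector. */
    public static final class Rec {
        final int id;
        final String value;
        public Rec(int id, String value) { this.id = id; this.value = value; }
    }

    /** Inner-joins two lists sorted by id; returns one line per simulated map() call. */
    public static List<String> innerJoin(List<Rec> left, List<Rec> right) {
        List<String> calls = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int l = left.get(i).id, r = right.get(j).id;
            if (l < r) {
                i++;                      // id only in the left input: skipped by an inner join
            } else if (l > r) {
                j++;                      // id only in the right input: skipped as well
            } else {
                // Find the run of records with this id on each side.
                int iEnd = i, jEnd = j;
                while (iEnd < left.size() && left.get(iEnd).id == l) iEnd++;
                while (jEnd < right.size() && right.get(jEnd).id == l) jEnd++;
                // Emit every left/right combination for the shared id.
                for (int a = i; a < iEnd; a++)
                    for (int b = j; b < jEnd; b++)
                        calls.add("key=" + l + " tuple=(" + left.get(a).value + "," + right.get(b).value + ")");
                i = iEnd;
                j = jEnd;
            }
        }
        return calls;
    }

    /** Hypothetical sample data: one matrix per id, several vectors per id. */
    public static List<String> demo() {
        List<Rec> matrices = new ArrayList<>();
        matrices.add(new Rec(1, "matrix"));
        matrices.add(new Rec(2, "anothermatrix"));
        List<Rec> vectors = new ArrayList<>();
        vectors.add(new Rec(1, "vector1"));
        vectors.add(new Rec(1, "vector2"));
        vectors.add(new Rec(2, "vector3"));
        return innerJoin(matrices, vectors);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

With this sample data the simulation produces three calls, one per vector, each paired with the matrix for its id.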

