hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject one-to-many Map Side Join without reducer
Date Thu, 09 Jun 2011 21:09:45 GMT
Hi,

I have two datasets. Dataset 1 has the format:

MasterKey1    SubKey1    SubKey2    SubKey3
MasterKey2    SubKey4    SubKey5    SubKey6
....


dataset 2 has the format:

SubKey1    Value1
SubKey2    Value2
...

I want to do a one-to-many join on SubKey, and the final goal is to
produce output like:

MasterKey1    Value1    Value2    Value3
MasterKey2    Value4    Value5    Value6
...


After studying and experimenting with some example code, I understand
that it is doable if I transform the first dataset into

SubKey1    MasterKey1
SubKey2    MasterKey1
SubKey3    MasterKey1
SubKey4    MasterKey2
SubKey5    MasterKey2
SubKey6    MasterKey2

and then run an inner join with dataset 2 on SubKey. After that I would
probably need a reducer that does a secondary sort on MasterKey to get
the result. However, the reducer is still the bottleneck if each
MasterKey has many SubKeys.

My question is: suppose dataset 2 contains all the SubKeys and is never
split. Is it possible to join the keys of dataset 2 with the multiple
values in each dataset 1 record on the mapper side? Any hint is highly
appreciated.
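
To make the question concrete, here is a minimal plain-Java sketch of
the per-record lookup I have in mind, with no Hadoop classes involved.
It assumes dataset 2 is small enough to load fully into memory in every
map task (an assumption, not something I have verified), that fields
are whitespace-separated, and that SubKeys without a match are simply
dropped. The names MapSideJoinSketch, loadLookup and joinLine are
hypothetical, made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapSideJoinSketch {

    // Build the SubKey -> Value table from dataset 2 lines ("SubKey Value").
    // In a real job this would run once per map task, before any record is
    // processed; here it is a plain method (assumption: the table fits in memory).
    static Map<String, String> loadLookup(List<String> dataset2Lines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : dataset2Lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    // Join one dataset 1 record ("MasterKey SubKey SubKey ...") against the
    // table. The output is the MasterKey followed by the matched Values,
    // tab-separated, so no reducer is needed to regroup by MasterKey.
    static String joinLine(String dataset1Line, Map<String, String> lookup) {
        String[] fields = dataset1Line.trim().split("\\s+");
        StringBuilder out = new StringBuilder(fields[0]);
        for (int i = 1; i < fields.length; i++) {
            String value = lookup.get(fields[i]);
            if (value != null) { // inner-join behaviour: skip unmatched SubKeys
                out.append('\t').append(value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> lookup = loadLookup(Arrays.asList(
                "SubKey1    Value1",
                "SubKey2    Value2",
                "SubKey3    Value3"));
        System.out.println(
                joinLine("MasterKey1    SubKey1    SubKey2    SubKey3", lookup));
    }
}
```

Because dataset 1 stays in its original one-line-per-MasterKey form,
each input record already carries everything needed for its output
line. Whether this is workable in a real job depends on how dataset 2
can be made available to every mapper, which is the part I am unsure
about.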

Shi


