hadoop-common-user mailing list archives

From Jim Donofrio <donofrio...@gmail.com>
Subject cannot use a map side join to merge the output of multiple map side joins
Date Sat, 05 May 2012 15:50:08 GMT
I am trying to use a map side join to merge the output of multiple map 
side joins. This fails because of the code below, from 
JobClient.writeOldSplits, which reorders the splits from largest to 
smallest. Why is that done? Is it so that the largest split, which will 
take the longest, gets processed first?

As a result, each map side join fails to name its part-* files with the 
same number as the incoming partition. Files named part-00000 that go 
into one first level map side join are written to part-00010, while 
another first level map side join sends its part-00000 files to 
part-00005. The second level map side join then does not get the input 
splits in partitioner order from each first level map side join output.
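
To illustrate the renumbering described above, here is a minimal sketch in plain Java. It does not use any Hadoop classes: the class SplitOrderDemo, the method partitionForTask, and the split lengths are all hypothetical. It assumes that a map task's output file number follows the split's position after the largest-first sort.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model only: split lengths are indexed by their original partition number. */
final class SplitOrderDemo {

    /**
     * For each task index (position after sorting), returns the original
     * partition number that lands there when splits are sorted largest first.
     */
    static int[] partitionForTask(final long[] splitLengths) {
        Integer[] idx = new Integer[splitLengths.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Same ordering as a largest-first comparator: bigger length sorts earlier.
        Arrays.sort(idx, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Long.compare(splitLengths[b], splitLengths[a]);
            }
        });
        int[] out = new int[idx.length];
        for (int i = 0; i < idx.length; i++) {
            out[i] = idx[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] lengths = {10L, 50L, 30L}; // hypothetical sizes of part-00000..part-00002
        int[] order = partitionForTask(lengths);
        for (int task = 0; task < order.length; task++) {
            System.out.printf("input part-%05d -> output part-%05d%n", order[task], task);
        }
    }
}
```

With unequal file sizes, the input numbered 00000 no longer maps to output 00000, and two first level joins with different size distributions end up with different renumberings.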

I can think of only 2 fixes: add a conf property that allows turning off 
the sorting below, OR extend FileOutputCommitter to rename the outputs of 
the first level map side join to merge_part- plus the original partition 
number. Any other solutions?
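
A rough sketch of the first idea, in plain Java rather than against the real JobClient: the sort runs only when a flag allows it. The class OptionalSplitSort and the sortBySize parameter are hypothetical, and nothing here implies that such a Hadoop property exists.

```java
import java.util.Arrays;
import java.util.Collections;

/** Toy model only: splits are represented by their lengths. */
final class OptionalSplitSort {

    /**
     * Returns the split lengths in submission order. When sortBySize is true
     * they are sorted largest first; otherwise the incoming (partition) order
     * is preserved. sortBySize stands in for a hypothetical conf property.
     */
    static long[] orderSplits(long[] lengths, boolean sortBySize) {
        Long[] boxed = new Long[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            boxed[i] = lengths[i];
        }
        if (sortBySize) {
            // Descending order, so the biggest split goes first.
            Arrays.sort(boxed, Collections.reverseOrder());
        }
        long[] out = new long[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            out[i] = boxed[i];
        }
        return out;
    }
}
```

With the flag off, split i stays at position i, so a job whose output numbering follows split position would keep the incoming partition numbers.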

     // sort the splits into order based on size, so that the biggest
     // go first
     Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
       public int compare(org.apache.hadoop.mapred.InputSplit a,
                          org.apache.hadoop.mapred.InputSplit b) {
         try {
           long left = a.getLength();
           long right = b.getLength();
           if (left == right) {
             return 0;
           } else if (left < right) {
             return 1;
           } else {
             return -1;
