hadoop-mapreduce-dev mailing list archives

From "Fabrice Huet (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1987) No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
Date Fri, 30 Jul 2010 16:10:19 GMT
No verification on sample size can lead to incorrect partition file and "Split points are out
of order" IOException
-------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1987
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
            Reporter: Fabrice Huet


If I understand correctly, the partition file should contain distinct values in increasing
order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the number of
reduce tasks, the index k may take the same value on consecutive iterations. As a side effect
of the while loop, the values written are then interleaved instead of increasing.

Example: taking 100 samples for a job with 120 reducers produces the following values of k
and last after the while loop:
    while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
      ++k;
    }
    // display values here

    k 68    last 67    // correct
    k 69    last 68    // correct
    k 68    last 69    // incorrect, samples[68] has already been written
    k 69    last 68    // incorrect, samples[69] has already been written
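
The k/last values listed above can be reproduced with a small standalone sketch. Only the while loop is quoted in this report; the enclosing for loop, the stepSize computation and the use of Math.round are assumptions about what surrounds it in writePartitionFile, and the class and method names (SplitPointDemo, writtenIndices, strictlyIncreasing) are hypothetical. Plain ints stand in for the sampled keys, and == stands in for comparator.compare(...) == 0.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPointDemo {
    // Returns the sample indices in the order they would be written.
    // ASSUMPTION: the for loop, stepSize and Math.round are a reconstruction;
    // only the inner while loop comes from the report.
    static List<Integer> writtenIndices(int[] sortedSamples, int numPartitions) {
        List<Integer> written = new ArrayList<>();
        float stepSize = sortedSamples.length / (float) numPartitions;
        int last = -1;
        for (int i = 1; i < numPartitions; ++i) {
            int k = Math.round(stepSize * i);
            // Skips forward only while the previously written sample equals samples[k].
            while (last >= k && sortedSamples[last] == sortedSamples[k]) {
                ++k;
            }
            written.add(k); // stands in for writing samples[k] to the partition file
            last = k;
        }
        return written;
    }

    // True when every index is larger than the one written before it.
    static boolean strictlyIncreasing(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i) <= xs.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] samples = new int[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i; // 100 distinct, already sorted samples
        }
        List<Integer> written = writtenIndices(samples, 120);
        System.out.println("strictly increasing: " + strictlyIncreasing(written));
        System.out.println("iterations 80..83 write: " + written.subList(79, 83));
    }
}
```

Under these assumptions, once last is ahead of k and the two samples differ, the while loop exits immediately and an earlier sample is written again, which is where the ordering breaks.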

The partition file is then treated as corrupted when the TotalOrderPartitioner reads it:
   throw new IOException("Split points are out of order");

It seems to me that the number of partitions should be
min(samples.length, job.getNumReduceTasks(), number of distinct values in sample).
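
As a rough illustration of the cap suggested here, the sketch below computes the minimum of the three quantities for an already sorted sample. It is not a patch against InputSampler: the class and method names (PartitionCap, countDistinct, cappedPartitions) are hypothetical, and plain ints stand in for the sampled keys and the job's comparator.

```java
class PartitionCap {
    // Counts distinct values in an array that is already sorted.
    static int countDistinct(int[] sortedSamples) {
        if (sortedSamples.length == 0) {
            return 0;
        }
        int distinct = 1;
        for (int i = 1; i < sortedSamples.length; i++) {
            if (sortedSamples[i] != sortedSamples[i - 1]) {
                distinct++;
            }
        }
        return distinct;
    }

    // min(samples.length, number of reduce tasks, number of distinct values in sample)
    static int cappedPartitions(int[] sortedSamples, int numReduceTasks) {
        int cap = Math.min(sortedSamples.length, numReduceTasks);
        return Math.min(cap, countDistinct(sortedSamples));
    }

    public static void main(String[] args) {
        int[] samples = new int[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i; // 100 distinct samples, as in the example
        }
        System.out.println(cappedPartitions(samples, 120));
    }
}
```

For the example in this report (100 distinct samples, 120 reducers) the cap evaluates to 100, so the step between split points would no longer drop below one sample.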



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.