Mailing-List: contact mapreduce-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-dev@hadoop.apache.org
Message-ID: <11244847.88931280506219437.JavaMail.jira@thor>
Date: Fri, 30 Jul 2010 12:10:19 -0400 (EDT)
From: "Fabrice Huet (JIRA)" <jira@apache.org>
To: mapreduce-dev@hadoop.apache.org
Subject: [jira] Created: (MAPREDUCE-1987) No verification on sample size can
 lead to incorrect partition file and "Split points are out of order"
 IOException
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
-------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1987
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
            Reporter: Fabrice Huet


If I understand correctly, the partition file should contain distinct values in increasing order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the number of reduce tasks, the index k may take the same value in consecutive iterations. As a side effect of the while loop, the values written will be interleaved.

Example: taking 100 samples for a job with 120 reducers produces the following values of k and last after the while loop:
    while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
      ++k;
    }
    // display values here

    k 68
    last 67    // correct

    k 69
    last 68    // correct

    k 68
    last 69    // incorrect, samples[68] has already been written

    k 69
    last 68    // incorrect, samples[69] has already been written
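
The interleaving can be reproduced without a cluster. The following is only a stand-alone sketch, not the Hadoop source: it uses plain int keys instead of a comparator, and the way k is derived from the loop counter (a float stepSize rounded with Math.round) is an assumption about the code surrounding the while loop quoted above. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone simulation, not the Hadoop source. The stepSize/Math.round
// stepping is an assumption; only the while loop is quoted in this report.
final class SplitIndexDemo {

    // Builds n sorted, distinct sample keys: 0, 1, ..., n-1.
    static int[] distinctSamples(int n) {
        int[] samples = new int[n];
        for (int i = 0; i < n; ++i) {
            samples[i] = i;
        }
        return samples;
    }

    // Returns the sample indices picked as split points, in write order.
    static List<Integer> chooseIndices(int[] samples, int numPartitions) {
        List<Integer> chosen = new ArrayList<>();
        float stepSize = samples.length / (float) numPartitions; // assumed
        int last = -1;
        for (int i = 1; i < numPartitions; ++i) {
            int k = Math.round(stepSize * i); // assumed
            // Loop quoted in the report: skip forward while the keys compare equal.
            while (last >= k && samples[last] == samples[k]) {
                ++k;
            }
            chosen.add(k);
            last = k;
        }
        return chosen;
    }

    // True if every index is strictly greater than the one before it.
    static boolean isStrictlyIncreasing(List<Integer> indices) {
        for (int i = 1; i < indices.size(); ++i) {
            if (indices.get(i) <= indices.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> chosen = chooseIndices(distinctSamples(100), 120);
        System.out.println("indices: " + chosen);
        System.out.println("strictly increasing: " + isStrictlyIncreasing(chosen));
    }
}
```

Under these assumptions, 100 samples with 120 partitions yield an index sequence that steps backwards, while 100 samples with 10 partitions stay in increasing order.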

The partition file is then treated as corrupt when the TotalOrderPartitioner reads it:
   throw new IOException("Split points are out of order");
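
For illustration, a check of this kind could look like the sketch below. It is not the TotalOrderPartitioner source; the class and method names are hypothetical, and it assumes that "in order" means each split point compares strictly greater than the previous one, which would also reject duplicates.

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Hadoop: validates a list of split points.
final class SplitPointCheck {

    // True if each split point is strictly greater than its predecessor.
    static <K> boolean isStrictlyIncreasing(List<K> splitPoints,
                                            Comparator<? super K> comparator) {
        for (int i = 0; i + 1 < splitPoints.size(); ++i) {
            if (comparator.compare(splitPoints.get(i), splitPoints.get(i + 1)) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Fails with the message quoted in this report when the order is violated.
    static <K> void verifyOrdered(List<K> splitPoints,
                                  Comparator<? super K> comparator) throws IOException {
        if (!isStrictlyIncreasing(splitPoints, comparator)) {
            throw new IOException("Split points are out of order");
        }
    }
}
```

With such a check, a partition file in which a key appears twice or a later key sorts before an earlier one is rejected as soon as it is loaded.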

It seems to me that the number of partitions should be min(samples.length, job.getNumReduceTasks(), number of distinct values in the sample).
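
A minimal sketch of that bound, using plain int keys for brevity. The class and method names are hypothetical and nothing here is taken from InputSampler; it only computes the proposed partition count and does not write a partition file.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: computes the proposed upper bound
// on the number of partitions for a given sample.
final class PartitionCountSketch {

    // Counts distinct values in an array that is already sorted.
    static int countDistinct(int[] sorted) {
        if (sorted.length == 0) {
            return 0;
        }
        int distinct = 1;
        for (int i = 1; i < sorted.length; ++i) {
            if (sorted[i] != sorted[i - 1]) {
                ++distinct;
            }
        }
        return distinct;
    }

    // min(samples.length, numReduceTasks, number of distinct sample values).
    static int boundedPartitions(int[] samples, int numReduceTasks) {
        int[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        return Math.min(Math.min(sorted.length, numReduceTasks),
                        countDistinct(sorted));
    }
}
```

The number of distinct values can never exceed samples.length, so the first term is redundant in practice; it is kept here only to mirror the expression above.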

