From: "Fabrice Huet (JIRA)"
To: mapreduce-dev@hadoop.apache.org
Date: Fri, 30 Jul 2010 12:10:19 -0400 (EDT)
Subject: [jira] Created: (MAPREDUCE-1987) No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException

No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
-------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1987
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
            Reporter: Fabrice Huet

If I understand correctly, the partition file should contain distinct values in increasing order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the number of reduce tasks, the index k might keep the same value from one iteration to the next. As a side effect of the while loop, values will be interleaved.

Example: taking 100 samples on a 120-reducer job will produce the following values of k and last after the while loop:

    while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
      ++k;
    }
    //display values here

    k 68 last 67 //correct
    k 69 last 68 //correct
    k 68 last 69 //incorrect, samples[68] has already been written
    k 69 last 68 //incorrect, samples[69] has already been written

The partition file will be considered corrupted when it is read by the TotalOrderPartitioner:

    throw new IOException("Split points are out of order");

It seems to me that the number of partitions should be
min(samples.length, job.getNumReduceTasks(), number of distinct values in sample)
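
To make the suggestion concrete, here is a minimal standalone sketch, not the Hadoop code and not a proposed patch. It works on a sorted int[] instead of keys with a RawComparator. The class SplitPointSketch and its methods are hypothetical names; selectAsReported assumes the selection loop computes k as Math.round(stepSize * i) with stepSize = samples.length / numPartitions, and selectDistinct swaps in integer-arithmetic index selection over the de-duplicated samples, capped at min(number of reduce tasks, number of distinct values), which is already no larger than samples.length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and structure are assumptions.
final class SplitPointSketch {
    private SplitPointSketch() {}

    // Selection loop as described in the report, on plain ints.
    // Assumption: k is derived from a float step size and rounded.
    static List<Integer> selectAsReported(int[] sorted, int numPartitions) {
        List<Integer> out = new ArrayList<>();
        float stepSize = sorted.length / (float) numPartitions;
        int last = -1;
        for (int i = 1; i < numPartitions; ++i) {
            int k = Math.round(stepSize * i);
            // Skips forward only while the candidate equals the last written value.
            while (last >= k && sorted[last] == sorted[k]) {
                ++k;
            }
            out.add(sorted[k]);
            last = k;
        }
        return out;
    }

    // Sketch of the suggested cap: never ask for more partitions than
    // there are distinct sample values (or reduce tasks).
    static List<Integer> selectDistinct(int[] sorted, int numReduceTasks) {
        // Collapse runs of equal values; the input is assumed to be sorted.
        List<Integer> distinct = new ArrayList<>();
        for (int v : sorted) {
            if (distinct.isEmpty() || distinct.get(distinct.size() - 1) != v) {
                distinct.add(v);
            }
        }
        int numPartitions = Math.min(numReduceTasks, distinct.size());
        List<Integer> out = new ArrayList<>();
        for (int i = 1; i < numPartitions; ++i) {
            // distinct.size() >= numPartitions, so consecutive indices differ by at least 1.
            int k = (int) ((long) i * distinct.size() / numPartitions);
            out.add(distinct.get(k));
        }
        return out;
    }

    // True when every value is greater than the one before it.
    static boolean isStrictlyIncreasing(List<Integer> values) {
        for (int i = 1; i < values.size(); ++i) {
            if (values.get(i - 1) >= values.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

Calling selectAsReported with 100 distinct sorted values and 120 partitions and passing the result to isStrictlyIncreasing reproduces the symptom in this simplified setting. Note that selectDistinct may return fewer split points than the configured number of reduce tasks minus one, so a real fix would presumably also have to lower the job's reducer count to match, or reject the configuration outright.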