From: "Runping Qi (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Thu, 11 Oct 2007 06:20:51 -0700 (PDT)
Subject: [jira] Created: (HADOOP-2032) distcp split generation does not work correctly

distcp split generation does not work correctly
-----------------------------------------------

                 Key: HADOOP-2032
                 URL: https://issues.apache.org/jira/browse/HADOOP-2032
             Project: Hadoop
          Issue Type: Bug
          Components: util
            Reporter: Runping Qi

With the current implementation, distcp will always
assign multiple files to a single mapper to copy, no matter how large the files are.

This is because the CopyFiles class uses a sequence file to store the list of files to be copied, one record per file. CopyFiles correctly generates one split per record in the sequence file. However, because of the way the sequence file record reader works, the minimum unit for a split is the segment between two sync marks in the sequence file. As a result, some mappers get zero records (zero files to copy) even though their split lengths are non-zero, while other mappers get multiple records (multiple files to copy) from their split and beyond, up to the next sync mark.

When CopyFiles creates the sequence file, it does try to place a sync mark between splittable segments by calling the sync() function of the sequence file writer. Unfortunately, sync() is a no-op for files that are not block compressed. After I changed the compression type of the sequence file to block compression, mappers got the correct records from their splits.

So a simple fix is to change the compression type to CompressionType.BLOCK:

{code}
// create src list
SequenceFile.Writer writer = SequenceFile.createWriter(
    jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
    LongWritable.class, FilePair.class,
    SequenceFile.CompressionType.BLOCK);
{code}
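
To illustrate the behavior described above, here is a small self-contained sketch. It is a simplified model, not Hadoop's actual record reader code: it assumes that a record is read by whichever split contains the sync point that opens the record's segment, and that the file header acts as the first sync point. The class name SyncSplitModel, its methods, and the offsets are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

// Simplified model (an assumption, not Hadoop code): a record is read by the
// split whose byte range contains the sync point that opens its segment.
class SyncSplitModel {

    /**
     * Counts how many records each split would read under the model.
     *
     * @param recordOffsets start offset of each record, ascending
     * @param syncBefore    syncBefore[i] is true if a sync mark sits right before record i
     * @param splitStarts   start offset of each split, ascending, first one is 0
     */
    public static int[] recordsPerSplit(long[] recordOffsets, boolean[] syncBefore, long[] splitStarts) {
        int[] counts = new int[splitStarts.length];
        long segmentStart = 0; // the file header is treated as the first sync point
        for (int i = 0; i < recordOffsets.length; i++) {
            if (syncBefore[i]) {
                segmentStart = recordOffsets[i]; // a new segment opens here
            }
            counts[splitIndex(splitStarts, segmentStart)]++;
        }
        return counts;
    }

    /** Returns the index of the last split that starts at or before the given offset. */
    public static int splitIndex(long[] splitStarts, long offset) {
        int idx = 0;
        for (int s = 0; s < splitStarts.length; s++) {
            if (splitStarts[s] <= offset) {
                idx = s;
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100, 200, 300, 400, 500};
        long[] splits = offsets; // one split per record, as distcp intends

        // Sync marks only before records 0 and 3: splits and segments disagree.
        boolean[] sparse = {true, false, false, true, false, false};
        // A sync mark before every record: each split gets exactly its own record.
        boolean[] everyRecord = {true, true, true, true, true, true};

        System.out.println(Arrays.toString(recordsPerSplit(offsets, sparse, splits)));      // [3, 0, 0, 3, 0, 0]
        System.out.println(Arrays.toString(recordsPerSplit(offsets, everyRecord, splits))); // [1, 1, 1, 1, 1, 1]
    }
}
```

With sparse sync marks, two of the six mappers in this model each copy three files and the other four copy none, which matches the symptom reported here. With a sync mark before every record, each mapper gets exactly one file.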