Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <10206780.1192653530853.JavaMail.jira@brutus>
Date: Wed, 17 Oct 2007 13:38:50 -0700 (PDT)
From: "Chris Douglas (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Resolved: (HADOOP-2032) distcp split generation does not
 work correctly
In-Reply-To: <11201030.1192108851203.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas resolved HADOOP-2032.
-----------------------------------

    Resolution: Duplicate

Fixed by HADOOP-2033

> distcp split generation does not work correctly
> -----------------------------------------------
>
>                 Key: HADOOP-2032
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2032
>             Project: Hadoop
>          Issue Type: Bug
>          Components: util
>            Reporter: Runping Qi
>
> With the current implementation, distcp always assigns multiple files to one mapper to copy, no matter how large
> the files are. This is because the CopyFiles class uses a sequence file to store the list of files to be copied,
> one record per file. The CopyFiles class correctly generates one split per record in the sequence file. However,
> because of the way the sequence file record reader works, the minimum unit for splits is the segment between two
> "sync marks" in the sequence file.
> As a result, some mappers get zero records (zero files to copy) even though their
> split lengths are non-zero, while other mappers get multiple records (multiple files to copy) from their split (and beyond,
> up to the next sync mark).
> When the CopyFiles class creates the sequence file, it does try to place a sync mark between splittable segments by calling the sync() function of the sequence file record writer.
> Unfortunately, the sync() function is a no-op for files that are not block compressed.
> After I changed the compression type for the sequence file to block compression,
> mappers got the correct records from their splits.
> So a simple fix is to change the compression type to CompressionType.BLOCK:
> {code}
> // create src list
>     SequenceFile.Writer writer = SequenceFile.createWriter(
>         jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
>         LongWritable.class, FilePair.class,
>         SequenceFile.CompressionType.BLOCK);
> {code}
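
To illustrate the behavior described above, here is a minimal sketch in plain Java. It is a simplified model, not Hadoop code: the class SplitSyncModel, its methods, the 100-byte record length, and the sync mark positions are all hypothetical. The only assumption it encodes is the one stated in the report, namely that a reader skips forward to the first sync mark at or after its split start and keeps reading until it sees a sync mark at or beyond its split end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of sync-mark-aligned split reading; not Hadoop code. */
class SplitSyncModel {

    /**
     * Returns the indices of the records a reader would return for the
     * byte range [start, end).
     *
     * offsets[i] is the byte offset of record i; synced[i] is true when a
     * sync mark immediately precedes record i.
     */
    static List<Integer> recordsForSplit(long[] offsets, boolean[] synced, long start, long end) {
        List<Integer> result = new ArrayList<>();
        // Skip forward to the first sync point at or after the split start.
        int i = 0;
        while (i < offsets.length && !(offsets[i] >= start && synced[i])) {
            i++;
        }
        // Nothing to read if that sync point lies at or beyond the split end.
        if (i == offsets.length || offsets[i] >= end) {
            return result;
        }
        // Read until a sync mark at or beyond the split end is seen.
        for (; i < offsets.length; i++) {
            if (offsets[i] >= end && synced[i]) {
                break;
            }
            result.add(i);
        }
        return result;
    }

    /** One split per record: counts how many records each split's reader sees. */
    static int[] countsPerSplit(long[] offsets, boolean[] synced, long recordLength) {
        int[] counts = new int[offsets.length];
        for (int s = 0; s < offsets.length; s++) {
            counts[s] = recordsForSplit(offsets, synced, offsets[s], offsets[s] + recordLength).size();
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100, 200, 300, 400, 500};
        // Sync marks only before records 0 and 3 (sync() acting as a no-op elsewhere).
        boolean[] sparse = {true, false, false, true, false, false};
        // A sync mark before every record (the effect the fix is after).
        boolean[] everyRecord = {true, true, true, true, true, true};
        System.out.println(Arrays.toString(countsPerSplit(offsets, sparse, 100)));
        System.out.println(Arrays.toString(countsPerSplit(offsets, everyRecord, 100)));
    }
}
```

With sparse sync marks the model prints [3, 0, 0, 3, 0, 0]: two splits read three records each and the other four read none, even though every split has a non-zero length. With a sync mark before every record it prints [1, 1, 1, 1, 1, 1], one file per mapper.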
