hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HADOOP-2032) distcp split generation does not work correctly
Date Wed, 17 Oct 2007 20:38:50 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas resolved HADOOP-2032.
-----------------------------------

    Resolution: Duplicate

Fixed by HADOOP-2033

> distcp split generation does not work correctly
> -----------------------------------------------
>
>                 Key: HADOOP-2032
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2032
>             Project: Hadoop
>          Issue Type: Bug
>          Components: util
>            Reporter: Runping Qi
>
> With the current implementation, distcp will always assign multiple files to one mapper to copy, no matter how large
> the files are. This is because the CopyFiles class uses a sequence file to store the list of files to be copied,
> one record per file. The CopyFiles class correctly generates one split per record in the sequence file. However,
> due to the way the sequence file record reader works, the minimum unit for splits is the segment between
> two sync marks in the sequence file.
> This results in the strange behavior that some mappers get zero records (zero files to copy) even though their
> split lengths are non-zero, while other mappers get multiple records (multiple files to copy) from their split (and beyond,
> up to the next sync mark).
> When the CopyFiles class creates the sequence file, it does try to place a sync mark between splittable segments
> by calling the sync() function of the sequence file record writer.
> Unfortunately, the sync() function is a no-op for files that are not block compressed.
> Naturally, after I changed the compression type for the sequence file to block compression,
> mappers got the correct records from their splits.
> So a simple fix is to change the compression type to CompressionType.BLOCK:
> {code}
> // create src list
>     SequenceFile.Writer writer = SequenceFile.createWriter(
>         jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
>         LongWritable.class, FilePair.class,
>         SequenceFile.CompressionType.BLOCK);
> {code}
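
The effect described in the report can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not Hadoop code: the class SyncMarkSplitSketch and its methods are hypothetical names, and the model assumes a reader that starts at the first sync point inside its split and keeps reading until the next sync point that falls at or past the end of the split. Under that model every record is consumed by the split that contains the sync point preceding it.

```java
import java.util.Arrays;

// Hypothetical, simplified model of split handling in a sync-marked file.
// It does not use or reproduce any Hadoop API.
class SyncMarkSplitSketch {

    // Returns the index of the split [bounds[k], bounds[k+1]) containing offset, or -1.
    static int splitContaining(long offset, long[] bounds) {
        for (int k = 0; k + 1 < bounds.length; k++) {
            if (offset >= bounds[k] && offset < bounds[k + 1]) {
                return k;
            }
        }
        return -1;
    }

    // recordOffsets[i]: byte offset where record i starts (sorted ascending).
    // syncBefore[i]:    true if a sync mark immediately precedes record i.
    // bounds:           split boundaries; split k covers [bounds[k], bounds[k+1]).
    // Assumption: the start of the file always acts as a sync point.
    static int[] recordsPerSplit(long[] recordOffsets, boolean[] syncBefore, long[] bounds) {
        int[] counts = new int[bounds.length - 1];
        int owner = -1;
        for (int i = 0; i < recordOffsets.length; i++) {
            // A new sync point hands ownership to the split that contains it.
            if (i == 0 || syncBefore[i]) {
                owner = splitContaining(recordOffsets[i], bounds);
            }
            if (owner >= 0) {
                counts[owner]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100, 200, 300};      // four records, one per file to copy
        long[] bounds = {0, 100, 200, 300, 400};  // one split per record

        // No sync marks between records: the first split reads everything.
        boolean[] noSync = {false, false, false, false};
        System.out.println(Arrays.toString(recordsPerSplit(offsets, noSync, bounds)));
        // prints [4, 0, 0, 0]

        // A sync mark before every record: each split reads exactly one record.
        boolean[] allSync = {false, true, true, true};
        System.out.println(Arrays.toString(recordsPerSplit(offsets, allSync, bounds)));
        // prints [1, 1, 1, 1]
    }
}
```

The first case corresponds to sync() being a no-op, so several splits see no records; the second corresponds to a sync mark actually being written at every intended split boundary.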

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

