hadoop-common-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
Date Mon, 21 Dec 2015 21:12:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067064#comment-15067064 ]

Yongjun Zhang commented on HADOOP-11794:

Thanks [~mithun]!

Not sure about {{CombineFileInputFormat}}, but I will take a look.

Hmm... Do we? DistCp copies whole files (even if at a split level). Since we can retrieve
located blocks for all blocks in the file, shouldn't that be enough? We could group locatedBlocks
by block-id. Perhaps I'm missing something.

Sorry I was not clear. This jira is to avoid copying a single large file within one mapper.
What I have in mind is to break a large file into block ranges (by a new distcp command line
arg), such as (0, 10), (10, 10), ..., (100, 4), where each entry is a pair (starting block index,
number of blocks), and all entries for the same file except the last have the same
number of blocks. We could then assign the entries of the same file to different mappers (to
work in parallel). To do this, we can use the API I described to fetch block
locations for a block range. My argument is that fetching all block locations for a file
is not as efficient as fetching only the block range the mapper is assigned to work on.
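
The splitting described above can be sketched roughly as follows. This is only an illustration of the arithmetic, not code from any patch on this jira; the class {{BlockRangeSplitter}}, the nested {{BlockRange}} type, and the {{blocksPerRange}} parameter (standing in for the proposed command line arg) are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DistCp: splits a file into block ranges. */
final class BlockRangeSplitter {

    /** One entry: (starting block index, number of blocks). */
    static final class BlockRange {
        final long startBlock;
        final long numBlocks;

        BlockRange(long startBlock, long numBlocks) {
            this.startBlock = startBlock;
            this.numBlocks = numBlocks;
        }

        @Override
        public String toString() {
            return "(" + startBlock + ", " + numBlocks + ")";
        }
    }

    /**
     * Splits a file of totalBlocks blocks into ranges of blocksPerRange blocks.
     * Every range except possibly the last has exactly blocksPerRange blocks.
     */
    static List<BlockRange> split(long totalBlocks, long blocksPerRange) {
        if (totalBlocks < 0 || blocksPerRange <= 0) {
            throw new IllegalArgumentException(
                "totalBlocks must be >= 0 and blocksPerRange must be > 0");
        }
        List<BlockRange> ranges = new ArrayList<>();
        for (long start = 0; start < totalBlocks; start += blocksPerRange) {
            // The last range only gets whatever blocks remain.
            long count = Math.min(blocksPerRange, totalBlocks - start);
            ranges.add(new BlockRange(start, count));
        }
        return ranges;
    }
}
```

For a file with 104 blocks and 10 blocks per range, {{BlockRangeSplitter.split(104, 10)}} yields 11 entries, starting with (0, 10) and ending with (100, 4). Each entry could then be handed to a different mapper.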

Do you agree that the API would help, based on my explanation here? I have prototyped
an API to fetch block locations for a block range, and will try to post it after the holiday.
I think there may be other applications that need this kind of API too.
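
For context, the existing {{FileSystem.getFileBlockLocations(FileStatus, long, long)}} call is addressed by byte offset and length rather than by block index. A minimal sketch of how a (starting block index, number of blocks) entry maps onto such a byte range, assuming every block except possibly the last has the same size; the class and method names here are hypothetical and are not the prototype mentioned above.

```java
/** Hypothetical helper: maps a block range to a byte range within a file. */
final class BlockRangeBytes {

    /**
     * Returns {offset, length} in bytes for the given block range.
     * Assumes a uniform block size; the result is clamped to the file
     * length because the last block of a file may be shorter.
     */
    static long[] toByteRange(long startBlock, long numBlocks,
                              long blockSize, long fileLength) {
        if (startBlock < 0 || numBlocks < 0 || blockSize <= 0 || fileLength < 0) {
            throw new IllegalArgumentException("invalid block range arguments");
        }
        // multiplyExact/addExact fail loudly instead of silently overflowing.
        long offset = Math.min(Math.multiplyExact(startBlock, blockSize), fileLength);
        long span = Math.multiplyExact(numBlocks, blockSize);
        long end = Math.min(Math.addExact(offset, span), fileLength);
        return new long[] {offset, end - offset};
    }
}
```

A mapper assigned the entry (100, 4) of a file with 1 GB blocks would ask for the bytes starting at offset 100 GB, with the length cut short if the final block is not full.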


> distcp can copy blocks in parallel
> ----------------------------------
>                 Key: HADOOP-11794
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11794
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: dhruba borthakur
>            Assignee: Yongjun Zhang
>         Attachments: MAPREDUCE-2257.patch
> The minimum unit of work for a distcp task is a file. We have files that are greater
than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either
take a very long time or eventually fail. A better way for distcp would be to copy all
the source blocks in parallel, and then stitch the blocks back into files at the destination
via the HDFS Concat API (HDFS-222).

This message was sent by Atlassian JIRA
