hadoop-common-issues mailing list archives

From "Erik Krogen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14137) Faster distcp by taking file list from fsimage or -lsr result
Date Thu, 02 Mar 2017 16:46:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892582#comment-15892582 ]

Erik Krogen commented on HADOOP-14137:

+1 on this. We recently made a similar effort when running DistCp over a very large number
of files, and I think this is useful in general. One note: you'll also have to provide a way
for the user to specify which directory the listed files should be considered relative to
(e.g. if one of the listed files is "/user/erik/dir/file", how much of that directory structure
ends up being replicated on the target).
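
To illustrate the relative-directory question, here is a minimal Python sketch, not DistCp
code. The function name `target_path` and its parameters are hypothetical; it only shows how
the choice of a source root decides how much of the listed path is recreated under the target.

```python
import posixpath


def target_path(source_file: str, source_root: str, target_root: str) -> str:
    """Map a listed source file to a destination path.

    Everything below source_root is preserved under target_root.
    All names here are illustrative, not part of DistCp.
    """
    rel = posixpath.relpath(source_file, source_root)
    # A relative path that climbs upward means the file is outside source_root.
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{source_file} is not under {source_root}")
    return posixpath.join(target_root, rel)


# The same listed file lands in different places depending on the chosen root.
print(target_path("/user/erik/dir/file", "/user/erik", "/backup"))
print(target_path("/user/erik/dir/file", "/user/erik/dir", "/backup"))
```

With "/user/erik" as the root the "dir" level is replicated on the target; with
"/user/erik/dir" as the root it is not.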

Also agreed with Steve that {{--listingFile}} is better than {{-list}}. 

> Faster distcp by taking file list from fsimage or -lsr result
> -------------------------------------------------------------
>                 Key: HADOOP-14137
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14137
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools/distcp
>            Reporter: Zheng Shao
> DistCp is very slow to start when the src directory has a huge number of subdirectories.
> In our case, we already have the directory listing (via "hdfs oiv -i fsimage" or via nightly
> "hdfs dfs -ls -R /" dumps), and we would like to use that instead of doing a real-time listing
> on the NameNode.
> The "-f" option doesn't help in this case because it would try to put everything into
> a single flat target directory.
> We'd like to introduce a new option "-list <file>" for distcp.  The <file>
> contains the result of listing the src directory.
> In order to achieve this, we plan to:
> 1. Add a new CopyListing class PregeneratedCopyListing, similar to SimpleCopyListing, which
> doesn't "-ls -R" into the directory but takes the listing via "-list".
> 2. Add an option "-list <file>" which will automatically make distcp use the new
> PregeneratedCopyListing class.
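
As a rough illustration of consuming a pregenerated listing, the Python sketch below parses
lines in the usual "hdfs dfs -ls -R" column layout (permissions, replication, owner, group,
size, date, time, path). It is not the proposed PregeneratedCopyListing class; the names
`ListingEntry` and `parse_ls_lines` are hypothetical, and the column layout is an assumption
about the dump format.

```python
from typing import Iterable, Iterator, NamedTuple


class ListingEntry(NamedTuple):
    path: str
    is_dir: bool
    size: int


def parse_ls_lines(lines: Iterable[str]) -> Iterator[ListingEntry]:
    """Yield one entry per well-formed line of an assumed `-ls -R` style dump."""
    for line in lines:
        # Split into at most 8 fields so that spaces inside the path survive.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[4].isdigit():
            continue  # skips "Found N items" headers and malformed lines
        permissions, size, path = fields[0], int(fields[4]), fields[7].rstrip("\n")
        yield ListingEntry(path=path, is_dir=permissions.startswith("d"), size=size)


sample = [
    "Found 2 items\n",
    "drwxr-xr-x   - erik hadoop          0 2017-03-01 12:00 /user/erik/dir\n",
    "-rw-r--r--   3 erik hadoop       1024 2017-03-01 12:01 /user/erik/dir/my file\n",
]
for entry in parse_ls_lines(sample):
    print(entry)
```

Reading the listing from a file this way avoids any recursive listing calls against the
NameNode at job start, which is the slowdown described above.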

