hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files
Date Thu, 16 Feb 2017 10:41:42 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869708#comment-15869708 ]

Steve Loughran commented on HADOOP-14086:

# The target version will have to be branch-2+, with backports as people feel appropriate.
# Please don't make things worse for object stores. One thing we've started doing there is
massively boosting the performance of listFiles(path, recursive=true), which we can take from
being a slow emulation of a recursive treewalk to an O(1 + files/5000) call. If you could use
that to iterate over the LocatedFileStatus entries, then hand that status data directly
to the workers, it'd be great for object stores while still delivering good NameNode performance.
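
To put rough numbers on the O(1 + files/5000) estimate above, here is a minimal sketch of the request-count arithmetic. It is only a model, not Hadoop code: the class and method names are hypothetical, the page size of 5000 is taken from the estimate in the comment, and the treewalk cost assumes at least one list request per directory, which is a simplification. In Hadoop itself the flat listing would come from FileSystem.listFiles(path, true), which returns a RemoteIterator<LocatedFileStatus>.

```java
/** Rough request-count model for listing a directory tree in an object store (illustrative only). */
final class ListingCost {
    /** Page size taken from the O(1 + files/5000) estimate; an assumption, not a fixed Hadoop constant. */
    static final int PAGE_SIZE = 5000;

    /** Flat recursive listing: one request, plus one more per full page of results. */
    static long flatListCalls(long files) {
        return 1 + files / PAGE_SIZE;
    }

    /** Emulated treewalk: assumed to cost at least one list request per directory visited. */
    static long treeWalkCalls(long directories) {
        return directories;
    }

    public static void main(String[] args) {
        long files = 1_000_000;   // hypothetical workload: many small files
        long directories = 50_000; // hypothetical directory count

        System.out.println("flat listing: " + flatListCalls(files) + " calls");
        System.out.println("treewalk:     " + treeWalkCalls(directories) + " calls");
    }
}
```

Under these assumed numbers the flat listing needs 201 requests against 50000 for the treewalk, which is why handing the resulting status entries straight to the workers would help object stores.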

> Improve DistCp Speed for small files
> ------------------------------------
>                 Key: HADOOP-14086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14086
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.6.5
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
> When using distcp to copy lots of small files, the NameNode naturally becomes a bottleneck.
> The current distcp code does *not* try to minimize NameNode calls. We should restructure
> the code to reduce the number of NameNode calls as much as possible, to speed up the copy of
> small files.

