hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files
Date Mon, 27 Feb 2017 19:17:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886349#comment-15886349 ]

Steve Loughran commented on HADOOP-14086:

Nothing yet; I'm just scared about what could be done.

If you look at HADOOP-11694 you can see what will be coming your way. The big one is HADOOP-13208,
as it can replace treewalking a mocked directory tree with direct object store API calls.

If you can use the listFiles calls here then, again, you get a significant speedup, especially at scale.
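
To make the difference concrete, here is a toy model in plain Java (no Hadoop classes) that counts listing requests for a treewalk versus a flat, paged listing. Everything in it is an assumption made for illustration: the class {{ListingCostSketch}}, its methods, the in-memory key set and the page size are all hypothetical and say nothing about how HADOOP-13208 is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy object store: a flat, sorted set of keys plus a request counter. */
class ListingCostSketch {
    final TreeSet<String> keys = new TreeSet<>();
    int requests = 0;

    /** Emulated directory listing: one request, returns the immediate children of dir. */
    List<String> listChildren(String dir) {
        requests++;
        TreeSet<String> children = new TreeSet<>();
        for (String key : keys.tailSet(dir)) {
            if (!key.startsWith(dir)) break;          // left the prefix range
            String rest = key.substring(dir.length());
            int slash = rest.indexOf('/');
            // A '/' in the remainder means the child is a (mocked) directory.
            children.add(slash < 0 ? key : dir + rest.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    /** Treewalk: one listing request per directory visited. */
    List<String> treeWalk(String dir) {
        List<String> files = new ArrayList<>();
        for (String child : listChildren(dir)) {
            if (child.endsWith("/")) files.addAll(treeWalk(child));
            else files.add(child);
        }
        return files;
    }

    /** Flat listing: one request per page of keys under the prefix. */
    List<String> flatList(String prefix, int pageSize) {
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(prefix)) {
            if (!key.startsWith(prefix)) break;
            files.add(key);
        }
        requests += Math.max(1, (files.size() + pageSize - 1) / pageSize);
        return files;
    }
}
```

With 100 directories of 10 files each under one prefix, the treewalk in this model issues one request for the parent plus one per directory, while the flat listing with a page size of 1000 issues a single request. Real stores add latency per request, which is where the gap at scale comes from.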

That listFiles call also returns a remote iterator of {{LocatedFileStatus}} instances;
it is up to the implementation to see if it can optimise this. Maybe HDFS could do some
stuff here too, e.g. async refresh of the next batch of entries while the first lot is being
processed.
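
The async refresh idea can be sketched as an iterator that requests batch n+1 in the background while the caller works through batch n. This is only an illustration under stated assumptions: it uses {{java.util.Iterator}} as a stand-in for Hadoop's {{RemoteIterator}} (whose methods throw {{IOException}}), and the class {{PrefetchingIterator}} and its {{fetchBatch}} callback, which returns an empty list once the listing is exhausted, are hypothetical names, not anything HDFS provides.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Iterator that fetches the next batch asynchronously while the current one is consumed. */
class PrefetchingIterator<T> implements Iterator<T> {
    // Hypothetical callback: returns batch number i, or an empty list when exhausted.
    private final IntFunction<List<T>> fetchBatch;
    private Iterator<T> current = Collections.emptyIterator();
    private CompletableFuture<List<T>> pending;
    private int nextIndex = 0;
    private boolean exhausted = false;

    PrefetchingIterator(IntFunction<List<T>> fetchBatch) {
        this.fetchBatch = fetchBatch;
        startFetch();
    }

    /** Kick off the fetch of the next batch on the common fork-join pool. */
    private void startFetch() {
        final int index = nextIndex++;
        pending = CompletableFuture.supplyAsync(() -> fetchBatch.apply(index));
    }

    @Override
    public boolean hasNext() {
        while (!current.hasNext() && !exhausted) {
            List<T> batch = pending.join();   // blocks only if the prefetch is still running
            if (batch.isEmpty()) {
                exhausted = true;
            } else {
                current = batch.iterator();
                startFetch();                 // overlap the next fetch with processing
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }
}
```

The design choice here is to keep at most one batch in flight, so memory stays bounded at two batches; a deeper prefetch queue would trade memory for more overlap.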

Note also HADOOP-13169: randomizing the file listing to spread load across shards in S3, so boosting
both read and write performance.
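
As a rough illustration of the randomizing idea: a sorted listing hands out neighbouring keys one after another, so shuffling the list before the work is divided up means consecutive copies are less likely to hit the same key range. The helper below is a hypothetical sketch in plain Java, not the HADOOP-13169 patch; the class name {{ShuffledListing}} and the seed parameter (there only to make runs repeatable) are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Returns a shuffled copy of a file listing; the input list is left untouched. */
class ShuffledListing {
    static List<String> shuffled(List<String> paths, long seed) {
        List<String> copy = new ArrayList<>(paths);
        Collections.shuffle(copy, new Random(seed)); // seeded so the order is reproducible
        return copy;
    }
}
```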

> Improve DistCp Speed for small files
> ------------------------------------
>                 Key: HADOOP-14086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14086
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.6.5
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
> When using distcp to copy lots of small files, NameNode naturally becomes a bottleneck.
> The current distcp code did *not* optimize to reduce the NameNode calls. We should restructure
> the code to reduce the number of NameNode calls as much as possible to speed up the copy of
> small files.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
