hadoop-common-issues mailing list archives

From "Erik Krogen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-14086) Improve DistCp Speed for small files
Date Mon, 27 Feb 2017 16:51:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886105#comment-15886105 ]

Erik Krogen edited comment on HADOOP-14086 at 2/27/17 4:51 PM:
---------------------------------------------------------------

[~zhz] Currently, multiple calls are made for each file; even reducing a distcp of 1M
files to 1M {{getFileInfo}} calls would be a big improvement over the current implementation.

[~stevel@apache.org], what about this JIRA makes you worry that object store performance will
be worse? Nothing stands out to me, so I am curious. Also, are you saying that the listFiles
performance work is already done, or still in progress? Do you have a JIRA link? It sounds very
interesting.


> Improve DistCp Speed for small files
> ------------------------------------
>
>                 Key: HADOOP-14086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14086
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.6.5
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>
> When using distcp to copy lots of small files, the NameNode naturally becomes a bottleneck.
> The current distcp code does *not* try to reduce the number of NameNode calls. We should restructure
> the code to reduce the number of NameNode calls as much as possible to speed up the copy of
> small files.
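
To make the call-count argument concrete, here is a toy model in Python. It is not
DistCp code and uses no Hadoop API: CountingNameNode, plan_copy_per_file and
plan_copy_single_lookup are hypothetical names, and the figure of three lookups per file
is an assumption (the discussion only says "multiple"). It simply counts simulated
metadata calls for the two patterns being compared.

```python
# Toy model only: CountingNameNode is a hypothetical stand-in, not a Hadoop API.
class CountingNameNode:
    """Counts simulated metadata calls against an in-memory namespace."""

    def __init__(self, files):
        self.files = dict(files)  # path -> file length in bytes
        self.rpc_calls = 0

    def get_file_info(self, path):
        # One simulated call per lookup, loosely modelled on getFileInfo.
        self.rpc_calls += 1
        return {"path": path, "length": self.files[path]}


def plan_copy_per_file(nn, paths, calls_per_file=3):
    """Assumed 'before' pattern: several lookups per file (calls_per_file >= 1)."""
    plan = []
    for path in paths:
        status = None
        for _ in range(calls_per_file):
            status = nn.get_file_info(path)  # repeated lookup of the same file
        plan.append(status)
    return plan


def plan_copy_single_lookup(nn, paths):
    """'After' pattern: look each file up once and reuse the result."""
    return [nn.get_file_info(path) for path in paths]


if __name__ == "__main__":
    files = {f"/src/f{i}": i for i in range(1000)}

    before = CountingNameNode(files)
    plan_copy_per_file(before, sorted(files))

    after = CountingNameNode(files)
    plan_copy_single_lookup(after, sorted(files))

    print("calls before:", before.rpc_calls)
    print("calls after: ", after.rpc_calls)
```

Both functions produce the same copy plan; only the number of simulated calls differs,
and it scales linearly with the assumed number of lookups per file.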



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

