hadoop-common-issues mailing list archives

From "Zoran Dimitrijevic (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11827) Speed-up distcp buildListing() using threadpool
Date Sat, 11 Apr 2015 19:45:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491170#comment-14491170 ]

Zoran Dimitrijevic commented on HADOOP-11827:
---------------------------------------------

Performance results and charts for the dataset I used (1.5M files and approx. 50K dirs):

https://docs.google.com/spreadsheets/d/1qJfO9ZhPXuGCpHyfX1NLE0Zm_NB39gn-cELECShd_zk/edit#gid=0

Please note that there are two sheets (s3n -> hdfs and hdfs -> hdfs). The main improvement
is when the source is in s3. The improvement when the source is hdfs is good as well, but since
the current distcp has to sort the input file, the total improvement is not as significant.

TODO: We could sort only the directories, which would further improve startup time.
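
For readers unfamiliar with the technique named in the issue title, here is a minimal sketch of building a file listing with a thread pool. It is not the code in HADOOP-11827.patch: it assumes a local tree walked with java.nio.file rather than Hadoop's FileSystem API, and the names ParallelListing, listTree and threads are hypothetical. Each directory is listed in its own task, so slow per-directory list calls (as on s3) can overlap instead of running one after another.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: list a directory tree with one pool task per directory. */
public final class ParallelListing {
    private final ExecutorService pool;
    private final Queue<Path> found = new ConcurrentLinkedQueue<>();
    private final Queue<IOException> errors = new ConcurrentLinkedQueue<>();
    // Number of directory tasks submitted but not yet finished.
    private final AtomicLong pending = new AtomicLong();
    private final CountDownLatch done = new CountDownLatch(1);

    private ParallelListing(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns every entry under root (files and directories), sorted by path. */
    public static List<Path> listTree(Path root, int threads)
            throws IOException, InterruptedException {
        ParallelListing job = new ParallelListing(threads);
        try {
            job.submit(root);
            job.done.await(); // released when the last directory task finishes
        } finally {
            job.pool.shutdownNow();
        }
        IOException first = job.errors.peek();
        if (first != null) {
            throw first;
        }
        List<Path> result = new ArrayList<>(job.found);
        Collections.sort(result); // a single sort once the listing is complete
        return result;
    }

    private void submit(Path dir) {
        // Count the task before it is queued, so the counter cannot reach zero
        // while a parent task is still discovering subdirectories.
        pending.incrementAndGet();
        pool.execute(() -> {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    found.add(entry);
                    if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                        submit(entry);
                    }
                }
            } catch (DirectoryIteratorException e) {
                errors.add(e.getCause());
            } catch (IOException e) {
                errors.add(e);
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }
}
```

A call such as ParallelListing.listTree(root, 20) would list the tree with 20 worker threads; the thread count is an arbitrary illustration, not a value taken from the patch or the spreadsheet.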

> Speed-up distcp buildListing() using threadpool
> -----------------------------------------------
>
>                 Key: HADOOP-11827
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11827
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Zoran Dimitrijevic
>            Assignee: Zoran Dimitrijevic
>            Priority: Minor
>         Attachments: HADOOP-11827.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For very large source trees on s3, distcp takes a long time to build the file listing (client
code, before starting mappers). For a dataset I used (1.5M files, 50K dirs) it was taking
65 minutes before my fix in HADOOP-11785 and 36 minutes after the fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
