hadoop-hdfs-issues mailing list archives

From "Anatoli Shein (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10679) libhdfs++: Implement parallel find with wildcards tool
Date Thu, 11 Aug 2016 17:47:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417657#comment-15417657 ]

Anatoli Shein commented on HDFS-10679:

I tested on my local cluster with the following directory tree:
a top-level directory containing 7 directories, each containing 100 directories, each of which contains another 100 directories.
Both the Libhdfs++ and Java Hadoop versions return correct output, so I redirected it to /dev/null for the timing runs.

Results for Libhdfs++: find / * 1 >/dev/null 2>&1
real    0m4.323s
user    0m1.236s
sys     0m0.864s
Results for Java: hadoop fs -ls -R / >/dev/null 2>&1
real    0m51.517s
user    0m16.368s
sys     0m26.988s

So the C++ tool is about 12x faster in wall-clock time (4.323s vs. 51.517s real)!

For a single empty directory, Libhdfs++ gives:
real    0m0.012s
user    0m0.004s
sys     0m0.004s

And Java Hadoop gives:
real    0m2.462s
user    0m3.224s
sys     0m0.524s

We are about 205x faster here because we do not have the JVM startup overhead.
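
For anyone who wants to check the two speedup figures, they follow from the `real` (wall-clock) times reported by `time`. Below is a minimal sketch of that arithmetic; the helper names `ParseTimeSeconds` and `Speedup` are made up for this illustration, and the parser assumes the `<minutes>m<seconds>s` format shown in the timings.

```cpp
#include <cmath>
#include <string>

// Convert a `time`-style duration such as "0m51.517s" into seconds.
// Assumes the input always has the form "<minutes>m<seconds>s".
double ParseTimeSeconds(const std::string& t) {
    const std::size_t m = t.find('m');
    const double minutes = std::stod(t.substr(0, m));
    const double seconds = std::stod(t.substr(m + 1));  // stod stops at the trailing 's'
    return minutes * 60.0 + seconds;
}

// Ratio of the slower wall-clock time to the faster one,
// rounded to the nearest integer.
long Speedup(const std::string& slow, const std::string& fast) {
    return std::lround(ParseTimeSeconds(slow) / ParseTimeSeconds(fast));
}
```

With the numbers above, `Speedup("0m51.517s", "0m4.323s")` gives 12 and `Speedup("0m2.462s", "0m0.012s")` gives 205.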

> libhdfs++: Implement parallel find with wildcards tool
> ------------------------------------------------------
>                 Key: HDFS-10679
>                 URL: https://issues.apache.org/jira/browse/HDFS-10679
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10679.HDFS-8707.000.patch, HDFS-10679.HDFS-8707.001.patch,
>                      HDFS-10679.HDFS-8707.002.patch, HDFS-10679.HDFS-8707.003.patch,
>                      HDFS-10679.HDFS-8707.004.patch, HDFS-10679.HDFS-8707.005.patch,
>                      HDFS-10679.HDFS-8707.006.patch, HDFS-10679.HDFS-8707.007.patch,
>                      HDFS-10679.HDFS-8707.008.patch, HDFS-10679.HDFS-8707.009.patch
>
> The find tool will issue the GetListing namenode operation on a given directory, and
> filter the results using the POSIX globbing library.
> If the recursive option is selected, for each returned entry that is a directory the
> tool will issue another asynchronous GetListing call and repeat the result processing
> recursively.
> One implementation issue that needs to be addressed is how results are returned to the
> user: we can either buffer the results and return them in bulk, or return them
> continuously as they arrive. While buffering would be the easier solution, returning
> results as they arrive would be better for the user in terms of performance, since
> result processing can start as soon as the first results arrive, without any delay.
> To do that, the user needs to process arriving results in a loop, and we need to send
> a special message back to the user when the search is over.
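
The flow described in the issue (list a directory, filter by glob, descend into subdirectories, stream matches, then signal completion) can be sketched roughly as below. This is not the libhdfs++ API or the code in the attached patches. `Entry`, `FakeNamespace`, `ResultHandler`, and `Find` are hypothetical names, and an in-memory map stands in for the namenode. The sketch is also single-threaded: a queue of pending directories stands in for the asynchronous GetListing calls. The only real library call is POSIX `fnmatch`.

```cpp
#include <fnmatch.h>  // POSIX glob-style matching

#include <deque>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry returned by a GetListing call.
struct Entry {
    std::string name;
    bool is_dir;
};

// Hypothetical in-memory "namenode": directory path -> its entries.
using FakeNamespace = std::map<std::string, std::vector<Entry>>;

// Each match is delivered as soon as it is found; std::nullopt plays the
// role of the special "search is over" message.
using ResultHandler = std::function<void(const std::optional<std::string>&)>;

void Find(const FakeNamespace& ns, const std::string& root,
          const std::string& pattern, bool recursive,
          const ResultHandler& on_result) {
    std::deque<std::string> pending{root};  // listings still to be issued
    while (!pending.empty()) {
        const std::string dir = pending.front();
        pending.pop_front();
        const auto it = ns.find(dir);
        if (it == ns.end()) continue;  // unknown directory: nothing to list
        const std::string prefix = (dir == "/") ? std::string() : dir;
        for (const Entry& e : it->second) {
            const std::string path = prefix + "/" + e.name;
            // Filter on the entry name with POSIX globbing.
            if (fnmatch(pattern.c_str(), e.name.c_str(), 0) == 0) {
                on_result(path);
            }
            // In recursive mode every directory gets its own listing later.
            if (recursive && e.is_dir) pending.push_back(path);
        }
    }
    on_result(std::nullopt);  // end-of-search marker
}
```

A caller passes a handler that treats a value as a match and `std::nullopt` as the end of the search. In a truly asynchronous version the handler would presumably be replaced by the user-side loop mentioned in the description, reading results from some queue until the end marker arrives.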
