hadoop-hdfs-issues mailing list archives

From "Anatoli Shein (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10679) libhdfs++: Implement parallel find with wildcards tool
Date Thu, 11 Aug 2016 17:48:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417660#comment-15417660
] 

Anatoli Shein commented on HDFS-10679:
--------------------------------------

Also, I just ran a test using "/usr/bin/time -v" to measure memory consumption on the
following directory structure:
root dir / 10 dirs / 1500 dirs each
(15011 directories total)

Our output:
               Command being timed: "find hdfs://localhost.localdomain:9433/ * 1"
               User time (seconds): 0.33
               System time (seconds): 0.11
               Percent of CPU this job got: 54%
               Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.82
               Average shared text size (kbytes): 0
               Average unshared data size (kbytes): 0
               Average stack size (kbytes): 0
               Average total size (kbytes): 0
               Maximum resident set size (kbytes): 17948
               Average resident set size (kbytes): 0
               Major (requiring I/O) page faults: 0
               Minor (reclaiming a frame) page faults: 2835
               Voluntary context switches: 4297
               Involuntary context switches: 27
               Swaps: 0
               File system inputs: 0
               File system outputs: 0
               Socket messages sent: 0
               Socket messages received: 0
               Signals delivered: 0
               Page size (bytes): 4096
               Exit status: 0

Java Hadoop output:
               Command being timed: "hadoop fs -ls -R hdfs://localhost.localdomain:9433/"
               User time (seconds): 14.19
               System time (seconds): 7.68
               Percent of CPU this job got: 142%
               Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.39
               Average shared text size (kbytes): 0
               Average unshared data size (kbytes): 0
               Average stack size (kbytes): 0
               Average total size (kbytes): 0
               Maximum resident set size (kbytes): 293088
               Average resident set size (kbytes): 0
               Major (requiring I/O) page faults: 0
               Minor (reclaiming a frame) page faults: 84515
               Voluntary context switches: 82654
               Involuntary context switches: 18714
               Swaps: 0
               File system inputs: 0
               File system outputs: 112
               Socket messages sent: 0
               Socket messages received: 0
               Signals delivered: 0
               Page size (bytes): 4096
               Exit status: 0

So we are using about 18 MB of memory (maximum resident set size) vs. Java's 293 MB (about 16x less).
Our elapsed time here is also about 19x shorter (0.82 s vs. 15.39 s).
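
For reference, the two ratios can be recomputed from the reports above. This is only a small arithmetic sketch; the dictionary and variable names are made up for illustration, and the numbers are copied from the "Maximum resident set size" and "Elapsed (wall clock) time" lines.

```python
# Figures copied from the two /usr/bin/time -v reports above.
libhdfspp = {"max_rss_kb": 17948, "elapsed_s": 0.82}
java_hadoop = {"max_rss_kb": 293088, "elapsed_s": 15.39}

# How many times more memory and wall-clock time the Java client used.
memory_ratio = java_hadoop["max_rss_kb"] / libhdfspp["max_rss_kb"]
time_ratio = java_hadoop["elapsed_s"] / libhdfspp["elapsed_s"]

print(f"memory:  {memory_ratio:.1f}x")   # prints "memory:  16.3x"
print(f"elapsed: {time_ratio:.1f}x")     # prints "elapsed: 18.8x"
```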

I am also planning to run a test with a million directories.

> libhdfs++: Implement parallel find with wildcards tool
> ------------------------------------------------------
>
>                 Key: HDFS-10679
>                 URL: https://issues.apache.org/jira/browse/HDFS-10679
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Anatoli Shein
>            Assignee: Anatoli Shein
>         Attachments: HDFS-10679.HDFS-8707.000.patch, HDFS-10679.HDFS-8707.001.patch,
HDFS-10679.HDFS-8707.002.patch, HDFS-10679.HDFS-8707.003.patch, HDFS-10679.HDFS-8707.004.patch,
HDFS-10679.HDFS-8707.005.patch, HDFS-10679.HDFS-8707.006.patch, HDFS-10679.HDFS-8707.007.patch,
HDFS-10679.HDFS-8707.008.patch, HDFS-10679.HDFS-8707.009.patch
>
>
> The find tool will issue the GetListing namenode operation on a given directory and
filter the results using a POSIX globbing library.
> If the recursive option is selected, then for each returned entry that is a directory the
tool will issue another asynchronous GetListing call and process its results in the same
way, recursively.
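
The listing-and-filter loop described above could look roughly like the sketch below. It is not the libhdfs++ code: `FAKE_NAMESPACE` and `get_listing` are hypothetical stand-ins for the namenode and the GetListing call, a thread pool stands in for the asynchronous calls, and Python's `fnmatch` module stands in for the POSIX globbing library.

```python
import fnmatch
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical in-memory namespace: directory path -> list of (name, is_dir).
FAKE_NAMESPACE = {
    "/": [("a", True), ("b", True), ("notes.txt", False)],
    "/a": [("data1.csv", False), ("logs", True)],
    "/a/logs": [("app.log", False)],
    "/b": [("data2.csv", False)],
}


def get_listing(path):
    """Hypothetical stand-in for one GetListing call on a directory."""
    return FAKE_NAMESPACE.get(path, [])


def find(root, pattern, recursive=True, workers=4):
    """Return sorted full paths whose last component matches the glob pattern."""
    matches = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each in-flight listing call to the directory it was issued for.
        pending = {pool.submit(get_listing, root): root}
        while pending:
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                parent = pending.pop(future)
                for name, is_dir in future.result():
                    full = parent.rstrip("/") + "/" + name
                    if fnmatch.fnmatchcase(name, pattern):
                        matches.append(full)
                    # Each directory entry triggers another listing call.
                    if is_dir and recursive:
                        pending[pool.submit(get_listing, full)] = full
    return sorted(matches)


print(find("/", "*.csv"))
print(find("/", "*", recursive=False))
```

Sorting the matches is only there to make the output stable; with concurrent listing calls the arrival order is not deterministic.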
> One implementation issue that needs to be addressed is how results are returned
to the user: we can either buffer the results and return them in bulk, or
return them continuously as they arrive. While buffering would be the easier solution,
returning results as they arrive would perform better for the user,
since result processing can start as soon as the first results arrive.
To do that, the user needs to process arriving results in a loop, and we need
to send a special message back to the user when the search is over.
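
A minimal sketch of the second option (results returned as they arrive, followed by a special end-of-search message), assuming a thread-safe queue between the search and the caller. `SEARCH_DONE` and `start_search` are hypothetical names used only for illustration; they are not part of libhdfs++.

```python
import queue
import threading

# Hypothetical end-of-search marker, compared by identity.
SEARCH_DONE = object()


def start_search(entries, results):
    """Push each entry to the queue as it becomes available, then the marker."""
    def produce():
        for entry in entries:
            results.put(entry)
        results.put(SEARCH_DONE)

    worker = threading.Thread(target=produce, daemon=True)
    worker.start()
    return worker


results = queue.Queue()
start_search(["/a/data1.csv", "/b/data2.csv"], results)

# The user's loop: handle results as they arrive, stop at the marker.
received = []
while True:
    item = results.get()
    if item is SEARCH_DONE:
        break
    received.append(item)

print(received)  # prints ['/a/data1.csv', '/b/data2.csv']
```

The caller can begin work on the first entry without waiting for the whole search, at the cost of having to write the loop and check for the marker.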



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

