hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Moved] (HDFS-2252) Improve dfs_getattr running time
Date Thu, 11 Aug 2011 18:24:30 GMT

     [ https://issues.apache.org/jira/browse/HDFS-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins moved HADOOP-4682 to HDFS-2252:
-------------------------------------------

    Affects Version/s:     (was: 0.20.0)
           Issue Type: Bug  (was: Improvement)
                  Key: HDFS-2252  (was: HADOOP-4682)
              Project: Hadoop HDFS  (was: Hadoop Common)

> Improve dfs_getattr running time
> --------------------------------
>
>                 Key: HDFS-2252
>                 URL: https://issues.apache.org/jira/browse/HDFS-2252
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Marc-Olivier Fleury
>
> As explained in issue HADOOP-3797, stat takes a long time to execute.
> I got a clearer idea of the time needed when testing a C program that crawls
> a directory tree containing tens of directories and 100K files. The original
> version used stat() to distinguish files from folders and needed about 1 hour
> to complete. I changed it to use dirent.d_type, which provides the same
> information at no extra cost when using readdir. The execution time dropped
> to 2-3 minutes.
> I also ran benchmarks using ls with and without color. On the local file
> system I got a speedup of 1.3, while on HDFS the speedup was 5.7. This means
> (very roughly) that calling stat through FUSE is 5.7/1.3 = 4.4 times slower
> than on the local file system.
> For applications that rely on stat to work correctly (there is sometimes no
> other way to distinguish a file from a folder), this can be a major source of
> delay. The application I am working on needs to stat about 30'000 files; a
> faster stat() function would save me hours (per task).
> I am sure I am not the only one who would appreciate a speedup, so I think
> this issue deserves consideration.
> I do not know whether the bottleneck is the call to hdfsGetPathInfo or to
> doConnectAsUser, but if it is doConnectAsUser, some improvements can surely
> be made.
> In the worst case, caching might help, as suggested in HADOOP-3797.
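
The d_type workaround described in the report can be sketched roughly as follows. This is only an illustration, not the reporter's actual program: the function name count_entries and its counters are hypothetical, and it assumes a Linux/glibc system where struct dirent exposes d_type. It falls back to lstat() only when the file system reports DT_UNKNOWN, so most entries are classified without a separate stat call per file.

```c
#define _DEFAULT_SOURCE   /* exposes d_type and the DT_* constants on glibc */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: walks the tree under `path`, adding the number of
 * directories to *dirs and the number of non-directory entries to *files.
 * Returns 0 on success, -1 if any directory could not be opened. */
int count_entries(const char *path, long *files, long *dirs)
{
    DIR *dp = opendir(path);
    if (dp == NULL)
        return -1;

    int rc = 0;
    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        char child[PATH_MAX];
        int n = snprintf(child, sizeof child, "%s/%s", path, de->d_name);
        if (n < 0 || (size_t)n >= sizeof child)
            continue;                     /* path too long: skip the entry */

        int is_dir;
        if (de->d_type != DT_UNKNOWN) {
            /* Type came for free with readdir(); no stat call needed. */
            is_dir = (de->d_type == DT_DIR);
        } else {
            /* The file system did not report a type: fall back to lstat(). */
            struct stat st;
            if (lstat(child, &st) != 0)
                continue;
            is_dir = S_ISDIR(st.st_mode);
        }

        if (is_dir) {
            (*dirs)++;
            if (count_entries(child, files, dirs) != 0)
                rc = -1;
        } else {
            (*files)++;
        }
    }
    closedir(dp);
    return rc;
}
```

A caller would initialize two long counters to zero and pass their addresses along with the root directory. Whether d_type is actually populated depends on the file system, which is why the lstat() fallback is kept.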

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
