[ https://issues.apache.org/jira/browse/HDFS-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000414#comment-13000414
]
Tsz Wo (Nicholas), SZE commented on HDFS-1658:
----------------------------------------------
> ... But still an application needs to call getFileInfo first. After it figures out the path is a directory, ...
It is unnecessary to call getFileInfo first since whether a path is a directory can be determined
by the file/directory counts returned by {{getContentSummary(..)}}.
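
To illustrate the point, here is a minimal sketch of telling a file from a directory using only the counts. {{SummarySketch}} is a hypothetical stand-in for the counters exposed by {{ContentSummary}} (length, file count, directory count), not the real Hadoop class. It assumes the behaviour of the default {{FileSystem}} implementation, where a plain file is summarized as one file and zero directories and a directory counts itself.

```java
// Hypothetical stand-in for the three counters on Hadoop's ContentSummary;
// it only mirrors the values, it is not the real class.
final class SummarySketch {
    final long length;          // total bytes under the path
    final long fileCount;       // number of files under the path
    final long directoryCount;  // number of directories, including the path itself

    SummarySketch(long length, long fileCount, long directoryCount) {
        this.length = length;
        this.fileCount = fileCount;
        this.directoryCount = directoryCount;
    }

    // Assumption: a file is summarized as (length, 1, 0), while a directory
    // counts itself, so its directoryCount is at least 1.
    boolean isDirectory() {
        return directoryCount > 0;
    }
}
```

With real Hadoop objects the same check would read the counters from the summary returned by {{getContentSummary(..)}}, so no separate status call is needed.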
We have two solutions:
# Change {{FileSystem.getFileInfo(..)}} to return the number of children in {{FileStatus.length}}.
# Add a new {{FileSystem.getContentSummary(..)}} method with _depth_ as a parameter.
\\
#1 is a semantic change but #2 is not. I am afraid that there may be user code relying on
the fact that {{FileStatus.length == 0}} when the path is a directory.
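
A rough sketch of what #2 could compute, using an in-memory tree instead of a real namespace. Everything here is hypothetical: {{NsNode}}, {{DepthLimitedSummary}} and the meaning of _depth_ (the number of levels to descend below the path) are assumptions made for illustration, since no such method exists yet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory namespace node; not an HDFS class.
final class NsNode {
    final boolean directory;
    final long length;                       // bytes for a file, 0 for a directory
    final List<NsNode> children = new ArrayList<>();

    private NsNode(boolean directory, long length) {
        this.directory = directory;
        this.length = length;
    }

    static NsNode file(long length) {
        return new NsNode(false, length);
    }

    static NsNode dir(NsNode... kids) {
        NsNode d = new NsNode(true, 0L);
        for (NsNode k : kids) {
            d.children.add(k);
        }
        return d;
    }
}

final class DepthLimitedSummary {
    // Returns {length, fileCount, directoryCount} for the subtree rooted at
    // node, descending at most `depth` levels below it (assumed semantics).
    static long[] summarize(NsNode node, int depth) {
        if (!node.directory) {
            return new long[] {node.length, 1L, 0L};
        }
        long[] total = {0L, 0L, 1L};         // a directory counts itself
        if (depth > 0) {
            for (NsNode child : node.children) {
                long[] s = summarize(child, depth - 1);
                total[0] += s[0];
                total[1] += s[1];
                total[2] += s[2];
            }
        }
        return total;
    }

    // Immediate children of a directory, derived from a depth-1 summary.
    static long childCount(NsNode dirNode) {
        long[] s = summarize(dirNode, 1);
        return s[1] + s[2] - 1;              // subtract the directory itself
    }
}
```

Under these assumptions a depth-1 summary yields the number of children without listing them, while {{FileStatus.length}} keeps its current meaning for directories.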
> A less expensive way to figure out directory size
> -------------------------------------------------
>
> Key: HDFS-1658
> URL: https://issues.apache.org/jira/browse/HDFS-1658
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Hairong Kuang
> Assignee: Hairong Kuang
>
> Currently, in order to figure out a directory size, we have to list the directory by calling the RPC getListing and count its children. This is an expensive operation, especially when a directory has many children, because it may require multiple RPCs.
> On the other hand, when fetching the status of a path (i.e. calling the RPC getFileInfo), the length field of FileStatus is set to 0 if the path is a directory.
> I am thinking of changing this field (FileStatus#length) to be the directory size when the path is a directory, so that we can call getFileInfo to get the directory size. This call is much less expensive and simpler than getListing.