hadoop-hdfs-issues mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1658) A less expensive way to figure out directory size
Date Mon, 28 Feb 2011 17:26:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000414#comment-13000414 ]

Tsz Wo (Nicholas), SZE commented on HDFS-1658:

> ... But still an application needs to call getFileInfo first. After it figures out the path is a directory, ...

It is unnecessary to call getFileInfo first since whether a path is a directory can be determined
by the file/directory counts returned by {{getContentSummary(..)}}.
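A minimal sketch of that check, not tied to the real Hadoop classes: {{Counts}} is a hypothetical stand-in for the file and directory counts a content summary carries, and the sketch assumes that a directory's summary counts the directory itself (so its directory count is at least 1) while a plain file reports one file and zero directories.

```java
public class DirectoryCheck {
    /** Hypothetical stand-in for the two counts a content summary reports. */
    static final class Counts {
        final long fileCount;
        final long directoryCount;

        Counts(long fileCount, long directoryCount) {
            this.fileCount = fileCount;
            this.directoryCount = directoryCount;
        }
    }

    /**
     * Assumption: a directory is counted in its own summary, so any
     * summary with a non-zero directory count describes a directory.
     */
    static boolean isDirectory(Counts counts) {
        return counts.directoryCount > 0;
    }

    public static void main(String[] args) {
        Counts plainFile = new Counts(1, 0); // one file, no directories
        Counts emptyDir = new Counts(0, 1);  // only the directory itself
        System.out.println("plain file is directory: " + isDirectory(plainFile));
        System.out.println("empty dir is directory: " + isDirectory(emptyDir));
    }
}
```

With this check a single summary call answers both questions (is it a directory, and how much does it contain), which is why the preliminary getFileInfo call can be skipped.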

We have two solutions:
# Change {{FileSystem.getFileInfo(..)}} to return the number of children in {{FileStatus.length}}.
# Add a new {{FileSystem.getContentSummary(..)}} method with _depth_ as a parameter.
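To illustrate solution 2, here is a rough sketch of a depth-limited summary over an in-memory tree. Everything in it is hypothetical: {{Node}}, {{Summary}}, {{summarize}} and {{childCount}} are illustration names, and the meaning of _depth_ (0 = the path itself, 1 = the path plus its immediate children) is an assumption, since the proposal does not define it.

```java
import java.util.ArrayList;
import java.util.List;

public class DepthSummary {
    /** Hypothetical in-memory namespace entry: a file or a directory. */
    static final class Node {
        final boolean directory;
        final long length; // bytes for a file, ignored for a directory
        final List<Node> children = new ArrayList<>();

        Node(boolean directory, long length) {
            this.directory = directory;
            this.length = length;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Hypothetical result type holding the aggregated counts. */
    static final class Summary {
        long files;
        long directories;
        long bytes;
    }

    /** Summarizes root, descending at most depth levels below it. */
    static Summary summarize(Node root, int depth) {
        Summary summary = new Summary();
        visit(root, depth, summary);
        return summary;
    }

    private static void visit(Node node, int depth, Summary summary) {
        if (!node.directory) {
            summary.files++;
            summary.bytes += node.length;
            return;
        }
        summary.directories++; // a directory counts itself
        if (depth > 0) {
            for (Node child : node.children) {
                visit(child, depth - 1, summary);
            }
        }
    }

    /** Number of immediate children: a depth-1 summary minus the directory itself. */
    static long childCount(Node dir) {
        Summary summary = summarize(dir, 1);
        return summary.files + summary.directories - 1;
    }

    public static void main(String[] args) {
        Node sub = new Node(true, 0).add(new Node(false, 5));
        Node root = new Node(true, 0)
                .add(new Node(false, 10))
                .add(new Node(false, 20))
                .add(sub);
        System.out.println("children of root: " + childCount(root));
    }
}
```

Under these assumptions a depth of 1 yields the child count without walking the whole subtree, and the meaning of {{FileStatus.length}} is left untouched.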

#1 is a semantic change but #2 is not.  I am afraid that some user code may rely on the
fact that {{FileStatus.length == 0}} when the path is a directory.
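As a hypothetical illustration of that risk, the sketch below models a status record with a stand-in {{Status}} class (not the real {{FileStatus}}) and compares a caller that infers "directory" from a zero length with one that uses an explicit directory flag. The child count of 3 is an invented value for the example.

```java
public class LengthAssumption {
    /** Hypothetical stand-in for a file status: a directory flag plus a length. */
    static final class Status {
        final boolean directory;
        final long length;

        Status(boolean directory, long length) {
            this.directory = directory;
            this.length = length;
        }
    }

    /** Fragile caller: treats any zero-length entry as a directory. */
    static boolean legacyIsDirectory(Status status) {
        return status.length == 0;
    }

    /** Robust caller: relies only on the explicit flag. */
    static boolean robustIsDirectory(Status status) {
        return status.directory;
    }

    public static void main(String[] args) {
        Status today = new Status(true, 0);      // current behavior: directory length is 0
        Status afterChange = new Status(true, 3); // solution 1: length holds the child count
        System.out.println("legacy, today: " + legacyIsDirectory(today));
        System.out.println("legacy, after change: " + legacyIsDirectory(afterChange));
        System.out.println("robust, after change: " + robustIsDirectory(afterChange));
    }
}
```

The fragile check stops recognizing any non-empty directory once the length carries a child count, while the flag-based check is unaffected.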

> A less expensive way to figure out directory size
> -------------------------------------------------
>                 Key: HDFS-1658
>                 URL: https://issues.apache.org/jira/browse/HDFS-1658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
> Currently, in order to figure out a directory's size, we have to list the directory by calling the getListing RPC and count its children. This is an expensive operation, especially when a directory has many children, because it may require multiple RPCs.
> On the other hand, when fetching the status of a path (i.e. calling the getFileInfo RPC), the length field of FileStatus is set to 0 if the path is a directory.
> I am thinking of changing this field (FileStatus#length) to be the directory size when the path is a directory, so that we can call getFileInfo to get the directory size. This call is much less expensive and simpler than getListing.


