hadoop-hdfs-issues mailing list archives

From "Hairong Kuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1658) A less expensive way to figure out directory size
Date Thu, 14 Apr 2011 04:30:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019687#comment-13019687 ]

Hairong Kuang commented on HDFS-1658:

I would like to discuss whether we could pursue option #1. Right now, if a path is a directory,
FileStatus.length is actually undefined; we simply happened to put 0 there. I think my proposal
only enhances the semantics and is, strictly speaking, not an incompatible change.

> It is unnecessary to call getFileInfo first
The problem is that most applications work with FileStatus. For example, getFileSplits in MapReduce
has to get the FileStatus of every file by traversing the input directories, calling getFileInfo
and listStatus. If we can tell that a directory is empty by looking at its FileStatus, we can
avoid issuing a listStatus call to list its children.
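
To make the point above concrete, here is a minimal, self-contained sketch of such a traversal. It is not Hadoop code: Status is a hypothetical stand-in for FileStatus, the in-memory listings map stands in for the NameNode, and it assumes the proposed semantics in which the length of a directory is its number of children.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy traversal, not Hadoop code: skips the listing call for directories known to be empty. */
class InputTraversalSketch {
    /** Hypothetical stand-in for FileStatus. */
    static final class Status {
        final String path;
        final boolean isDir;
        final long length; // assumed: child count when isDir is true (the proposed semantics)

        Status(String path, boolean isDir, long length) {
            this.path = path;
            this.isDir = isDir;
            this.length = length;
        }
    }

    /** In-memory stand-in for the namespace: directory path -> statuses of its children. */
    final Map<String, List<Status>> listings = new HashMap<>();
    int listCalls = 0; // counts simulated listStatus calls

    List<Status> listStatus(String dir) {
        listCalls++;
        return listings.getOrDefault(dir, new ArrayList<>());
    }

    /** Collects file paths under root, listing a directory only when its status says it has children. */
    List<String> collectFiles(Status root) {
        List<String> files = new ArrayList<>();
        if (!root.isDir) {
            files.add(root.path);
            return files;
        }
        if (root.length == 0) {
            return files; // empty directory: no listStatus call needed
        }
        for (Status child : listStatus(root.path)) {
            files.addAll(collectFiles(child));
        }
        return files;
    }

    public static void main(String[] args) {
        InputTraversalSketch fs = new InputTraversalSketch();
        List<Status> children = new ArrayList<>();
        children.add(new Status("/in/a", false, 42));
        children.add(new Status("/in/empty", true, 0));
        fs.listings.put("/in", children);

        List<String> files = fs.collectFiles(new Status("/in", true, 2));
        System.out.println(files + " found with " + fs.listCalls + " listStatus call(s)");
    }
}
```

In this sketch the empty directory /in/empty is never listed, so only the call for /in is counted. With the current semantics (length always 0 for directories) the check on length could not be used and every directory would have to be listed.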

> A less expensive way to figure out directory size
> -------------------------------------------------
>                 Key: HDFS-1658
>                 URL: https://issues.apache.org/jira/browse/HDFS-1658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
> Currently, in order to figure out a directory size, we have to list the directory by calling
> RPC getListing and count its children. This is an expensive operation, especially
> when a directory has many children, because it may require multiple RPCs.
> On the other hand, when fetching the status of a path (i.e. calling RPC getFileInfo),
> the length field of FileStatus is set to 0 if the path is a directory.
> I am thinking of changing this field (FileStatus#length) to be the directory size when
> the path is a directory, so that we can call getFileInfo to get the directory size. This call is
> much less expensive and simpler than getListing.
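
The cost difference described in the issue can be illustrated with a small model that counts simulated RPCs. This is a sketch under stated assumptions, not HDFS code: the class and method names are hypothetical, and the batch size of 1000 entries per getListing reply is an assumed value chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not HDFS code: counts simulated RPCs for two ways of sizing a directory. */
class DirSizeSketch {
    static final int LISTING_BATCH = 1000; // assumed number of entries returned per getListing reply

    int rpcCount = 0;
    private final List<String> children = new ArrayList<>();

    DirSizeSketch(int numChildren) {
        for (int i = 0; i < numChildren; i++) {
            children.add("file-" + i);
        }
    }

    /** Current approach: page through the listing and count the entries. */
    int sizeViaGetListing() {
        int seen = 0;
        do {
            rpcCount++; // one simulated getListing RPC per batch, even for an empty directory
            seen += Math.min(LISTING_BATCH, children.size() - seen);
        } while (seen < children.size());
        return seen;
    }

    /** Proposed approach: a single getFileInfo RPC whose length field carries the child count. */
    long sizeViaGetFileInfo() {
        rpcCount++;
        return children.size();
    }

    public static void main(String[] args) {
        DirSizeSketch dir = new DirSizeSketch(2500);
        int listed = dir.sizeViaGetListing();
        int listingRpcs = dir.rpcCount;

        dir.rpcCount = 0;
        long fromStatus = dir.sizeViaGetFileInfo();
        System.out.println("getListing:  size=" + listed + " rpcs=" + listingRpcs);
        System.out.println("getFileInfo: size=" + fromStatus + " rpcs=" + dir.rpcCount);
    }
}
```

For a directory with 2500 children, the listing path in this model needs three simulated RPCs while the status path needs one; the gap grows with the number of children.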

