hadoop-hdfs-issues mailing list archives

From "Hairong Kuang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
Date Fri, 12 Feb 2010 00:00:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832745#action_12832745 ]

Hairong Kuang commented on HDFS-946:

> If you are proposing that the object that is sent over-the-wire is different from FileStatus.
> If so, please consider the requirement of HDFS-878 too.

This jira tries to reduce the cost of getFileInfo and of listing a directory, whereas HDFS-878
adds cost to these two operations, so I will not implement HDFS-878 in this jira. Since we
are having so many problems with getFileInfo and directory listing, we should be very cautious
about adding anything to FileStatus in HDFS unless it is absolutely necessary.

I have run some experiments with my patch. I wrote an application that spawns 100 threads,
each of which performs 200 listings of a directory of size 1300. I used YourKit to profile the
NameNode while the application was running. Without the patch, the NameNode's CPU utilization
is 20~26% and the time spent on GC is 3~5%. With the patch, CPU utilization drops to 12~17%
and the time spent on GC is mostly 0%, occasionally rising to 1 or 2%.
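
For readers who want to reproduce a load of this shape, the following is a minimal sketch of such a driver, not the application used in the experiment above. The class name ListingLoadSketch and the listOnce callback are hypothetical; in a real run the callback would wrap an actual directory listing call against the cluster (for example FileSystem#listStatus), which is assumed here and replaced by a no-op so the sketch stays self-contained.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical load generator; an illustration only, not the original test application.
final class ListingLoadSketch {

    // Starts `threads` workers, each invoking `listOnce` `iterations` times,
    // and returns the total number of listing calls that completed.
    static long run(int threads, int iterations, Runnable listOnce) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < iterations; i++) {
                    listOnce.run();
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
                throw new IllegalStateException("load run did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("load run was interrupted", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // Assumption: a no-op stands in for a real listing call on the test directory.
        Runnable listOnce = () -> { };
        long calls = run(100, 200, listOnce);
        System.out.println("completed listing calls: " + calls);
    }
}
```

The CPU and GC figures themselves come from attaching a profiler to the NameNode while a driver like this runs; the sketch only generates the client-side load.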

> NameNode should not return full path name when listing a directory or getting the status
> of a file
> -------------------------------------------------------------------------------------------------
>                 Key: HDFS-946
>                 URL: https://issues.apache.org/jira/browse/HDFS-946
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>         Attachments: HDFSFileStatus.patch, HDFSFileStatus1.patch
> FSDirectory#getListing(String src) has the following code:
>       int i = 0;
>       for (INode cur : contents) {
>         listing[i] = createFileStatus(srcs+cur.getLocalName(), cur);
>         i++;
>       }
> So listing a directory returns an array of FileStatus, and each FileStatus element carries
> the full path name. This increases the return message size and adds non-negligible CPU time
> to the operation.
> FSDirectory#getFileInfo(String) does not need to return the file name either.
> Another optimization is that, in the version of FileStatus used in the wire protocol,
> the path field does not need to be a Path; ideally it could be a String or a byte array. This
> would avoid unnecessary creation of Path objects at the NameNode and thus help reduce the GC
> problem observed when a large number of getFileInfo or getListing operations hit the NameNode.
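
The idea in the description above can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: the class WireFileStatusSketch and its methods are hypothetical names. The wire-side object keeps only the local name as a byte array, and the client, which already knows the directory it listed, rebuilds the full path on its own side.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wire-side status object; illustrative only, not the class in the patch.
final class WireFileStatusSketch {
    private final byte[] localName; // local name only, as UTF-8 bytes; no Path object
    private final long length;      // file length in bytes

    WireFileStatusSketch(String localName, long length) {
        this.localName = localName.getBytes(StandardCharsets.UTF_8);
        this.length = length;
    }

    long getLength() {
        return length;
    }

    String getLocalName() {
        return new String(localName, StandardCharsets.UTF_8);
    }

    // Client side: rebuild the full path from the directory that was listed.
    // An empty local name models the getFileInfo case, where the caller
    // already holds the full path and nothing needs to be appended.
    String getFullName(String parent) {
        String name = getLocalName();
        if (name.isEmpty()) {
            return parent;
        }
        return parent.endsWith("/") ? parent + name : parent + "/" + name;
    }
}
```

With this shape, listing /user/data would ship only names such as part-0 for each entry, and the client would call getFullName("/user/data") when it needs the complete path.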

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
