hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-985) HDFS should issue multiple RPCs for listing a large directory
Date Thu, 18 Mar 2010 02:30:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846724#action_12846724 ]

Hadoop QA commented on HDFS-985:

-1 overall.  Here are the results of testing the latest attachment 
  against trunk revision 923467.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 19 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/272/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/272/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/272/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/272/console

This message is automatically generated.

> HDFS should issue multiple RPCs for listing a large directory
> -------------------------------------------------------------
>                 Key: HDFS-985
>                 URL: https://issues.apache.org/jira/browse/HDFS-985
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>         Attachments: directoryBrowse_0.20yahoo.patch, directoryBrowse_0.20yahoo_1.patch,
directoryBrowse_0.20yahoo_2.patch, iterativeLS_trunk.patch, iterativeLS_trunk1.patch, iterativeLS_trunk2.patch,
iterativeLS_trunk3.patch, iterativeLS_trunk3.patch, iterativeLS_trunk4.patch, iterativeLS_yahoo.patch,
iterativeLS_yahoo1.patch, testFileStatus.patch
> Currently HDFS issues a single RPC from the client to the NameNode to list a directory.
However, some directories are large, containing thousands or millions of items. Listing such
large directories in one RPC has a few shortcomings:
> 1. The list operation holds the global fsnamesystem lock for a long time, blocking
other requests. If a large number (e.g., thousands) of such list requests hit the NameNode in a
short period of time, the NameNode slows down significantly. Users end up noticing longer
response times or lost connections to the NameNode.
> 2. The size of the response message is unbounded. We observed a response as big as 50M bytes
when listing a directory of 300 thousand items. Even with the optimization introduced in HDFS-946,
which may be able to cut the response by 20-50%, the response size will still be on the order
of 10 megabytes.
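
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes that "50M bytes" means 50 * 10^6 bytes and that response size grows linearly with the number of items; neither is stated in the issue. The cap of 500 items per RPC is the number floated in the plan that follows.

```python
# Rough sizing based on the figures quoted in the issue description.
# Assumptions (not from the issue): "50M bytes" = 50 * 10**6 bytes,
# and response size scales linearly with the number of items.
RESPONSE_BYTES = 50 * 10**6
ITEMS = 300_000
PER_RPC_LIMIT = 500  # fixed cap suggested in the proposal

bytes_per_item = RESPONSE_BYTES / ITEMS

# HDFS-946 is said to cut the response by 20-50%.
# Integer arithmetic avoids floating-point rounding noise.
after_946_best = RESPONSE_BYTES * 50 // 100   # 50% reduction
after_946_worst = RESPONSE_BYTES * 80 // 100  # 20% reduction

# Size of one capped response and number of RPCs for the whole directory.
bytes_per_capped_rpc = bytes_per_item * PER_RPC_LIMIT
rpcs_needed = -(-ITEMS // PER_RPC_LIMIT)  # ceiling division

print(f"about {bytes_per_item:.0f} bytes per item")
print(f"after HDFS-946: {after_946_best:,} to {after_946_worst:,} bytes")
print(f"about {bytes_per_capped_rpc:,.0f} bytes per capped RPC")
print(f"{rpcs_needed} RPCs to list the whole directory")
```

Under these assumptions a single capped response is roughly 83 KB instead of tens of megabytes, at the cost of 600 round trips for the 300-thousand-item directory.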
> I propose to implement a directory listing using multiple RPCs. Here is the plan:
> 1. Each getListing RPC has an upper limit on the number of items returned.  This limit
could be configurable, but I am thinking of setting it to a fixed number such as 500.
> 2. Each RPC additionally specifies a start position for this listing request. I am thinking
of using the last item of the previous listing RPC as an indicator. Since the NameNode stores all
items in a directory as a sorted array, it can use the last item to locate the start item
of this listing even if the last item is deleted between the two consecutive calls. This
has the advantage of avoiding duplicate entries on the client side.
> 3. The return value additionally indicates whether the whole directory has been listed. If
the client sees a false flag, it issues another RPC.
> This proposal changes the semantics of large directory listing in the sense that listing
is no longer an atomic operation if the directory's contents change while the listing operation
is in progress.
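
The three-step plan can be sketched in a few lines. This is only an illustration, not the code in the attached patches: the names `get_listing` and `list_directory` and the `done` flag are hypothetical, the directory is modeled as a sorted Python list of names, and Python's string ordering stands in for whatever ordering the NameNode actually uses.

```python
import bisect
from typing import List, Tuple

LISTING_LIMIT = 500  # fixed cap per RPC, the number suggested in the proposal


def get_listing(children: List[str], start_after: str,
                limit: int = LISTING_LIMIT) -> Tuple[List[str], bool]:
    """Hypothetical server side: return up to `limit` names that sort
    strictly after `start_after`, plus a flag that is True when the
    whole directory has been listed. `children` must be sorted."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # A binary search finds the resume point even if `start_after`
    # itself has been deleted since the previous call.
    start = bisect.bisect_right(children, start_after)
    batch = children[start:start + limit]
    done = start + len(batch) >= len(children)
    return batch, done


def list_directory(children: List[str],
                   limit: int = LISTING_LIMIT) -> List[str]:
    """Hypothetical client side: keep issuing calls until done is True."""
    result: List[str] = []
    cursor = ""  # the empty string sorts before every non-empty name
    while True:
        batch, done = get_listing(children, cursor, limit)
        result.extend(batch)
        if done:
            return result
        cursor = batch[-1]  # last item of this batch is the next start position


names = [f"f{i:04d}" for i in range(1200)]
print(len(list_directory(names)))  # 1200 entries, fetched in three calls
```

Because each call resumes strictly after the last name returned, an entry is never returned twice, and deleting the cursor entry between calls does not break the resume point. Entries created or removed elsewhere in the directory between calls may or may not appear in the result, which is the loss of atomicity noted above.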

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.