hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-202) Add a bulk FIleSystem.getFileBlockLocations
Date Fri, 30 Jul 2010 00:43:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893883#action_12893883 ]

Suresh Srinivas commented on HDFS-202:

Unix 'ls' returns all of its results in one shot. When the response is fetched iteratively,
however, the behavior differs:
# When listing a single directory, if some results have already been returned and the directory
is then deleted, we should throw FileNotFoundException to indicate that the directory is no
longer available.
# When recursively listing under a directory, if a subdirectory is deleted, the more appropriate
response is to ignore the FileNotFoundException for that subdirectory and return the remaining
results. This would be consistent with the result of repeating the command. Further, if an
application is recursively listing a large directory whose state keeps changing, failing on
every deletion could force the application to retry many times before the listing completes.
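
The two rules above can be sketched roughly as follows. This is only an illustration under
stated assumptions: it walks the local file system with java.nio.file rather than the HDFS
FileSystem API, so NoSuchFileException stands in for FileNotFoundException, and the class
and method names (TolerantLister, listRecursively, collect) are hypothetical, not taken
from the attached patches.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: local file system via java.nio.file, not HDFS.
class TolerantLister {

    // Returns every non-directory entry found under root, recursively.
    static List<Path> listRecursively(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        // Rule 1: if the requested directory itself is missing, the
        // NoSuchFileException propagates to the caller.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                collect(entry, files);
            }
        }
        return files;
    }

    private static void collect(Path entry, List<Path> files) throws IOException {
        if (!Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
            // Skip entries that vanished between the listing and this check.
            if (Files.exists(entry, LinkOption.NOFOLLOW_LINKS)) {
                files.add(entry);
            }
            return;
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(entry)) {
            for (Path child : children) {
                collect(child, files);
            }
        } catch (NoSuchFileException e) {
            // Rule 2: the subdirectory was deleted while the listing was in
            // progress; ignore it and keep the results gathered so far.
        }
    }
}
```

With this shape a caller only has to retry when the top-level directory disappears; a
subdirectory deleted mid-listing just drops out of the result, as it would if the command
were repeated.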

> Add a bulk FIleSystem.getFileBlockLocations
> -------------------------------------------
>                 Key: HDFS-202
>                 URL: https://issues.apache.org/jira/browse/HDFS-202
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>         Attachments: hdfsListFiles.patch, hdfsListFiles1.patch
> Currently map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations
> to compute splits. However they are forced to call it once per file.
> The downsides are multiple:
>    # Even with a few thousand files to process the number of RPCs quickly starts getting
>    # The current implementation of getFileBlockLocations is too slow since each call
> results in 'search' in the namesystem. Assuming a few thousand input files it results in that
> many RPCs and 'searches'.
> It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory,
> and return the block-locations for all files in that directory. We could eliminate both the
> per-file RPC and also the 'search' by a 'scan'.
> When I tested this for terasort, a moderate job with 8000 input files the runtime halved
> from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...
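
The difference between a per-file 'search' and a per-directory 'scan' described in the
issue can be illustrated with a toy model. Everything here is an assumption made for
illustration: ToyNamesystem is a hypothetical in-memory sorted map, not HDFS code, the
calls counter merely simulates RPCs, and the host names are made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a namesystem; not HDFS code.
class ToyNamesystem {
    // Full file path -> hosts holding the file's blocks, sorted by path.
    private final TreeMap<String, List<String>> locations = new TreeMap<>();
    int calls = 0; // number of simulated RPCs

    void addFile(String path, List<String> hosts) {
        locations.put(path, hosts);
    }

    // Per-file lookup: one call and one tree search for each file.
    List<String> getLocations(String file) {
        calls++;
        return locations.get(file);
    }

    // Bulk lookup: one call, one search for the start of the directory's
    // key range, then a sequential scan over that range.
    Map<String, List<String>> getLocationsIn(String dir) {
        calls++;
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : locations.tailMap(prefix, true).entrySet()) {
            String path = e.getKey();
            if (!path.startsWith(prefix)) {
                break; // sorted order: past the end of this directory
            }
            if (path.indexOf('/', prefix.length()) >= 0) {
                continue; // lives in a nested subdirectory; direct children only
            }
            result.put(path, e.getValue());
        }
        return result;
    }
}
```

In this model, fetching locations for N files in one directory costs N calls through
getLocations but a single call through getLocationsIn, which is the saving the issue is
after.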

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
