hadoop-hdfs-issues mailing list archives

From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: [jira] Commented: (HDFS-202) Add a bulk FileSystem.getFileBlockLocations
Date Wed, 11 Aug 2010 23:08:47 GMT
Yes, I see it compiles now.

On 8/11/2010 1:47 PM, Hairong Kuang (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/HDFS-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897444#action_12897444
> Hairong Kuang commented on HDFS-202:
> ------------------------------------
> Konstantin, the hdfs trunk should be able to compile because I've committed this patch. HDFS-202 is the HDFS side of HADOOP-6900!
> Thanks Suresh for reviewing this patch at full speed! :-)
>> Add a bulk FileSystem.getFileBlockLocations
>> -------------------------------------------
>>                  Key: HDFS-202
>>                  URL: https://issues.apache.org/jira/browse/HDFS-202
>>              Project: Hadoop HDFS
>>           Issue Type: New Feature
>>           Components: hdfs client, name-node
>>             Reporter: Arun C Murthy
>>             Assignee: Hairong Kuang
>>              Fix For: 0.22.0
>>          Attachments: hdfsListFiles.patch, hdfsListFiles1.patch, hdfsListFiles2.patch, hdfsListFiles3.patch, hdfsListFiles4.patch, hdfsListFiles5.patch
>> Currently, map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
>> The downsides are multiple:
>>     # Even with a few thousand files to process, the number of RPCs quickly becomes noticeable.
>>     # The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. Assuming a few thousand input files, it results in that many RPCs and 'searches'.
>> It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory and return the block-locations for all files in that directory. We could eliminate the per-file RPC and also replace the per-file 'search' with a single 'scan'.
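
To make the RPC arithmetic in the proposal above concrete, here is a minimal Java sketch. It is a toy in-memory model, not the HDFS client API and not the patch attached to this issue: `ToyNameNode`, `getDirBlockLocations`, `buildNameNode`, and the `/input/part-NNNNN` paths are all hypothetical names chosen for illustration. It assumes a flat namespace held in a sorted map, so the files of one directory form a contiguous key range that a single scan can walk.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only. None of these names are part of Hadoop. */
public class BulkLocationsSketch {

    /** Hypothetical stand-in for a name-node: maps file paths to host lists. */
    static class ToyNameNode {
        // Sorted by path, so all entries under one directory prefix are adjacent.
        private final TreeMap<String, List<String>> hostsByPath = new TreeMap<>();
        int rpcCount = 0; // simulated client-to-name-node round trips

        void addFile(String path, List<String> hosts) {
            hostsByPath.put(path, hosts);
        }

        /** Per-file call: one simulated RPC and one lookup ('search') per file. */
        List<String> getFileBlockLocations(String path) {
            rpcCount++;
            return hostsByPath.get(path);
        }

        /**
         * Bulk call (hypothetical): one simulated RPC, then one range scan.
         * The namespace is treated as flat, so nested paths under dir are included too.
         */
        Map<String, List<String>> getDirBlockLocations(String dir) {
            rpcCount++;
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> e
                    : hostsByPath.tailMap(prefix, true).entrySet()) {
                if (!e.getKey().startsWith(prefix)) {
                    break; // past the directory's key range, stop scanning
                }
                result.put(e.getKey(), e.getValue());
            }
            return result;
        }
    }

    /** Builds a toy namespace with fileCount files in /input plus one unrelated file. */
    static ToyNameNode buildNameNode(int fileCount) {
        ToyNameNode nn = new ToyNameNode();
        for (int i = 0; i < fileCount; i++) {
            nn.addFile(String.format("/input/part-%05d", i),
                    Collections.singletonList("host" + (i % 3)));
        }
        nn.addFile("/other/data", Collections.singletonList("host9"));
        return nn;
    }

    public static void main(String[] args) {
        int files = 8000;

        ToyNameNode perFile = buildNameNode(files);
        for (int i = 0; i < files; i++) {
            perFile.getFileBlockLocations(String.format("/input/part-%05d", i));
        }
        System.out.println("per-file RPCs: " + perFile.rpcCount);

        ToyNameNode bulk = buildNameNode(files);
        Map<String, List<String>> all = bulk.getDirBlockLocations("/input");
        System.out.println("bulk RPCs: " + bulk.rpcCount
                + ", files returned: " + all.size());
    }
}
```

With 8000 files, the per-file loop in this model makes 8000 simulated round trips while the bulk call makes one. The sketch only counts calls; it says nothing about real name-node timings.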
>> When I tested this for terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive
