hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5795) Add a bulk FileSystem.getFileBlockLocations
Date Sun, 10 May 2009 09:41:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707768#action_12707768 ]

dhruba borthakur commented on HADOOP-5795:
------------------------------------------

If we adopt the approach that Doug has suggested, the namenode still has to search the
file system namespace once for each input path. That approach still has the advantage of
reducing the number of RPC calls. If we adopt Arun's proposal, in which the caller specifies
a directory and a single RPC call returns the splits of all the files in that directory, then
we reduce the number of searches in the FS namespace as well as the number of RPC calls. I was
leaning towards Arun's proposal, but Doug's approach is a little more flexible, isn't it?
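
To make the trade-off concrete, here is a rough toy model, not Hadoop code. It assumes
Doug's suggestion amounts to passing a list of paths in one call, and it stands in for the
namenode with a sorted map. The class BulkLocationsSketch, its methods and its counters
are all hypothetical names invented for illustration; the only thing it tries to show is
how the number of RPC calls and namespace searches differs between the three styles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a namenode namespace. Hypothetical; not the Hadoop API. */
public class BulkLocationsSketch {
    // Sorted map from absolute file path to a placeholder block-location string.
    private final TreeMap<String, String> namespace = new TreeMap<>();

    public int rpcCalls = 0;  // simulated client-to-namenode round trips
    public int searches = 0;  // simulated namespace lookups starting from the root

    public void addFile(String path, String location) {
        namespace.put(path, location);
    }

    /** Per-file style (current behaviour): one RPC and one search per file. */
    public String getOne(String path) {
        rpcCalls++;
        searches++;
        return namespace.get(path);
    }

    /** List-of-paths style (assumed reading of Doug's idea): one RPC, one search per path. */
    public List<String> getMany(List<String> paths) {
        rpcCalls++;
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            searches++;
            out.add(namespace.get(p));
        }
        return out;
    }

    /** Directory style (Arun's proposal): one RPC, one search, then a scan of the children. */
    public List<String> getDirectory(String dir) {
        rpcCalls++;
        searches++;  // locate the directory once
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> out = new ArrayList<>();
        // Keys sharing the prefix are contiguous in sorted order, so scan until it no longer matches.
        for (Map.Entry<String, String> e : namespace.tailMap(prefix, true).entrySet()) {
            String key = e.getKey();
            if (!key.startsWith(prefix)) {
                break;
            }
            // Keep direct children only; skip files in nested directories.
            if (key.indexOf('/', prefix.length()) < 0) {
                out.add(e.getValue());
            }
        }
        return out;
    }
}
```

For N files in one directory, this model counts N RPCs and N searches for the per-file
style, 1 RPC and N searches for the list-of-paths style, and 1 RPC and 1 search for the
directory style. The list-of-paths style is the only bulk form here that can take files
from arbitrary directories in a single call, which is the flexibility mentioned above.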


> Add a bulk FileSystem.getFileBlockLocations
> -------------------------------------------
>
>                 Key: HADOOP-5795
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5795
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Jakob Homan
>             Fix For: 0.21.0
>
>
> Currently, map-reduce applications (specifically file-based input-formats) use
> FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it
> once per file.
> The downsides are multiple:
>    # Even with a few thousand files to process, the number of RPCs quickly becomes
>      noticeable.
>    # The current implementation of getFileBlockLocations is too slow, since each call
>      results in a 'search' in the namesystem. Assuming a few thousand input files, it
>      results in that many RPCs and 'searches'.
> It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory
> and return the block-locations for all files in that directory. We could eliminate the
> per-file RPC and also replace the 'search' with a 'scan'.
> When I tested this for terasort, a moderate job with 8000 input files, the runtime halved
> from the current 8s to 4s. Clearly this is much more important for latency-sensitive
> applications...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

