hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-1981) Improve getSplits performance by using listFiles, the new FileSystem API
Date Wed, 10 Jul 2013 22:13:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705147#comment-13705147 ]

Jason Lowe commented on MAPREDUCE-1981:

bq. This looks like it would still be useful, but the patch has fallen out of date.

Agree this would be very useful.  [~hairong] are you planning to update this anytime soon,
or would you mind if I push it across the finish line?

bq. Also, how's this related to MAPREDUCE-2349 (if at all)?

MAPREDUCE-2349 involves using threads to pipeline the RPC overhead between independent calls.
Neither requires the other, so I think it's best to keep them separate.  Implementing this
JIRA may make MAPREDUCE-2349 unimportant for the common case, where a small number of input
directories leads to a large number of overall files that need to be located.  However, that
JIRA could still be useful for the case where the input is a large list of paths.  Without some
bulk API in the filesystem there's no getting around doing all the RPC calls, and the thread
idea in that JIRA can improve the latency of processing all those calls.
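
To make the RPC arithmetic above concrete, here is a rough, self-contained sketch.  It does not
use the Hadoop API: FakeNameNode and its methods (listStatus, getBlockLocations, listLocated)
are hypothetical stand-ins, and it assumes each call costs exactly one RPC.  A real bulk listing
is returned in batches, so a very large directory may need more than one call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class RpcCountSketch {
    /** Hypothetical stand-in for a NameNode; each method call counts as one RPC. */
    static class FakeNameNode {
        final AtomicInteger rpcs = new AtomicInteger();
        final int filesPerDir;

        FakeNameNode(int filesPerDir) { this.filesPerDir = filesPerDir; }

        /** One RPC: returns file names only. */
        List<String> listStatus(String dir) {
            rpcs.incrementAndGet();
            List<String> files = new ArrayList<>();
            for (int i = 0; i < filesPerDir; i++) files.add(dir + "/part-" + i);
            return files;
        }

        /** One RPC per file: returns that file's (fake) block location. */
        String getBlockLocations(String file) {
            rpcs.incrementAndGet();
            return file + "@host";
        }

        /** One RPC: returns names and locations together (the bulk idea). */
        List<String> listLocated(String dir) {
            rpcs.incrementAndGet();
            List<String> located = new ArrayList<>();
            for (int i = 0; i < filesPerDir; i++) located.add(dir + "/part-" + i + "@host");
            return located;
        }
    }

    /** List each directory, then look up every file separately. */
    static int perFileLookups(FakeNameNode nn, List<String> dirs) {
        int located = 0;
        for (String dir : dirs) {
            for (String file : nn.listStatus(dir)) {
                nn.getBlockLocations(file);
                located++;
            }
        }
        return located;
    }

    /** One bulk call per directory. */
    static int bulkListing(FakeNameNode nn, List<String> dirs) {
        int located = 0;
        for (String dir : dirs) located += nn.listLocated(dir).size();
        return located;
    }

    /** Same calls as bulkListing, issued from a thread pool so they can overlap. */
    static int bulkListingThreaded(FakeNameNode nn, List<String> dirs, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String dir : dirs) {
                Callable<Integer> task = () -> nn.listLocated(dir).size();
                pending.add(pool.submit(task));
            }
            int located = 0;
            for (Future<Integer> f : pending) located += f.get();
            return located;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> dirs = Arrays.asList("/in/a", "/in/b");
        FakeNameNode perFile = new FakeNameNode(100);
        perFileLookups(perFile, dirs);
        FakeNameNode bulk = new FakeNameNode(100);
        bulkListing(bulk, dirs);
        // 2 listings + 200 per-file lookups versus 2 bulk listings
        System.out.println("per-file RPCs: " + perFile.rpcs.get());
        System.out.println("bulk RPCs:     " + bulk.rpcs.get());
    }
}
```

In this model, threads leave the call count unchanged (one call per input path); they only let
the round trips overlap, which is why the two ideas help in different cases.
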
> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>                 Key: MAPREDUCE-1981
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>         Attachments: mapredListFiles1.patch, mapredListFiles2.patch, mapredListFiles3.patch,
mapredListFiles4.patch, mapredListFiles5.patch, mapredListFiles.patch
> This jira will make FileInputFormat and CombineFileInputFormat use the new API, thus
reducing the number of RPCs to the HDFS NameNode.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
