Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Wed, 10 Jul 2013 22:13:52 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12470378.1280351826970.28916.1373494432522@arcas>
In-Reply-To: <JIRA.12470378.1280351826970@arcas>
References: <JIRA.12470378.1280351826970@arcas>
Subject: [jira] [Commented] (MAPREDUCE-1981) Improve getSplits performance
 by using listFiles, the new FileSystem API
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705147#comment-13705147 ] 

Jason Lowe commented on MAPREDUCE-1981:
---------------------------------------

bq. This looks like it would still be useful, but the patch has fallen out of date.

Agree this would be very useful.  [~hairong] are you planning to update this anytime soon, or would you mind if I push it across the finish line?

bq. Also, how's this related to MAPREDUCE-2349 (if at all)?

MAPREDUCE-2349 involves using threads to pipeline the RPC overhead between independent calls.  Neither requires the other, so I think it's best to keep them separate.  Implementing this JIRA may make MAPREDUCE-2349 unimportant for the common case, where a small number of input directories leads to a large number of files that need to be located.  However, that JIRA could still be useful when the input itself is a large list of paths.  Without some bulk API in the filesystem there's no getting around making an RPC for each path, and the thread idea in that JIRA can reduce the latency of processing all those calls.
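
To make the RPC arithmetic above concrete, here is a toy model in plain Java. It does not use the Hadoop API: FakeNameNode, listStatus, getBlockLocations and listLocated are hypothetical stand-ins that only count simulated round trips. The per-file path loosely mirrors FileSystem.listStatus followed by getFileBlockLocations for each file, and the bulk path loosely mirrors FileSystem.listFiles. It assumes one RPC per listing; real HDFS pages large directory listings, so actual counts will differ.

```java
import java.util.ArrayList;
import java.util.List;

public class RpcCountSketch {

    /** Toy stand-in for the NameNode: it only counts simulated round trips. */
    static final class FakeNameNode {
        long rpcs = 0;

        /** One RPC: list the files under a path, without block locations. */
        List<String> listStatus(String path, int filesUnderPath) {
            rpcs++;
            return fileNames(path, filesUnderPath);
        }

        /** One RPC: fetch block locations for a single file. */
        String getBlockLocations(String file) {
            rpcs++;
            return file + "@some-datanode";
        }

        /** One RPC: list the files under a path with locations attached. */
        List<String> listLocated(String path, int filesUnderPath) {
            rpcs++;
            List<String> located = new ArrayList<>();
            for (String file : fileNames(path, filesUnderPath)) {
                located.add(file + "@some-datanode");
            }
            return located;
        }

        private static List<String> fileNames(String path, int count) {
            List<String> names = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                names.add(path + "/part-" + i);
            }
            return names;
        }
    }

    /** Old style: one listing per input path, then one lookup per file. */
    static long perFileLookups(int inputPaths, int filesPerPath) {
        FakeNameNode nn = new FakeNameNode();
        for (int p = 0; p < inputPaths; p++) {
            for (String file : nn.listStatus("/in/" + p, filesPerPath)) {
                nn.getBlockLocations(file);
            }
        }
        return nn.rpcs;
    }

    /** Bulk style: one located listing per input path, no per-file lookups. */
    static long bulkListing(int inputPaths, int filesPerPath) {
        FakeNameNode nn = new FakeNameNode();
        for (int p = 0; p < inputPaths; p++) {
            nn.listLocated("/in/" + p, filesPerPath);
        }
        return nn.rpcs;
    }

    public static void main(String[] args) {
        // Few directories, many files: the bulk listing removes almost all RPCs.
        System.out.println("2 dirs x 1000 files, per-file: " + perFileLookups(2, 1000));
        System.out.println("2 dirs x 1000 files, bulk:     " + bulkListing(2, 1000));
        // Many input paths, one file each: the bulk listing still needs one RPC per path.
        System.out.println("5000 paths x 1 file, per-file: " + perFileLookups(5000, 1));
        System.out.println("5000 paths x 1 file, bulk:     " + bulkListing(5000, 1));
    }
}
```

In this model the first case drops from 2002 simulated RPCs to 2, while the second only drops from 10000 to 5000. The second case is where overlapping independent calls with threads, as proposed in MAPREDUCE-2349, could still cut latency.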
                
> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1981
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>         Attachments: mapredListFiles1.patch, mapredListFiles2.patch, mapredListFiles3.patch, mapredListFiles4.patch, mapredListFiles5.patch, mapredListFiles.patch
>
>
> This JIRA will make FileInputFormat and CombineFileInputFormat use the new API, thus reducing the number of RPCs to the HDFS NameNode.
