Date: Wed, 10 Jul 2013 22:13:52 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-1981) Improve getSplits performance by using listFiles, the new FileSystem API

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705147#comment-13705147 ]

Jason Lowe commented on MAPREDUCE-1981:
---------------------------------------

bq. This looks like it would still be useful, but the patch has fallen out of date.

Agree this would be very useful. [~hairong] are you planning to update this anytime soon, or would you mind if I push it across the finish line?

bq. Also, how's this related to MAPREDUCE-2349 (if at all)?

MAPREDUCE-2349 involves using threads to pipeline the RPC overhead between independent calls. Neither requires the other, so I think it's best to keep them separate. Implementing this JIRA may make MAPREDUCE-2349 unimportant for the common case of a small number of input directories leading to a large number of overall files that need to be located. However, that JIRA could still be useful for the case where the input is a large list. Without some bulk API in the filesystem there's no getting around doing all the RPC calls, and the thread idea in that JIRA can improve the latency of processing all those calls.

> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1981
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>         Attachments: mapredListFiles1.patch, mapredListFiles2.patch, mapredListFiles3.patch, mapredListFiles4.patch, mapredListFiles5.patch, mapredListFiles.patch
>
>
> This jira will make FileInputFormat and CombineFileInputFormat use the new API, thus reducing the number of RPCs to the HDFS NameNode.
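
To make the RPC-count argument in the comment above concrete, here is a rough, self-contained sketch. It does not use the Hadoop API: ToyNameNode and all of its methods are hypothetical stand-ins, and it assumes each call costs exactly one RPC (a real listing may be paged into several). It only compares one lookup per file against one bulk listing per input directory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RpcCountSketch {

    /** Hypothetical toy NameNode (not a Hadoop class); every call counts as one RPC. */
    public static final class ToyNameNode {
        private final Map<String, List<String>> filesByDir = new LinkedHashMap<>();
        private int rpcCount = 0;

        /** Registers a directory holding fileCount files. Setup only, not an RPC. */
        public void addDir(String dir, int fileCount) {
            List<String> files = new ArrayList<>();
            for (int i = 0; i < fileCount; i++) {
                files.add(dir + "/part-" + i);
            }
            filesByDir.put(dir, files);
        }

        public int rpcCount() {
            return rpcCount;
        }

        /** One RPC: file names only, no block locations. */
        public List<String> listNames(String dir) {
            rpcCount++;
            return filesByDir.get(dir);
        }

        /** One RPC per file: fetch that file's (fake) block locations. */
        public String locate(String file) {
            rpcCount++;
            return "blocks-of:" + file;
        }

        /** One RPC: every file in the directory together with its (fake) locations. */
        public Map<String, String> listWithLocations(String dir) {
            rpcCount++;
            Map<String, String> located = new LinkedHashMap<>();
            for (String file : filesByDir.get(dir)) {
                located.put(file, "blocks-of:" + file);
            }
            return located;
        }
    }

    /** List each directory, then locate every file individually. */
    public static int rpcsWithPerFileLookups(ToyNameNode nn, List<String> inputDirs) {
        int before = nn.rpcCount();
        for (String dir : inputDirs) {
            for (String file : nn.listNames(dir)) {
                nn.locate(file);
            }
        }
        return nn.rpcCount() - before;
    }

    /** One bulk listing per input directory. */
    public static int rpcsWithBulkListing(ToyNameNode nn, List<String> inputDirs) {
        int before = nn.rpcCount();
        for (String dir : inputDirs) {
            nn.listWithLocations(dir);
        }
        return nn.rpcCount() - before;
    }

    public static void main(String[] args) {
        // Common case from the comment: few input directories, many files.
        ToyNameNode nn = new ToyNameNode();
        nn.addDir("/in/a", 1000);
        nn.addDir("/in/b", 1000);
        List<String> dirs = new ArrayList<>(List.of("/in/a", "/in/b"));

        System.out.println("per-file lookups: " + rpcsWithPerFileLookups(nn, dirs));
        System.out.println("bulk listing:     " + rpcsWithBulkListing(nn, dirs));
    }
}
```

In this toy model the bulk listing still needs one call per input path, so a job whose input is a long list of paths keeps a long series of independent RPCs. That is the case where overlapping the calls with threads, as proposed in MAPREDUCE-2349, could still reduce latency.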