Date: Fri, 14 Feb 2014 09:24:23 +0000 (UTC)
From: "Siddharth Seth (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-2349) speed up list[located]status calls from input formats

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-2349:
--------------------------------------

    Attachment: MAPREDUCE-2349.1.wip.txt

WIP patch that changes getSplits to make parallel requests. I have been using this for a while without issues and will try to upload a patch for review soon.

> speed up list[located]status calls from input formats
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-2349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2349
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Joydeep Sen Sarma
>         Attachments: MAPREDUCE-2349.1.wip.txt
>
>
> When a job has many input paths, the listStatus calls (or the improved listLocatedStatus calls) invoked from the getSplits() method can take a long time. Most of the time is spent waiting for the previous call to complete and then dispatching the next call.
> This can be greatly sped up by dispatching multiple calls at once (via executors). If the same filesystem client is used, the calls are much better pipelined (since calls are serialized) and don't impose extra burden on the namenode, while at the same time greatly reducing the latency to the client. In a simple test during non-peak hours, this reduced the getSplits() time from about 3s to about 0.5s.
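
The idea of dispatching the listing calls through an executor can be sketched roughly as below. This is only an illustration and not the attached WIP patch: the class `ParallelListing`, the method `listAll`, and the parameters `lister` and `numThreads` are hypothetical names. The `lister` function stands in for a call such as listStatus on a single shared filesystem client, so the sketch uses only the JDK and no Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: lists many input paths concurrently instead of one by one.
public class ParallelListing {

    // 'lister' is an assumed stand-in for a per-path listing call
    // (for example, listStatus on one shared filesystem client).
    public static <P, S> List<S> listAll(List<P> paths,
                                         Function<P, List<S>> lister,
                                         int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Dispatch every listing call up front so none waits on the previous one.
            List<Future<List<S>>> futures = new ArrayList<>();
            for (P path : paths) {
                Callable<List<S>> task = () -> lister.apply(path);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so the result order matches the input paths.
            List<S> result = new ArrayList<>();
            for (Future<List<S>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller in a getSplits()-like method would pass its input paths and a function wrapping the real listing call. Because every task shares the same `lister`, this mirrors the single-client setup described above. A failed listing surfaces as an `ExecutionException` from `future.get()`.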