Date: Wed, 12 Apr 2017 12:13:41 +0000 (UTC)
From: "Steve Loughran (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965716#comment-15965716 ]

Steve Loughran commented on MAPREDUCE-5907:
-------------------------------------------

I don't know of anyone looking at it. It's an out-of-date patch, combining optimisations in the FS code, the S3N and HAR FS implementations, and changes in the MR code to match.

If the changes to the mapreduce module can go in today, using the existing {{FileSystem.listFiles(path, recursive)}} call, then it'll be straightforward: that's the only bit which needs review and merge; S3A already handles that recursively very efficiently, and the other object stores can be brought up to speed.

If we need changes to the FS, well, I'm not against them (there are definite inconsistencies there), but it's a more serious change: the HDFS team will need to look at that, and we'll need changes to the FS spec, contract tests, etc. Lots of work and so harder to get in.

Why not see if you can apply just the MR changes, and see what happens?

> Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>            Assignee: Sumit Kumar
>              Labels: BB2015-05-TBR
>         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, MAPREDUCE-5907.patch
>
>
> FileInputFormat (both the mapreduce and mapred implementations) uses recursive listing while calculating splits. However, it does this by listing level by level. That means that to discover the files in /foo/bar it lists /foo/bar first to get the immediate children, then makes the same call on each immediate child of /foo/bar to discover their immediate children, and so on. This doesn't scale well for object-store-based fs implementations like s3 and swift, because every listStatus call ends up being a web service call to the backend. When a large number of files are considered for input, this makes the getSplits() call slow.
> This patch adds a new set of recursive list APIs that give the fs implementations an opportunity to optimize. The behavior remains the same for other implementations (that is, a default implementation is provided for other fs, so they don't have to implement anything new). However, for object-store-based fs implementations it provides a simple change, setting the recursive flag to true (as shown in the patch), to improve listing performance.
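
The cost difference described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions, not Hadoop code and not the patch: the class {{ListingSketch}}, its methods and the sample keys are all hypothetical. The store is modelled as a flat, sorted set of keys, every listing method counts as one backend request, and paging of large listings is ignored. In real code the two calls being compared would be {{FileSystem.listStatus(Path)}} applied per directory and {{FileSystem.listFiles(Path, boolean)}} with the recursive flag set to true.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

// Hypothetical toy model, not Hadoop code: an object store as a flat, sorted
// set of keys, where every listing method counts as one backend request.
class ListingSketch {
    private final TreeSet<String> keys = new TreeSet<>();
    int calls = 0; // simulated web service calls made so far

    void put(String key) {
        keys.add(key);
    }

    private static String asPrefix(String dir) {
        return dir.endsWith("/") ? dir : dir + "/";
    }

    // One request: the immediate children of dir. Subdirectories are reported
    // with a trailing '/', files as their full key.
    List<String> listStatus(String dir) {
        calls++;
        String prefix = asPrefix(dir);
        TreeSet<String> children = new TreeSet<>();
        for (String key : keys.tailSet(prefix)) {
            if (!key.startsWith(prefix)) break; // left the sorted prefix range
            if (key.equals(prefix)) continue;   // skip a marker for dir itself
            int slash = key.indexOf('/', prefix.length());
            children.add(slash < 0 ? key : key.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    // One request: every file below dir, found by a single flat prefix scan.
    List<String> listFilesRecursive(String dir) {
        calls++;
        String prefix = asPrefix(dir);
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(prefix)) {
            if (!key.startsWith(prefix)) break;
            if (!key.endsWith("/")) files.add(key);
        }
        return files;
    }

    // Level-by-level discovery as the issue describes it: list a directory,
    // then repeat the same call on each child directory.
    List<String> walkLevelByLevel(String root) {
        List<String> files = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            for (String child : listStatus(pending.remove())) {
                if (child.endsWith("/")) pending.add(child);
                else files.add(child);
            }
        }
        return files;
    }

    // Made-up sample tree: four files spread over four directories.
    static ListingSketch sample() {
        ListingSketch store = new ListingSketch();
        store.put("/foo/bar/part-0");
        store.put("/foo/bar/a/part-1");
        store.put("/foo/bar/a/part-2");
        store.put("/foo/bar/b/c/part-3");
        return store;
    }

    public static void main(String[] args) {
        ListingSketch store = sample();
        List<String> walked = store.walkLevelByLevel("/foo/bar");
        System.out.println("level-by-level: " + walked.size() + " files, " + store.calls + " calls");
        store.calls = 0;
        List<String> flat = store.listFilesRecursive("/foo/bar");
        System.out.println("recursive: " + flat.size() + " files, " + store.calls + " calls");
    }
}
```

In this model the level-by-level walk issues one request per directory it visits, while the recursive listing issues a single request for the whole tree. That is the gap the issue attributes to object stores; a real store would also page long listings, so the saving there is smaller than the model suggests.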