Date: Fri, 26 Jan 2018 22:20:00 +0000 (UTC)
From: "Steve Loughran (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-15192) S3A listStatus excessively slow -hurts Spark job partitioning

    [ https://issues.apache.org/jira/browse/HADOOP-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341686#comment-16341686 ]

Steve Loughran commented on HADOOP-15192:
-----------------------------------------

Well, really it is that we are reaching the limits of how well we can make object stores pretend to be filesystems.
Can't blame the apps there.

> S3A listStatus excessively slow -hurts Spark job partitioning
> -------------------------------------------------------------
>
>                 Key: HADOOP-15192
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15192
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Michel Lemay
>            Priority: Minor
>             Fix For: 2.8.0
>
>
> Symptoms:
> - CloudWatch Metrics for S3 show an unexpectedly large number of 4xx errors in our bucket.
> - Performance when listing files recursively is abysmal (15 minutes on our bucket compared to less than 2 minutes using the CLI `aws s3 ls`).
> Analysis:
> - In the CloudTrail logs for this bucket, we found that it generates one 404 (NoSuchKey) error per folder listed recursively.
> - Spark recursively calls FileSystem::listStatus (the S3AFileSystem implementation from Hadoop-aws:2.7.3), which in turn calls getFileStatus to determine whether the path is a directory.
> - It turns out that this call to getFileStatus yields a 404 when the path used is a directory but does not end with a slash. It then retries with the slash appended (incurring one extra, unneeded call to S3).
> Questions:
> - Why is the trailing slash removed in the first place? (The Hadoop Path class normalizes paths by removing trailing slashes when constructed.)
> - S3AFileSystem::listStatus needs to know whether the path is a directory. However, it's a common usage pattern to already have that FileStatus object in hand when recursively listing files, so the extra lookup incurs an unneeded performance penalty. The base FileSystem class could offer an optimized API that uses this assumption (or fix the unoptimized call to listStatus in listLocatedStatus(recursive=true)).
> - I might be wrong on this last bullet, but I think the S3 object API fetches every object under a prefix (not just the current level) and filters them out. If that is the case, there should be opportunities for an efficient recursive listStatus implementation for S3 using paginated calls on the top-level folder only.
>
> Note: all this is in the context of Spark jobs reading hundreds of thousands of Parquet files organized and partitioned hierarchically, as recommended. Every time we read the data, Spark recursively lists all files and folders to discover the partitions (folder names).
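
To make the request counts in the report concrete, here is a rough sketch that compares the two listing strategies against a toy in-memory store. Nothing in it is real S3 or Hadoop code: `FakeObjectStore`, `walk_listing`, `flat_listing` and `build_keys` are hypothetical names invented for the illustration, directories are assumed to exist as explicit marker keys ending in "/", and one request is charged per page of listing results. It only models the behaviour described above (a missed probe without the slash, a retry with the slash, then a per-level listing) next to a single paginated listing of everything under the top-level prefix.

```python
from math import ceil


class FakeObjectStore:
    """Toy in-memory stand-in for an object store; counts every request made."""

    def __init__(self, keys, page_size=1000):
        self.keys = sorted(set(keys))
        self.page_size = page_size
        self.requests = 0
        self.not_found = 0

    def _count_pages(self, n_entries):
        # Assumption: one request per page of results, at least one per call.
        self.requests += max(1, ceil(n_entries / self.page_size))

    def head(self, key):
        """Exact-key probe; a miss models the 404 (NoSuchKey) from the report."""
        self.requests += 1
        if key in self.keys:
            return True
        self.not_found += 1
        return False

    def list_level(self, prefix):
        """List one level under prefix: direct files plus child 'directories'."""
        files, dirs = [], set()
        for key in self.keys:
            if not key.startswith(prefix) or key == prefix:
                continue
            rest = key[len(prefix):]
            if "/" in rest:
                dirs.add(prefix + rest.split("/", 1)[0] + "/")
            else:
                files.append(key)
        self._count_pages(len(files) + len(dirs))
        return files, sorted(dirs)

    def list_flat(self, prefix):
        """List every key under prefix, at any depth, dropping marker keys."""
        matches = [k for k in self.keys if k.startswith(prefix)]
        self._count_pages(len(matches))
        return [k for k in matches if not k.endswith("/")]


def walk_listing(store, root):
    """Directory-by-directory walk, modelled on the behaviour in the report."""
    files = []
    pending = [root.rstrip("/")]  # trailing slash dropped, as Path would do
    while pending:
        path = pending.pop()
        if store.head(path):  # hits only if the path is a plain file
            files.append(path)
            continue
        if not store.head(path + "/"):  # retry with the slash appended
            continue
        level_files, level_dirs = store.list_level(path + "/")
        files.extend(level_files)
        pending.extend(d.rstrip("/") for d in level_dirs)
    return sorted(files)


def flat_listing(store, root):
    """Single paginated listing of everything under the top-level prefix."""
    return store.list_flat(root.rstrip("/") + "/")


def build_keys():
    """Hypothetical table: 2 years x 12 months x 3 Parquet files, with markers."""
    keys = ["table/"]
    for year in (2016, 2017):
        keys.append(f"table/year={year}/")
        for month in range(1, 13):
            directory = f"table/year={year}/month={month:02d}/"
            keys.append(directory)
            keys.extend(f"{directory}part-{i}.parquet" for i in range(3))
    return keys


if __name__ == "__main__":
    walk_store = FakeObjectStore(build_keys())
    flat_store = FakeObjectStore(build_keys())
    assert walk_listing(walk_store, "table/") == flat_listing(flat_store, "table/")
    print("walk:", walk_store.requests, "requests,", walk_store.not_found, "misses")
    print("flat:", flat_store.requests, "requests")
```

In this model the walk spends three requests on every directory (one of them a guaranteed miss), so its cost grows with the number of partition folders, while the flat listing grows only with the number of result pages. Real object stores and real S3A versions may behave differently; the sketch is only meant to show why the per-folder 404s add up.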