Date: Thu, 12 May 2016 00:30:16 +0000
From: "Sailesh Mukil (Code Review)"
To: Alex Behm, Dan Hecht, impala-cr@cloudera.com, dev@impala.incubator.apache.org
Reply-To: sailesh@cloudera.com
Subject: [Impala-CR](cdh5-trunk) IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew

Hello Alex Behm, Dan Hecht,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/2968

to look at the new patch set (#3).

Change subject: IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew
......................................................................

IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew

Previously, we treated Parquet as a non-splittable file format. However, we have since changed our Parquet scanner so that each row group is assigned to the split that contains it. This allows us to chop up a Parquet file into multiple splits and still have the file be scanned reliably.

This patch marks Parquet as a splittable file format. synthesizeBlockMetadata() can now split a Parquet file on S3 into multiple "blocks" instead of assigning one scan range per file, which distributes scan ranges more evenly across the cluster and greatly reduces skew.

P.S.: To control the size of scan ranges for splittable files on S3, change the "block" size of the S3A filesystem, which is governed by "fs.s3a.block.size". Its default value is 32MB.

Change-Id: Ib1518ad0c89ef35a3b0567c3902e85a41e34bc3d
---
M fe/src/main/java/com/cloudera/impala/catalog/HdfsFileFormat.java
1 file changed, 1 insertion(+), 2 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/68/2968/3
--
To view, visit http://gerrit.cloudera.org:8080/2968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib1518ad0c89ef35a3b0567c3902e85a41e34bc3d
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil
Gerrit-Reviewer: Alex Behm
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Sailesh Mukil
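
The splitting described in the commit message can be illustrated with a minimal sketch. This is not Impala's code: the names SplitSketch, Range, splitIntoBlocks and ownerOf are hypothetical, and the rule that a row group belongs to the range containing its midpoint is an assumption made here for illustration (the message only says row groups are assigned based on the split that contains them). The 32MB figure in the usage note below is the "fs.s3a.block.size" default quoted in the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Impala source tree.
final class SplitSketch {

    /** A byte range [offset, offset + length) within a single file. */
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        boolean contains(long pos) {
            return pos >= offset && pos < offset + length;
        }
    }

    /** Cuts a file of fileLen bytes into consecutive ranges of at most blockSize bytes. */
    static List<Range> splitIntoBlocks(long fileLen, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            // The last range may be shorter than blockSize.
            ranges.add(new Range(off, Math.min(blockSize, fileLen - off)));
        }
        return ranges;
    }

    /**
     * Returns the index of the range that owns a row group, or -1 if none does.
     * Assumption: ownership is decided by the row group's midpoint, so every
     * row group is scanned by exactly one range even if it straddles a boundary.
     */
    static int ownerOf(List<Range> ranges, long rowGroupOffset, long rowGroupLength) {
        long mid = rowGroupOffset + rowGroupLength / 2;
        for (int i = 0; i < ranges.size(); i++) {
            if (ranges.get(i).contains(mid)) {
                return i;
            }
        }
        return -1;
    }
}
```

With a 32MB block size, a 100MB file yields four ranges (three of 32MB and one of 4MB), so up to four scan ranges can be distributed across the cluster instead of one. Under the midpoint assumption, a row group that starts at 30MB and is 10MB long falls to the second range (index 1).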