impala-dev mailing list archives

From "Sailesh Mukil (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew
Date Thu, 12 May 2016 00:30:16 GMT
Hello Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/2968

to look at the new patch set (#3).

Change subject: IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew
......................................................................

IMPALA-3453: S3: Uneven split sizes are generated for Parquet causing execution skew

Previously, we treated Parquet as a non-splittable file format.
However, the Parquet scanner has since been changed to assign each row
group to the split that contains it. This allows us to chop a Parquet
file into multiple splits and still have the file scanned reliably.
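
To illustrate what "the split that contains them" can mean when a row
group straddles a split boundary, here is a minimal sketch. It assumes
that a row group belongs to the split containing its midpoint byte; that
rule, the class name, and the method name are assumptions made for
illustration and are not taken from the Impala scanner code.

```java
// Hypothetical sketch; not the actual Impala Parquet scanner logic.
final class RowGroupAssignment {
  private RowGroupAssignment() {}

  /**
   * Returns true if the split [splitOffset, splitOffset + splitLength)
   * contains the midpoint byte of the row group. Under this (assumed)
   * rule, a row group whose midpoint lies inside the file is claimed by
   * exactly one split, because consecutive splits do not overlap.
   */
  static boolean ownsRowGroup(long splitOffset, long splitLength,
                              long rowGroupOffset, long rowGroupLength) {
    long midpoint = rowGroupOffset + rowGroupLength / 2;
    return midpoint >= splitOffset && midpoint < splitOffset + splitLength;
  }
}
```

For example, a row group at offset 20 with length 30 has its midpoint at
byte 35, so the split covering bytes [32, 64) would scan it and the
split covering bytes [0, 32) would skip it.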

This patch marks Parquet as a splittable file format, which allows
synthesizeBlockMetadata() to split a Parquet file on S3 into multiple
"blocks" instead of assigning one scan range per file. Scan ranges are
then distributed more evenly across the cluster, which greatly reduces
skew.
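
As a rough illustration of the idea, the sketch below cuts a file into
consecutive block-sized ranges. It is not the real
synthesizeBlockMetadata(); the class name, method name, and return type
are hypothetical, and it only shows the offset and length arithmetic.

```java
// Hypothetical sketch; not Impala's actual synthesizeBlockMetadata().
final class SyntheticBlocks {
  private SyntheticBlocks() {}

  /**
   * Splits a file of fileLength bytes into consecutive ranges of at most
   * blockSize bytes. Returns one {offset, length} pair per range.
   */
  static long[][] split(long fileLength, long blockSize) {
    if (fileLength < 0 || blockSize <= 0) {
      throw new IllegalArgumentException(
          "fileLength must be >= 0 and blockSize must be > 0");
    }
    // Ceiling division: the last range may be shorter than blockSize.
    long full = fileLength / blockSize;
    long remainder = fileLength % blockSize;
    int numBlocks = Math.toIntExact(full + (remainder == 0 ? 0 : 1));
    long[][] ranges = new long[numBlocks][2];
    for (int i = 0; i < numBlocks; i++) {
      long offset = i * blockSize;
      ranges[i][0] = offset;
      ranges[i][1] = Math.min(blockSize, fileLength - offset);
    }
    return ranges;
  }
}
```

With a 32MB block size, a 100MB file yields four ranges: three of 32MB
and a final one of 4MB, rather than a single 100MB scan range.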

P.S.: To control the size of scan ranges for splittable files on S3,
change the "block" size for the S3A filesystem, which is governed by
"fs.s3a.block.size". Its default value is 32MB.

Change-Id: Ib1518ad0c89ef35a3b0567c3902e85a41e34bc3d
---
M fe/src/main/java/com/cloudera/impala/catalog/HdfsFileFormat.java
1 file changed, 1 insertion(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/68/2968/3
-- 
To view, visit http://gerrit.cloudera.org:8080/2968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib1518ad0c89ef35a3b0567c3902e85a41e34bc3d
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
