impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Date Thu, 03 Nov 2016 22:57:44 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input

Patch Set 7:


Be changes look pretty good to me.
File common/thrift/DataSinks.thrift:

Line 70:   /// partition keys, meaning partitions can be opened, written, and closed one by
I think we just use the double // here (and above)

Line 71:   4: required bool input_is_clustered;
remove trailing semicolon
File fe/src/main/java/org/apache/impala/planner/

Line 51:   protected final boolean input_is_clustered_;
File fe/src/main/java/org/apache/impala/planner/

Line 85:       boolean overwrite, boolean ignoreDuplicates, boolean input_is_clustered) {
use Java style: inputIsClustered
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
do we have coverage of the clustered with and without shuffling?
File tests/query_test/

Line 112:   def test_insert_test(self, vector):
File tests/query_test/

Line 481:     table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database)
don't we need to add the filesystem prefix for S3 and local FS?

get_fs_path() in

Alternatively, disable this test for those cases. Are we sure only a single file will be created
on S3?

Might be good to give this patch a private try on S3

Line 489:     insert_stmt = """insert into {0} partition(year, month) /*+ clustered */
add shuffle hint to make sure we shuffle

Line 507:       pytest.skip("only runs in exhaustive")
give reason. takes too long?

Line 509:     table_path = "test-warehouse/{0}.db/insert_clustered".format(unique_database)
same comment about the fs prefix

Line 534:     insert_stmt = """insert into {0} partition(l_returnflag) /*+ clustered */ select
does this test make any assumptions about whether the shuffling behavior? better to enforce
with a hint

Line 554:     expected_partition_files = [("l_returnflag=A", 7),
it feels like this could become flaky relatively easily (changing compression or parquet format
details), maybe relax the condition to check that the number of files are within some bound?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message