impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Date Fri, 04 Nov 2016 11:31:25 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input

Patch Set 8:


Thanks for the reviews, please see PS8. I will update again once the private builds for S3
and local FS have finished.
File be/src/exec/

Line 530:   DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
> Doesn't the above check need to be the same as here?
My assumption was that InitOutputPartition wouldn't be called if key==ROOT_PARTITION_KEY,
since that partition would exist already. However that seems to be wrong. The key isn't passed
into InitOutputPartition so we cannot repeat the check there.
File common/thrift/DataSinks.thrift:

Line 70:   // partition keys, meaning partitions can be opened, written, and closed one by
> I think we just use the double // here (and above)
Done. The one above was me some time ago, too :)

Line 71:   4: required bool input_is_clustered
> remove trailing semicolon
Done, and elsewhere in the file.
File fe/src/main/java/org/apache/impala/planner/

Line 51:   protected final boolean inputIsClustered_;
> inputIsClustered_
File fe/src/main/java/org/apache/impala/planner/

Line 85:       boolean overwrite, boolean ignoreDuplicates, boolean inputIsClustered) {
> use Java style: inputIsClustered
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered,noshuffle */
> I missed it earlier that we don't have tests for unpartitioned inserts with
Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch() to make sure
we're performing a partitioned insert.

However I'm not sure now whether we actually should allow specifying /*+ clustered */ for
non-partitioned tables at all, since it won't have any impact on the plan and will just be
silently ignored.
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
> do we have coverage of the clustered with and without shuffling?
So far we only had coverage in the FE. Added a test. Should we run all tests with shuffle
and noshuffle? I assume shuffle makes more sense since we mostly will want to use clustered
with large inserts. I added it to the other tests.
File tests/query_test/

Line 112:   def test_insert_test(self, vector):
> ?
This makes it possible to test only "insert.test" by specifying "-k test_insert_test" to impala-py.test.
A lot of our BE tests have this issue of not being able to just select the main test by -k.
Should I revert this, or add a comment? It's explained in the commit message.
File tests/query_test/

Line 481:     table_path = get_fs_path(
> don't we need to add the filesystem prefix for S3 and local FS?
I wasn't aware that the S3 and local FS cases were handled by the Hdfs code. I added get_fs_path()
calls and started private runs for both S3 and local FS and will update this once they're

Line 489: 
> add shuffle hint to make sure we shuffle

Line 507:     # This test takes about 30 seconds and we are unlikely to break it, so only
run it in
> give reason. takes too long?
Added a comment.

Line 509:     if self.exploration_strategy() != 'exhaustive':
> same comment about the fs prefix
See comment above.

Line 534: 
> does this test make any assumptions about whether the shuffling behavior? b
No, it doesn't make any assumptions. I added the hint.

Line 554:                l_returnflag
> it feels like this could become flaky relatively easily (changing compressi
What would be good values here? Should we leave it and see if things break and then gradually

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message