impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Date Fri, 04 Nov 2016 11:31:25 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................


Patch Set 8:

(14 comments)

Thanks for the reviews, please see PS8. I will update again once the private builds for S3
and local FS have finished.

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 530:   DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
> Doesn't the above check need to be the same as here?
My assumption was that InitOutputPartition wouldn't be called if key==ROOT_PARTITION_KEY,
since that partition would exist already. However that seems to be wrong. The key isn't passed
into InitOutputPartition so we cannot repeat the check there.


http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

Line 70:   // partition keys, meaning partitions can be opened, written, and closed one by
one.
> I think we just use the double // here (and above)
Done. The one above was me some time ago, too :)


Line 71:   4: required bool input_is_clustered
> remove trailing semicolon
Done, and elsewhere in the file.


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 51:   protected final boolean inputIsClustered_;
> inputIsClustered_
Done


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java
File fe/src/main/java/org/apache/impala/planner/TableSink.java:

Line 85:       boolean overwrite, boolean ignoreDuplicates, boolean inputIsClustered) {
> use Java style: inputIsClustered
Done


http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered,noshuffle */
> I missed it earlier that we don't have tests for unpartitioned inserts with
Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch() to make sure
we're performing a partitioned insert.

However I'm not sure now whether we actually should allow specifying /*+ clustered */ for
non-partitioned tables at all, since it won't have any impact on the plan and will just be
silently ignored.


http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
> do we have coverage of the clustered with and without shuffling?
So far we only had coverage in the FE. Added a test. Should we run all tests with shuffle
and noshuffle? I assume shuffle makes more sense since we mostly will want to use clustered
with large inserts. I added it to the other tests.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

Line 112:   def test_insert_test(self, vector):
> ?
This makes it possible to test only "insert.test" by specifying "-k test_insert_test" to impala-py.test.
A lot of our BE tests have this issue of not being able to just select the main test by -k.
Should I revert this, or add a comment? It's explained in the commit message.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 481:     table_path = get_fs_path(
> don't we need to add the filesystem prefix for S3 and local FS?
I wasn't aware that the S3 and local FS cases were handled by the Hdfs code. I added get_fs_path()
calls and started private runs for both S3 and local FS and will update this once they're
done.


Line 489: 
> add shuffle hint to make sure we shuffle
Done


Line 507:     # This test takes about 30 seconds and we are unlikely to break it, so only
run it in
> give reason. takes too long?
Added a comment.


Line 509:     if self.exploration_strategy() != 'exhaustive':
> same comment about the fs prefix
See comment above.


Line 534: 
> does this test make any assumptions about whether the shuffling behavior? b
No, it doesn't make any assumptions. I added the hint.


Line 554:                l_returnflag
> it feels like this could become flaky relatively easily (changing compressi
What would be good values here? Should we leave it and see if things break and then gradually
adapt?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message