impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5412 Scan returns wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.
Date Wed, 09 Aug 2017 15:54:53 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5412 Scan returns wrong partition-column values when scanning multiple
partitions pointing to the same filesystem location.
......................................................................


Patch Set 1:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 90:           context->partition_descriptor()->id(),stream_->filename()));
nit:space after ,


http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

PS1, Line 266: string
Do we need to call the string constructor explicitly? Seems a bit weird.


PS1, Line 634: name) {
Nit: long line > 90 chars


Line 648: void* HdfsScanNodeBase::GetFileMetadata(
Nit: I think this line fits in 90 characters.


http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 197:   /// Allocate a new scan range object, stored in the runtime state's object pool.
For
This comment needs updating since partition_id is now required.


Line 360:   struct pair_hash {
We have some utilities like this already in util/container-util.h, let's move it there.


Line 363:         return std::hash<T2>{}(p.second);
Let's hash both values in the pair and combine them with boost::hash_combine. The current
approach probably works ok for this use case but it seems better to implement a general-purpose
pair hash that can be reused, given it's not much more work.


http://gerrit.cloudera.org:8080/#/c/7625/1/tests/metadata/test_partition_metadata.py
File tests/metadata/test_partition_metadata.py:

Line 43:     """Regression test for IMPALA-597. Verifies Impala is able to properly read
Can we run this test for Parquet too?

TestInsertQueries has text and parquet in its test matrix and uses the "STORED AS" clause
too, so we can copy its approach.


Line 71:     data = self.execute_scalar("select sum(i), sum(j) from %s" % FQ_TBL_NAME)
Can we run this query with num_nodes=1 to verify that the same bug doesn't exist with text
file?


Line 80:   def test_multiple_partitions_same_location_avro(self, vector, unique_database):
Test looks good, I think we should just make sure we address the full test coverage gap for
the other file formats too.

Let's run this test for the other affected formats with an unsupported writers - SequenceFile.

If we change the test matrix for this class as I suggest above, we probably want to pull out
this test into a separate class with a different test matrix.


PS1, Line 81: imapala
Impala


Line 110:     # (note, that shortcoming on avro writer would result the inserted value as
NULL,
How about we only query the second column to avoid this complication?


-- 
To view, visit http://gerrit.cloudera.org:8080/7625
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie74b305377248045c0d87b911943e1cabb7223e9
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Gabor Kaszab <gaborkaszab@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Laszlo Gaal <laszlo.gaal@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message