impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4817: Populate Parquet Statistics for Strings
Date Mon, 10 Apr 2017 16:20:06 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
......................................................................


Patch Set 2:

(7 comments)

I had a few high-level comments/questions.

http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG
Commit Message:

Line 25: TODO: This change still needs tests reading statistics from files which
We'll probably need to check in some files written with the old stats, right? Since we'll
switch the version of Hive at some point. Might be best to just do this now so that we've
got the test coverage.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 559:     // TODO: Here we need to pass in the column order (signed vs unsigned).
Are you going to fix that in the CR?


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 44: /// value (as opposed to their binary representation).
Add a comment to describe ordering for strings, to be consistent with specifying these orders.


Line 150: class ColumnStats : public TypedColumnStatsBase<T> {
I'm confused why we need a three-level hierarchy with additional template specialisation.
I don't understand what having the TypedColumnStatsBase level to the hierarchy buys us.

If we're going to use template specialisation it seems like we should be able to collapse
ColumnStats and TypedColumnStatsBase into one somehow


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 103:   // TODO: How to do memory management here?
Fixed?


http://gerrit.cloudera.org:8080/#/c/6563/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 437:     """Test that Impala does not write statistics for char columns."""
Comment out-of-date?


Line 445:     ]
Can we add some tests involving "interesting" characters. E.g. '\0' and some bytes >= 128.
I think these should "just work" in our implementation but it would be good to have the coverage,
since these things were broken with the old Hive stats.


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message