impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Fri, 27 Jan 2017 01:08:22 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 6:

(13 comments)

Thanks for the review. Please see PS7.

http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS6, Line 136: <
> <=?
Done


Line 145:   virtual void Reset() {
> Do we need to reset the row group stats at this point?
Yes, good catch. In PS5 this was done in ColumnWriter::Reset() and I added it there again.


PS6, Line 178: ProcessValue
> Maybe AppendValue() or something like that? Process is a little vague.
Marcel had suggested that name, but I'm good with either. Marcel, do you have a strong preference?


Line 389:     bool ret = bool_values_->PutValue(v, 1);
> Consider something like:
Done, though it has the same number of lines, but now uses two return statements. Is there
a strong reason to prefer one over the other?


PS6, Line 644: <
> <=?
Done


PS6, Line 648: column
> row group
Done


http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 105:   static const int MAX_COLUMN_STATS_SIZE = 4 * 1024;
> Mention that this value came from parquet-mr?
Done


http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS6, Line 125: min
> Others might have different opinions but I think I'd prefer making this eve
For this change I would think it is ok to handle min/max implicitly, since we only add support
for numerical types. When adding more complex types we could add a comparator method to ColumnStats<T>
that does the comparison in a type specific way for all types where we want to have a more
explicit implementation. I'm also good with doing that right now, but I added a TODO first
to get additional feedback from other reviewers first.


Line 139:     out->__set_min(min_str);
> It probably doesn't really matter, but I think using move() here would a co
Done


PS6, Line 150: ((uint8_t*)
> reinterpret_cast
Done


Line 183: /// parquet-mr and subsequently Hive currently do not handle the following types
> We should file an Impala JIRA to keep track of this too
I created separate JIRAs and added them to the comment.


http://gerrit.cloudera.org:8080/#/c/5611/6/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 208: class TestHdfsParquetTableStatsWriter(ImpalaTestSuite):
> It would be good to have some kind of test for multiple row groups so that 
Good idea! As you suspect, it is hard to deterministically write row groups across multiple
files. I added a test using the new sortby() hint.


Line 309:       (0, 7299),
> It would be nice to test a table with a negative number to make sure that t
Done. We don't have tables with negative numbers in them in 'functional' (TIL), so I used
a view.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Brown <mikeb@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+gerrit@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message