impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining types
Date Fri, 05 May 2017 23:37:31 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining
types
......................................................................


Patch Set 6:

(4 comments)

Please see my inline comments, especially on CHAR support. I'd like to clarify on them before
pushing a new PS.

http://gerrit.cloudera.org:8080/#/c/6563/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 543:     if (col_idx < col_orders.size()) col_order = &col_orders[col_idx];
> what does this mean if col_orders is not empty? (does the spec allow a part
The spec is not clear to me, so I assumed that a different writer may write it partially.
In that case we cannot assume it is empty, but we could assume the stats are broken and ignore
them for the whole row group. Should we do that? Alternatively we could document my assumption
in a comment.


http://gerrit.cloudera.org:8080/#/c/6563/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 1016:   file_metadata_.__isset.column_orders = true;
> do we have a test that checks the integrity of generated parquet data? if s
I added a test to TestHdfsParquetTableWriter. In addition, if these don't get set, other stats
tests will fail.


http://gerrit.cloudera.org:8080/#/c/6563/6/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

Line 104:         StringValue* result = reinterpret_cast<StringValue*>(slot);
> why no padding here?
I assumed that the comparisons on StringValues work on variable length data.

I wasn't aware that our comparisons are broken. Do you have a JIRA? Should we just disable
support for stats for CHAR columns?


http://gerrit.cloudera.org:8080/#/c/6563/6/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test:

Line 418: # Test CHAR type columns.
> our char implementation is broken, i'm not sure these tests are meaningful.
See my comment in the stats.cc file. Should we remove stats support for CHAR columns altogether?


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message