impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Tue, 31 Jan 2017 16:17:04 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 10:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 136:     if (row_group_stats_base_->has_values()
oops, i missed that bug


Line 339: 
> Yes, the dictionary encoders use it as the size of the encoded values. Shou
sounds good


http://gerrit.cloudera.org:8080/#/c/5611/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 139:       row_group_stats_base_->EncodeToThrift(&encoded);
why not just pass in metadata->statistics and then set the __isset flag by hand?


Line 652:     page_stats_base_->EncodeToThrift(&encoded);
avoid copy by passing in header.data_page_header.statistics


http://gerrit.cloudera.org:8080/#/c/5611/10/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103:   /// Maximum statistics size. If the size of a single thrift Statistics struct
for a page
qualify as 'parquet.Statistics' so it's clearer


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 127:     }
> The stats should follow the ordering semantics defined in parquet-format. T
i understand how that may not be the case today, but in order for them to be useful, they
need to mirror the ordering semantics of the underlying type. meaning i still don't see why
there should be stats-specific comparisons and why we want this todo.


http://gerrit.cloudera.org:8080/#/c/5611/10/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 84:   // clang-format off
remove


Line 87:       || std::is_same<bool, T>::value
indent the subsequent lines belonging to the logical expr two more spaces (style guide)


Line 157:     // clang-format off
this is very verbose. why needed?


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Brown <mikeb@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+gerrit@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message