impala-reviews mailing list archives

From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Sun, 29 Jan 2017 22:19:21 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 9:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 134:   void EncodeColumnStats(ColumnMetaData* meta_data) {
find a better name. 'column stats' is not a thrift concept. these are specifically row group
stats.


Line 236:   // Created and set by the derived class.
owner? same for the other pointer members.


Line 339:   int64_t encoded_value_size_;
this seems to be the plain encoding size. even for dict-encoded cols?


Line 347:   // Tracks statistics per row group. This gets reset when starting a new file.
hopefully when starting a new row group


Line 643:   DCHECK(page_stats_base_ != nullptr);
how does this handle unsupported types?


Line 1028:     columns_[i]->EncodeColumnStats(&current_row_group_->columns[i].meta_data);
where do the row group stats get reset?
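
To make the question concrete, here is a minimal sketch of one place a reset could live: at the point where a finished row group's stats are handed off. All names here (MinMaxStats, ColumnWriterSketch, FlushRowGroup) are hypothetical and are not taken from the patch; the sketch only shows the lifecycle being asked about.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-column min/max tracker (illustration only).
struct MinMaxStats {
  bool has_values = false;
  int64_t min = 0;
  int64_t max = 0;

  void Update(int64_t v) {
    if (!has_values) {
      min = max = v;
      has_values = true;
      return;
    }
    min = std::min(min, v);
    max = std::max(max, v);
  }

  void Reset() { *this = MinMaxStats(); }
};

// Hypothetical column writer that tracks stats for the current row group.
class ColumnWriterSketch {
 public:
  void Append(int64_t v) { row_group_stats_.Update(v); }

  // Called once per completed row group: hands back the stats collected so
  // far and resets the tracker so the next row group starts clean.
  MinMaxStats FlushRowGroup() {
    MinMaxStats finished = row_group_stats_;
    row_group_stats_.Reset();
    return finished;
  }

 private:
  MinMaxStats row_group_stats_;
};
```

Tying the reset to the same call that emits the stats makes it hard to carry values from one row group into the next by accident.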


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103:   /// Maximum statistics size. If the combined size of the min and max values of
does this refer to a single thrift Statistics struct? if so, spell that out.


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 65:   void EncodeToThrift(T* parent) const {
this feels more convoluted than it needs to be. i think it would be better for this class
only to deal with thrift::Statistics and let the caller make the appropriate __set_xxx call
(which means you won't need a templatized function).
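
A rough sketch of the shape suggested here, assuming stand-in structs in place of the Thrift-generated ones. Statistics, ColumnMetaData and DataPageHeader below only mimic the generated __set_xxx setters, and Int32Stats is a hypothetical class rather than the one in the patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Stand-ins for Thrift-generated structs (illustration only).
struct Statistics {
  std::string min;
  std::string max;
  void __set_min(const std::string& v) { min = v; }
  void __set_max(const std::string& v) { max = v; }
};

struct ColumnMetaData {
  Statistics statistics;
  bool has_statistics = false;
  void __set_statistics(const Statistics& s) {
    statistics = s;
    has_statistics = true;
  }
};

struct DataPageHeader {
  Statistics statistics;
  bool has_statistics = false;
  void __set_statistics(const Statistics& s) {
    statistics = s;
    has_statistics = true;
  }
};

// Hypothetical stats class that only knows about Statistics.
class Int32Stats {
 public:
  void Update(int32_t v) {
    if (!has_values_) {
      min_ = max_ = v;
      has_values_ = true;
      return;
    }
    min_ = std::min(min_, v);
    max_ = std::max(max_, v);
  }

  // Fills 'out' and returns true if any value was seen; the caller decides
  // which parent struct receives the result.
  bool EncodeToThrift(Statistics* out) const {
    if (!has_values_) return false;
    out->__set_min(EncodeLittleEndian(min_));
    out->__set_max(EncodeLittleEndian(max_));
    return true;
  }

 private:
  // 4-byte little-endian encoding, independent of host byte order.
  static std::string EncodeLittleEndian(int32_t v) {
    uint32_t u = static_cast<uint32_t>(v);
    std::string out(4, '\0');
    for (int i = 0; i < 4; ++i) out[i] = static_cast<char>((u >> (8 * i)) & 0xff);
    return out;
  }

  bool has_values_ = false;
  int32_t min_ = 0;
  int32_t max_ = 0;
};
```

A caller would then write something like `Statistics s; if (stats.EncodeToThrift(&s)) meta_data.__set_statistics(s);` for either parent type, with no template parameter on the stats class.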


Line 88:   // We explicitly require types to be listed here in order to support column statistics.
i don't understand, i thought those listed types are specifically not supported. what exactly
does this do?


Line 90:   // follow the ordering semantics of parquet's min/max statistics for the new type.
what are the ordering semantics? (that order as byte sequence == value order?)
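
The distinction matters because the two orders can disagree. The sketch below (hypothetical helper names, not code from the patch) encodes int32 values as 4 little-endian bytes and compares them as unsigned byte sequences; it illustrates the question and makes no claim about what the Parquet format actually specifies.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical helper: 4-byte little-endian encoding of an int32.
std::string PlainEncodeInt32(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  std::string out(4, '\0');
  for (int i = 0; i < 4; ++i) out[i] = static_cast<char>((u >> (8 * i)) & 0xff);
  return out;
}

// Compares two encoded values as sequences of unsigned bytes.
bool BytewiseLess(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(),
      [](char x, char y) {
        return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
      });
}
```

With these helpers, 1 < 256 numerically, yet the encoding of 256 (00 01 00 00) sorts before the encoding of 1 (01 00 00 00) bytewise, so a comparator on the encoded bytes would not give value order for this type.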


Line 97:       T>::type;
i find the formatting hard to decipher. please reformat by hand (for instance, by moving the
first is_arithmetic to a new line, which would make the argument grouping clearer).


Line 127:       // statistics behavior from any implicit behavior of the types?
but shouldn't the stats reflect the behavior of the underlying types? i.e., why should the stats
'<' be any different than the '<' of the underlying type?


Line 148:   /// Encodes a single value into an output string using parquet's plain encoding.
'an output string' makes it sound like this gets converted into a string type, ie, byte_array
in parquet parlance. but plain encoding requires int32, int64, etc., parquet types. you're
encoding as 'plain', stored in a binary string. best to make that clear in the comment. (also,
what does 'output' mean here?)


Line 159:     return encoded_value_size_ < 0 ? ParquetPlainEncoder::ByteSize<T>(v) :
reformat by hand


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 89:   static int ByteSize(const T& v) { return sizeof(T); }
does this function make sense at all? why not simply call sizeof()?
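
For context, one situation in which a wrapper like this can earn its keep is templated code that also handles variable-length types, where sizeof() does not give the encoded size. The sketch below uses a hypothetical PlainByteSize with std::string standing in for a variable-length value; it is not Impala's ParquetPlainEncoder, and whether that reasoning applies to the line above is for the patch author to say.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encoded size of a fixed-width value is just sizeof(T).
template <typename T>
int PlainByteSize(const T& v) {
  (void)v;  // unused for fixed-width types
  return static_cast<int>(sizeof(T));
}

// Variable-length values are plain-encoded as a 4-byte length followed by
// the bytes, so the encoded size depends on the value, not on the type.
template <>
inline int PlainByteSize<std::string>(const std::string& v) {
  return static_cast<int>(sizeof(int32_t) + v.size());
}
```

If only fixed-width types ever reach the call site, a plain sizeof() is equivalent and simpler.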


http://gerrit.cloudera.org:8080/#/c/5611/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

Line 90:   """Decode parquet statistics values that are encoded with PLAIN encoding."""
"that are encoded": do you mean "expects 'value' to be plain encoded"?

also, why is this specific to stats (as opposed to any plain-encoded value)?


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Brown <mikeb@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+gerrit@cloudera.com>
Gerrit-HasComments: Yes
