impala-reviews mailing list archives

From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Sun, 29 Jan 2017 22:19:21 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer

Patch Set 9:

File be/src/exec/

Line 134:   void EncodeColumnStats(ColumnMetaData* meta_data) {
find a better name. 'column stats' is not a thrift concept. these are specifically row group

Line 236:   // Created and set by the derived class.
owner? same for the other pointer members.

Line 339:   int64_t encoded_value_size_;
this seems to be the plain encoding size. even for dict-encoded cols?

Line 347:   // Tracks statistics per row group. This gets reset when starting a new file.
hopefully when starting a new row group

Line 643:   DCHECK(page_stats_base_ != nullptr);
how does this handle unsupported types?

Line 1028:     columns_[i]->EncodeColumnStats(&current_row_group_->columns[i].meta_data);
where do the row group stats get reset?
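
a minimal sketch of the lifecycle being asked about, assuming a hypothetical writer that accumulates min/max per row group and resets the accumulator at the point where the row group is flushed. `MinMaxSketch`, `ColumnWriterSketch` and `FlushRowGroup` are illustrative names, not identifiers from the patch:

```cpp
#include <cstdint>

// Hypothetical min/max accumulator for a single int64 column.
struct MinMaxSketch {
  int64_t min_value = 0;
  int64_t max_value = 0;
  bool has_values = false;

  void Update(int64_t v) {
    if (!has_values) {
      min_value = max_value = v;
      has_values = true;
      return;
    }
    if (v < min_value) min_value = v;
    if (v > max_value) max_value = v;
  }

  void Reset() { *this = MinMaxSketch(); }
};

// Hypothetical column writer: stats are scoped to the current row group.
class ColumnWriterSketch {
 public:
  void Append(int64_t v) { row_group_stats_.Update(v); }

  // Returns the stats of the row group being closed and resets the
  // accumulator, so the next row group starts from an empty state.
  MinMaxSketch FlushRowGroup() {
    MinMaxSketch result = row_group_stats_;
    row_group_stats_.Reset();
    return result;
  }

 private:
  MinMaxSketch row_group_stats_;
};
```

tying the reset to the flush keeps values from one row group from leaking into the next, independent of when a new file is started.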

File be/src/exec/hdfs-parquet-table-writer.h:

Line 103:   /// Maximum statistics size. If the combined size of the min and max values of
does this refer to a single thrift Statistics struct? if so, spell that out.

File be/src/exec/parquet-column-stats.h:

Line 65:   void EncodeToThrift(T* parent) const {
this feels more convoluted than it needs to be. i think it would be better for this class
only to deal with thrift::Statistics and let the caller make the appropriate __set_xxx call
(which means you won't need a templatized function).
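
a rough sketch of the shape suggested here, under stated assumptions: `Statistics` and `ColumnMetaData` below are simplified stand-ins for the thrift-generated structs (the real ones have more fields), and `Int32StatsSketch` and `WriteRowGroupStats` are hypothetical names. the stats class only fills a `Statistics`; the caller makes the `__set_xxx` call:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified stand-in for the thrift-generated Statistics struct.
struct Statistics {
  std::string min;
  std::string max;
  bool has_min = false;
  bool has_max = false;
};

// Simplified stand-in for the thrift-generated ColumnMetaData struct.
struct ColumnMetaData {
  Statistics statistics;
  bool has_statistics = false;
  void __set_statistics(const Statistics& s) {
    statistics = s;
    has_statistics = true;
  }
};

// Hypothetical stats tracker that only knows about Statistics.
class Int32StatsSketch {
 public:
  void Update(int32_t v) {
    if (!has_values_) {
      min_ = max_ = v;
      has_values_ = true;
      return;
    }
    if (v < min_) min_ = v;
    if (v > max_) max_ = v;
  }

  // Not a template: the output type is always Statistics.
  // Copies the raw bytes of min/max (host byte order) into 'out'.
  void EncodeToThrift(Statistics* out) const {
    assert(out != nullptr);
    if (!has_values_) return;
    out->min.assign(reinterpret_cast<const char*>(&min_), sizeof(min_));
    out->max.assign(reinterpret_cast<const char*>(&max_), sizeof(max_));
    out->has_min = true;
    out->has_max = true;
  }

 private:
  int32_t min_ = 0;
  int32_t max_ = 0;
  bool has_values_ = false;
};

// The caller decides which parent struct receives the Statistics.
inline void WriteRowGroupStats(const Int32StatsSketch& stats,
                               ColumnMetaData* meta_data) {
  Statistics s;
  stats.EncodeToThrift(&s);
  meta_data->__set_statistics(s);
}
```

the same `EncodeToThrift` could then serve any other parent struct that carries a `Statistics`, with only the caller's setter call changing.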

Line 88:   // We explicitly require types to be listed here in order to support column statistics.
i don't understand, i thought those listed types are specifically not supported. what exactly
does this do?

Line 90:   // follow the ordering semantics of parquet's min/max statistics for the new type.
what are the ordering semantics? (that order as byte sequence == value order?)
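
to illustrate why the distinction matters (this does not claim to state what the patch or the format intends): parquet's plain encoding stores INT32 as 4 little-endian bytes, and comparing those bytes as an unsigned byte sequence does not generally agree with comparing the values. `EncodeInt32LE` and `BytewiseLess` are hypothetical helpers written for this sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Encodes 'v' as 4 little-endian bytes, the layout plain encoding uses
// for INT32. Written with shifts so it does not depend on host byte order.
inline std::string EncodeInt32LE(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  std::string out(4, '\0');
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<char>((u >> (8 * i)) & 0xff);
  }
  return out;
}

// Compares two buffers lexicographically as unsigned bytes.
inline bool BytewiseLess(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
      });
}
```

for example, 1 encodes as `01 00 00 00` and 256 as `00 01 00 00`, so `BytewiseLess(EncodeInt32LE(256), EncodeInt32LE(1))` is true even though 1 < 256. a reader can only use min/max safely if the comment spells out which of the two orders applies.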

Line 97:       T>::type;
i find the formatting hard to decipher. please reformat by hand (for instance, by moving the
first is_arithmetic to a new line, which would make the argument grouping clearer).

Line 127:       // statistics behavior from any implicit behavior of the types?
but shouldn't the stats reflect the behavior of the underlying types? ie, why should the stats
'<' be any different than the '<' of the underlying type?

Line 148:   /// Encodes a single value into an output string using parquet's plain encoding.
'an output string' makes it sound like this gets converted into a string type, ie, byte_array
in parquet parlance. but plain encoding requires int32, int64, etc., parquet types. you're
encoding as 'plain', stored in a binary string. best to make that clear in the comment. (also,
what does 'output' mean here?)
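
a small sketch of the distinction, assuming fixed-size arithmetic types and a little-endian host (so that the in-memory bytes match the plain-encoded layout): the `std::string` is used purely as a byte buffer and may contain NUL bytes; it is not a string-typed value. `EncodePlainToBuffer` and `DecodePlainFromBuffer` are hypothetical names, not functions from the patch:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Copies the fixed-size representation of 'v' into a byte buffer.
// The result holds sizeof(T) raw bytes, not a textual rendering of 'v'.
template <typename T>
std::string EncodePlainToBuffer(const T& v) {
  static_assert(std::is_arithmetic<T>::value, "fixed-size types only");
  std::string buf(sizeof(T), '\0');
  std::memcpy(&buf[0], &v, sizeof(T));
  return buf;
}

// Inverse of EncodePlainToBuffer: expects exactly sizeof(T) bytes.
template <typename T>
T DecodePlainFromBuffer(const std::string& buf) {
  static_assert(std::is_arithmetic<T>::value, "fixed-size types only");
  assert(buf.size() == sizeof(T));
  T v;
  std::memcpy(&v, buf.data(), sizeof(T));
  return v;
}
```

an int64 always produces an 8-byte buffer here regardless of its value, which is what separates "plain-encoded bytes held in a string" from "converted to a string".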

Line 159:     return encoded_value_size_ < 0 ? ParquetPlainEncoder::ByteSize<T>(v)
reformat by hand

File be/src/exec/parquet-common.h:

Line 89:   static int ByteSize(const T& v) { return sizeof(T); }
does this function make sense at all? why not simply call sizeof()?
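
one situation in which such a function could earn its keep, offered as an assumption rather than as what the patch does: generic code needs a single call that also works for variable-length values, where `sizeof()` gives the wrong answer. in plain encoding a BYTE_ARRAY value is a 4-byte length followed by the bytes. `StringValueSketch` and `TotalEncodedSize` are hypothetical:

```cpp
#include <cstdint>

// Hypothetical variable-length value: pointer plus length.
struct StringValueSketch {
  const char* ptr;
  int32_t len;
};

// Generic case: fixed-size types encode to sizeof(T) bytes.
template <typename T>
int ByteSize(const T& v) {
  return static_cast<int>(sizeof(T));
}

// Variable-length case: 4-byte length prefix plus the payload,
// which sizeof(StringValueSketch) would not capture.
template <>
inline int ByteSize<StringValueSketch>(const StringValueSketch& v) {
  return static_cast<int>(sizeof(int32_t)) + v.len;
}

// Generic caller that works for both kinds of type.
template <typename T>
int TotalEncodedSize(const T* values, int n) {
  int total = 0;
  for (int i = 0; i < n; ++i) total += ByteSize(values[i]);
  return total;
}
```

if no such specialization exists or is needed, calling `sizeof()` directly is simpler, which is the point of the question.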

File tests/util/

Line 90:   """Decode parquet statistics values that are encoded with PLAIN encoding."""
"that are encoded": do you mean "expects 'value' to be plain encoded"?

also, why is this specific to stats (as opposed to any plain-encoded value)?


Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Michael Brown <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Reviewer: Zoltan Ivanfi <>
Gerrit-HasComments: Yes
