impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Mon, 30 Jan 2017 18:27:37 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 9:

(17 comments)

Thanks for the review. Please see my inline comments and the new PS.

http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS6, Line 178: ProcessValue
> that was just an off-the-cuff suggestion. if 'append' is more specific, the
I tend to favor ProcessValue, since Append (like Encode) leaves out the statistics part. While
ProcessValue is vague, it implies that the values is "consumed" afterwards, and that it may
be necessary to look at the comment for further details. That being said, how about "ConsumeValue"?
Does that imply ownership transfer too much?


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 134:   void EncodeColumnStats(ColumnMetaData* meta_data) {
> find a better name. 'column stats' is not a thrift concept. these are speci
Done


Line 236:   // Created and set by the derived class.
> owner? same for the other pointer members.
Done


Line 339:   int64_t encoded_value_size_;
> this seems to be the plain encoding size. even for dict-encoded cols?
Yes, the dictionary encoders use it as the size of the encoded values. Should we rename it
to plain_encoded_value_size_?


Line 347:   // Tracks statistics per row group. This gets reset when starting a new file.
> hopefully when starting a new row group
Yes, Done.


Line 643:   DCHECK(page_stats_base_ != nullptr);
> how does this handle unsupported types?
Currently the ColumnStats<T> gets created for all supported types, and the ColumnStats
c'tor ensures that the class can only be instantiated for supported types. However, for some
types we don't write out the statistics to the resulting parquet file by disabling the Update()
logic, thus ensuring that the stats objects stay empty. This allows us to keep the interface
of ColumnStats simpler.


Line 1028:     columns_[i]->EncodeColumnStats(&current_row_group_->columns[i].meta_data);
> where do the row group stats get reset?
Line 281 of this PS.


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103:   /// Maximum statistics size. If the combined size of the min and max values of
> does this refer to a single thrift Statistics struct? if so, spell that out
Done


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 65:   void EncodeToThrift(T* parent) const {
> this feels more convoluted than it needs to be. i think it would be better 
Done


Line 88:   // We explicitly require types to be listed here in order to support column statistics.
> i don't understand, i thought those listed types are specifically not suppo
We don't update statistics for these types, but creating ColumnStats for them is still supported.
This is because we want to support them soon. If we disallow creating them, then the caller
will have to deal with only creating ColumnStats for supported column types.


Line 90:   // follow the ordering semantics of parquet's min/max statistics for the new type.
> what are the ordering semantics? (that order as byte sequence == value orde
The values should be ordered according to what parquet-mr says (which is hopefully in line
with what Hive and Spark do). However, there is currently a lack of specification, especially
for logical types, that we're working on clarifying with the parquet-format project. I added
a comment explaining where to look for the ordering semantics.


Line 97:       T>::type;
> i find the formatting hard to decipher. please reformat by hand (for instan
Done


Line 127:       // statistics behavior from any implicit behavior of the types?
> but shouldn't the stats reflect the behavior of the underlying types. ie, w
The stats should follow the ordering semantics defined in parquet-format. Those are currently
underspecified, we're working on getting them fixed so we can write stats for other types,
too (see my other comments in this review). While I agree that they should ideally be the
same, I don't think we can guarantee or enforce that, since ultimately the parquet-format
project serves as the reference.


Line 148:   /// Encodes a single value into an output string using parquet's plain encoding.
> 'an output string' makes it sound like this gets converted into a string ty
Done


Line 159:     return encoded_value_size_ < 0 ? ParquetPlainEncoder::ByteSize<T>(v)
:
> reformat by hand
Done


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 89:   static int ByteSize(const T& v) { return sizeof(T); }
> does this function make sense at all? why not simply call sizeof()?
It is specialized for several types below. Should I add a comment here?


http://gerrit.cloudera.org:8080/#/c/5611/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

Line 90:   """Decode parquet statistics values that are encoded with PLAIN encoding."""
> "that are encoded": do you mean "expects 'value' to be plain encoded"?
I changed the comment. The notable difference between statistics and stats is that this method
does only support single-bit boolean values. Additionally, binary arrays in statistics don't
follow plain encoding semantics but are stored directly in the thrift objects. I extended
the comment to make it more clear.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Brown <mikeb@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+gerrit@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message