impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining types
Date Wed, 12 Apr 2017 18:48:16 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining
types
......................................................................


Patch Set 4:

(19 comments)

Thank you for your comments. Please see my replies and PS4.

http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG
Commit Message:

Line 25: test using that file.
> We'll probably need to check in some files written with the old stats, righ
I created a file using Hive and added tests to read from it.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 541:     bool stats_read = false;
> bad variable name: this is a plain-encoded value. thrift_stats sounds like 
Done


Line 546:     if (fn_name == "lt" || fn_name == "le") {
> why did you break this up instead of having ReadFromThrift do the extra wor
Reworked it so it resembles the old flow. The additional check whether to read deprecated
stats is now inside ReadFromThrift.


Line 556:     }
> explicit comparison
Done


Line 559:   }
> Are you going to fix that in the CR?
We don't handle unsigned columns in parquet files at all, so I'm currently leaning towards
not handling unsigned stats, too. It looks like PR46 will not contain the ColumnOrder field
after all. We should revisit this when we add support for unsigned logical types to the Parquet
scanners.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// This class, together with its derivatives, is used to compute column statistics
when
> track is really not a meaningful term here, it generally just means 'follow
Done


Line 36: /// We currently support writing the 'min_value' and 'max_value' fields in
> hopefully also min and max, no? tracking means reading/decoding.
Done


Line 44: /// - Numeric values (BOOLEAN, INT, FLOAT, DOUBLE, DECIMAL) are ordered by their
numeric
> Add a comment to describe ordering for strings, to be consistent with speci
Done


Line 46: ///
> why is that still not the case?
Done


Line 66:   virtual ~ColumnStatsBase() {}
> you changed the type and meaning of a parameter, but you didn't change the 
Done


Line 72:   static bool ReadFromThrift(const parquet::ColumnChunk& col_chunk,
> unclear what this does, because copies are usually returned or passed into 
Done


Line 146:   virtual int64_t BytesNeeded() const override;
> why can't this be collapsed into a 2-level hierarchy?
Done


Line 150:   /// Encodes a single value using parquet's PLAIN encoding and stores it into the
> I'm confused why we need a three-level hierarchy with additional template s
Done


Line 151:   /// binary string 'out'.
> protected follows public section
Done


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 34: inline void ColumnStats<T>::Update(const T& v) {
> hard to compare, please move the functions back to where they were. feel fr
Done


Line 103: inline void ColumnStats<bool>::EncodeValueToString(
> Fixed?
Yes, this seems to be the way to go.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 243:     string version_string = tokens[2];
> col_idx is unused.
We want to interpret the statistics based on the type of the column from our metadata, which
cannot be inferred from the parquet column type alone. It seems safer to me to rely on our
internal types here. Checking that the types found in the parquet file are compatible is done
in parquet-metadata-utils.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 60:   /// (<major>.<minor>.<patch>). Unspecified parts default to 0,
and extra parts are
> this sounds like it's doing some reading.
Done


http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 344:   BROTLI = 4;
> do we return an error when we see that codec?
Yes, we would catch that in ParquetMetadataUtils::ValidateColumn() where we validate the CompressionCodec
against a list of supported codecs.


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message