impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining types
Date Fri, 05 May 2017 14:58:54 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining
types
......................................................................


Patch Set 6:

(22 comments)

Thank you for your review. Please see PS6 and my inline comments.

http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 375:   scoped_ptr<ColumnStats<T>> page_stats_;
> why this change?
The stats now need to have their memory pool reset in Reset() (L297), so I followed the surrounding
code (DictEncoder). Should we instead keep the objects and pass the memory pool to page_stats_.Reset()?


http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 34: /// decode parquet::Statistics into slots.
> update
Done


Line 60:   /// Enum to select whether to read minimum or maximum statistics. Values do not
> "do not"
Done


Line 73:       const ColumnType& col_type, const parquet::ColumnOrder* col_order,
> const& isn't necessary here
Done


Line 82:   /// batch, since the batch memory will be released. Overwrite this method in derived
> is this useful for anything other than strings/var-len data? if not, make n
Done


Line 83:   /// classes to provide the functionality.
> ) {}
git-clang-format does this. If you feel strongly about it, I'll add a TODO for myself to add
the space before submitting the change.


Line 105:   /// 'col_order'. Otherwise, returns false.
> CanUse-? or should as in 'would be preferable'/'there's a moral imperative'
Done


Line 108: 
> i had to remind myself again what T was. would be useful to point out in th
Done


Line 127:         || std::is_same<bool, T>::value
> describe params
Done


Line 135:  public:
> you should point out that it may keep a reference to 'v' until Materialize.
Done


Line 144:       min_buffer_(mem_pool),
> this doesn't need to be part of the public interface, columstatsbase can be
Done


Line 152:   virtual void Merge(const ColumnStatsBase& other) override;
> static
Done. Making it static required passing an additional parameter.


Line 154: 
> you don't mention here that this is plain-encoded.
Done


Line 158:  protected:
> why -internal?
Done


Line 161:   static void EncodePlainValue(const T& v, int64_t bytes_needed, std::string*
out);
> why -internal?
Done


http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 28: inline void ColumnStats<T>::Update(const T& v) {
> again, you changed the meaning of a variable but not the name.
Done


Line 30:     has_values_ = true;
> this whole function does almost nothing. is it worth having it (extra level
Done


Line 189:     }
> why do you need T here?
It seemed cleaner to me to define it for all types. Removing T required moving the declaration
before its first usage.


http://gerrit.cloudera.org:8080/#/c/6563/4/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 563:   2: required i64 total_byte_size
> this has now changed, please take a look at
Done


http://gerrit.cloudera.org:8080/#/c/6563/4/testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test:

Line 138: select count(*) from deprecated_stats where int_col < 0 and timestamp_col !=
"2009-02-01 00:00:00"
> stray line?
Done


http://gerrit.cloudera.org:8080/#/c/6563/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test:

Line 290: select count(*) from functional_parquet.alltypessmall where string_col < "1"
> use a predicate that will result in filter? (so we can see it also works wi
I'm not sure I understand your comment. The test in L274 filters all rows using "<". Can
you post the query you have in mind?


http://gerrit.cloudera.org:8080/#/c/6563/4/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 60:     # 'timetuple' will the datetime __eq__ function forward an unknown equality check
to
> 'timetuple' will the datetime __eq__ function forward...
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message