impala-reviews mailing list archives

From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Tue, 17 Jan 2017 23:04:42 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(15 comments)

Thanks for the review, and apologies for the long delay. I ran into several issues when comparing
statistics with Hive, until I found PARQUET-251, which seems to cause Hive to write invalid stats.
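
To illustrate the failure mode (my own minimal example, not Hive's or parquet-mr's actual
code): PARQUET-251 is about the writer reusing a byte buffer across rows while the statistics
object keeps a reference into that buffer instead of a copy, so the recorded min/max silently
change afterwards. A self-contained C++ sketch:

  #include <algorithm>
  #include <cstring>
  #include <iostream>
  #include <string>

  // BrokenStats is hypothetical; it keeps a pointer into the caller's buffer
  // instead of copying the value, which is the essence of the bug.
  struct BrokenStats {
    const char* min_ = nullptr;
    size_t min_len_ = 0;
    void Update(const char* val, size_t len) {
      if (min_ == nullptr
          || std::memcmp(val, min_, std::min(len, min_len_)) < 0) {
        min_ = val;  // BUG: stores the pointer, not a copy
        min_len_ = len;
      }
    }
  };

  int main() {
    char buf[16];
    BrokenStats stats;
    std::strcpy(buf, "apple");
    stats.Update(buf, 5);
    std::strcpy(buf, "zebra");  // writer reuses the buffer for the next row
    stats.Update(buf, 5);
    // The true minimum is "apple", but min_ now points at "zebra".
    std::cout << std::string(stats.min_, stats.min_len_) << std::endl;
  }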

http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG
Commit Message:

Line 9: Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
> Do we have end-to-end testing that the column stats are correct? We really 
There is a simple test in test_insert_parquet.py, but it only checks row group statistics, not
page statistics. I think it will be easier to write more extensive tests once the read path
has been implemented.
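
For reference, the two levels such a test would need to cover live in different places in the
Parquet thrift metadata: ColumnMetaData.statistics for the row group and
DataPageHeader.statistics for each page. A rough sketch against the thrift-generated C++
types (the helper and the include path are assumptions; the field names are from
parquet.thrift):

  #include <string>
  #include "gen-cpp/parquet_types.h"  // thrift-generated; path is an assumption

  // Hypothetical helper, only to show where the two Statistics objects go.
  void AttachStats(parquet::ColumnMetaData* col_meta,
                   parquet::DataPageHeader* page_header,
                   const std::string& min, const std::string& max) {
    parquet::Statistics stats;
    stats.__set_min(min);  // plain-encoded minimum
    stats.__set_max(max);  // plain-encoded maximum

    // Row-group level: one Statistics per column chunk.
    col_meta->__set_statistics(stats);
    // Page level: one Statistics per data page header.
    page_header->__set_statistics(stats);
  }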


http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 90:   BaseColumnWriter(HdfsParquetTableWriter* parent, ExprContext* expr_ctx,
> I feel like this file is getting pretty large. Can we pull these classes ou
Done


Line 97:       num_values_(0),
> I think we call this Merge() in other places like aggregate functions.
Done
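
For readers without the context: the rename refers to the operation that folds one statistics
object into another. A condensed, self-contained sketch of the Update()/Merge() split (not
the actual Impala class):

  template <typename T>
  class ColumnStats {
   public:
    // Called once per value while a page is being written.
    void Update(const T& v) {
      if (!has_values_ || v < min_) min_ = v;
      if (!has_values_ || max_ < v) max_ = v;
      has_values_ = true;
    }

    // Called when a page is finalized: folds the page's stats into the
    // row-group stats, mirroring Merge() in aggregate functions.
    void Merge(const ColumnStats& other) {
      if (!other.has_values_) return;
      Update(other.min_);
      Update(other.max_);
    }

   private:
    bool has_values_ = false;
    T min_{};
    T max_{};
  };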


Line 101:       def_levels_(NULL),
> I think the memory management here needs a bit more explanation. I.e. when 
Done


Line 206:     // levels data and the encoded values.  If compression is enabled, this is the
> Do we know what ordering the parquet specification defines for each type? T
I added a comment to the class header. To me it looks safe to rely on the overloaded operators.
Should we explicitly stamp out the template class only for the supported types, so that new
types cannot be added without a deliberate change? Are there any particular classes
that you think may break in the future?
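
What I have in mind looks roughly like this (hypothetical file layout): keep the template
definition out of the header and explicitly instantiate it in the .cc file, so that a new
type fails at link time until someone adds it to the list deliberately:

  // In hdfs-parquet-table-writer.cc (or wherever ColumnStats ends up):
  // explicit instantiations for the supported Parquet types only.
  template class ColumnStats<bool>;
  template class ColumnStats<int32_t>;
  template class ColumnStats<int64_t>;
  template class ColumnStats<float>;
  template class ColumnStats<double>;
  template class ColumnStats<StringValue>;  // Impala's string type; the list is illustrative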


Line 224:   scoped_ptr<Codec> compressor_;
> I guess this is the explanation of Copy() I was looking for. I think we nee
Done


Line 288:         new ColumnStats<T>(parent_->per_file_mem_pool_.get(), encoded_value_size_));
> This can potentially use a lot of memory since the old strings are never fr
Done
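
(For the record, one way to keep this bounded, sketched from scratch rather than taken from
the patch: own the current extremes in std::string members, so replacing the minimum frees
the previous copy instead of accumulating old values in a file-lifetime pool.)

  #include <string>

  // Illustrative only; names do not match the actual writer.
  class StringColumnStats {
   public:
    void Update(const char* data, size_t len) {
      // Copy out of the transient row-batch buffer before keeping it.
      std::string v(data, len);
      if (!has_values_) {
        min_ = v;
        max_ = std::move(v);
        has_values_ = true;
      } else if (v < min_) {
        min_ = std::move(v);  // previous min_ storage is released here
      } else if (max_ < v) {
        max_ = std::move(v);  // previous max_ storage is released here
      }
    }

   private:
    bool has_values_ = false;
    std::string min_;
    std::string max_;
  };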


PS1, Line 349: value to h
> Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMem
Done


Line 485: 
> Can't we initialise most of these using the initialiser list? We should do 
I moved the initialization to Reset() so the correct memory pool can be passed, the same way it's done for dict_encoder_.


Line 598: 
> Can't we initialise most of these using the initialiser list? We should do 
I made the *_stats_ members inline, so their default initialization is sufficient. The *_stats_base
members are stored in the base class. I'm reluctant to move them into the c'tor, since for
all other column writers they are not initialized at construction time but only when Reset()
is called.
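
Condensed into a self-contained sketch (all names are simplifications of the *_stats_ /
*_stats_base members discussed above):

  struct ColumnStatsBase {
    virtual ~ColumnStatsBase() {}
  };

  template <typename T>
  struct ColumnStats : ColumnStatsBase {
    T min{}, max{};
  };

  struct BaseColumnWriter {
    // The base class sees the stats only through this pointer, which stays
    // null until the derived writer's Reset() runs.
    ColumnStatsBase* page_stats_base_ = nullptr;
    virtual void Reset() = 0;
    virtual ~BaseColumnWriter() {}
  };

  template <typename T>
  struct ColumnWriter : BaseColumnWriter {
    ColumnStats<T> page_stats_;  // inline member: default initialization suffices

    void Reset() override {
      page_stats_ = ColumnStats<T>();   // re-initialized here, not in the c'tor
      page_stats_base_ = &page_stats_;  // wire up the base-class view
    }
  };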


Line 628:   // TODO: copy repetition data when we support nested types.
> Do we need scoped_ptr? Can we just store these inline?
Done


http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 214: 
> Extra line
Done


Line 281:     'hive_skip_col_idx' are excluded from the verification of the expected values.
> I think we need to find a way to test timestamp.
Done


Line 293: 
> We should also test the types that aren't in alltypes - char, varchar, deci
Done


Line 299:         (qualified_table_name, source_table, num_lines)
> Does this exercise the string copying code? If the table is too small it se
I added a test using tpch_parquet.customer and manually verified that the resulting .parq
file has multiple pages per column, thus exercising the string copy code. In hindsight, it
should already be sufficient to use a file with multiple row batches, which functional.alltypes
(7300 rows) should provide. I still think it makes sense to have a test with multiple pages
per column, so I'd opt to keep the new one, too.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes
