Date: Tue, 17 Jan 2017 23:04:42 +0000
From: "Lars Volker (Code Review)"
To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Tim Armstrong
Reply-To: lv@cloudera.com
Subject: [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
X-Gerrit-MessageType: comment
X-Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
X-Gerrit-Commit: 15ece15ad27f454b2bd51d7dbc10ad33ca267bc4

Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(15 comments)

Thanks for the review, and apologies for the long delay. I ran into several issues when comparing statistics with Hive, until I found PARQUET-251, which seems to cause invalid stats to be written by Hive.

http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG
Commit Message:

Line 9: Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
> Do we have end-to-end testing that the column stats are correct? We really
There is a simple test in test_insert_parquet.py, but it only checks row group statistics, not page statistics. I think it will be easier to write more extensive tests once the read path has been implemented.


http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 90: BaseColumnWriter(HdfsParquetTableWriter* parent, ExprContext* expr_ctx,
> I feel like this file is getting pretty large. Can we pull these classes ou
Done


Line 97: num_values_(0),
> I think we call this Merge() in other places like aggregate functions.
Done


Line 101: def_levels_(NULL),
> I think the memory management here needs a bit more explanation. I.e. when
Done


Line 206: // levels data and the encoded values. If compression is enabled, this is the
> Do we know what ordering the parquet specification defines for each type? T
I added a comment to the class header. To me it looks safe to rely on the overloaded operators. Should we try to explicitly stamp out the template class only for the supported types, to prevent new types from being added without manual intervention? Are there any particular classes that you think may break in the future?


Line 224: scoped_ptr compressor_;
> I guess this is the explanation of Copy() I was looking for. I think we nee
Done


Line 288: new ColumnStats(parent_->per_file_mem_pool_.get(), encoded_value_size_));
> This can potentially use a lot of memory since the old strings are never fr
Done


PS1, Line 349: value to h
> Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMem
Done


Line 485:
> Can't we initialise most of these using the initialiser list? We should do
I moved the initialization to Reset() to pass the correct memory pool, similar to dict_encoder_.


Line 598:
> Can't we initialise most of these using the initialiser list? We should do
I made the *_stats_ inline members, so their default initialization is sufficient. The *_stats_base members are stored in the base class. I'm reluctant to move them into the c'tor, since for all other column writers they will not be initialized at construction time, but only when Reset() is called.


Line 628: // TODO: copy repetition data when we support nested types.
> Do we need scoped_ptr? Can we just store these inline?
Done


http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 214:
> Extra line
Done


Line 281: 'hive_skip_col_idx' are excluded from the verification of the expected values.
> I think we need to find a way to test timestamp.
Done


Line 293:
> We should also test the types that aren't in alltypes - char, varchar, deci
Done


Line 299: (qualified_table_name, source_table, num_lines)
> Does this exercise the string copying code? If the table is too small it se
I added a test using tpch_parquet.customer and manually verified that the resulting .parq file has multiple pages per column, thus exercising the string copy code. In hindsight it should already be sufficient to use a file with multiple row batches, which functional.alltypes (7300 rows) should provide. I still think it makes sense to have a test with multiple pages per column, so I'd opt to keep the new one, too.
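
For context on the Merge()/Reset() discussion above, here is a rough sketch of the min/max tracking pattern being described: Update() per value, Merge() to fold page-level stats into row-group-level stats, Reset() between pages. This is illustrative only; the class name, members, and comments are assumptions for the sketch, not the actual ColumnStats implementation in hdfs-parquet-table-writer.cc.

  // Illustrative sketch only -- not the actual Impala implementation.
  // Tracks min/max for one page and merges page stats into row group stats.
  #include <algorithm>

  template <typename T>
  class ColumnStatsSketch {
   public:
    // Record one value. Relies on operator< being defined for T, which is
    // the "overloaded operators" assumption mentioned in the comments above.
    void Update(const T& v) {
      if (!has_values_) {
        min_ = v;
        max_ = v;
        has_values_ = true;
      } else {
        min_ = std::min(min_, v);
        max_ = std::max(max_, v);
      }
    }

    // Fold another stats object into this one, e.g. page stats into the
    // row group stats (cf. the Merge() naming discussion above).
    void Merge(const ColumnStatsSketch& other) {
      if (!other.has_values_) return;
      Update(other.min_);
      Update(other.max_);
    }

    // Clear state before starting the next page.
    void Reset() { has_values_ = false; }

   private:
    bool has_values_ = false;
    T min_;
    T max_;
  };

  // Hypothetical usage: per-page stats merged into per-row-group stats.
  //   ColumnStatsSketch<int64_t> page_stats, row_group_stats;
  //   page_stats.Update(42); page_stats.Update(-7);
  //   row_group_stats.Merge(page_stats);
  //   page_stats.Reset();  // start the next page

For string columns, the min/max values would additionally need to be deep-copied into memory that outlives the current row batch (the per_file_mem_pool_ and CopyValues/CopyMemory discussion above), since the batch memory is recycled before the row group is flushed.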
--
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Tim Armstrong
Gerrit-HasComments: Yes