impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Thu, 05 Jan 2017 22:38:22 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 1:

(15 comments)

I think the overall approach makes sense. I had some higher-level comments - I didn't do a
full final pass since there will probably be some churn from this round of comments.

http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG
Commit Message:

Line 9: TODO Check whether we can use python bindings for parquet-cpp or pyarrow
Do we have end-to-end testing that the column stats are correct? We really don't want to turn
this on before it's totally right, since if we write out bad data, then the files can exist
for a long time.

Until we are totally confident that we're writing out the correct stats, we should turn this
off by default and hide it behind an experimental flag.


http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 90: class HdfsParquetTableWriter::ColumnStatsBase {
I feel like this file is getting pretty large. Can we pull these classes out into a separate
header?

Also, do the classes need to be nested inside HdfsParquetTableWriter?


Line 97:   virtual void Extend(const ColumnStatsBase* other) = 0;
I think we call this Merge() in other places like aggregate functions.


Line 101:   virtual void CopyValues(MemPool* mem_pool){};
I think the memory management here needs a bit more explanation. I.e. when it is ok for the
object to reference memory that is not from 'mem_pool'.


Line 206:       min_value_ = min(min_value_, v);
Do we know what ordering the parquet specification defines for each type? The non-obvious
things for me are:

* How are NULL values handled/ordered?
* How are strings ordered?
* Is it always going to be the same as our own overloaded operators? This makes me uneasy
since if someone adds a new data type they probably won't think to look at the Parquet spec
before implementing the overloaded operator. I would  prefer this to be more explicit even
if it results in more code.

We should probably explain this in the class header. And also probably not rely on the overloaded
operators.


Line 224: // StringValues need to be copied at the end of processing a row batch, since the
batch
I guess this is the explanation of Copy() I was looking for. I think we need to document it
in the base class since it's in the base class API.


Line 288:     char* new_ptr = reinterpret_cast<char*>(mem_pool->Allocate(value->len));
This can potentially use a lot of memory since the old strings are never freed. I think using
StringBuffer would avoid that problem.


PS1, Line 349: CopyValues
Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMemory, since we don't
always copy the value, e.g. if it's an integer? Not sure what a good name would be.


Line 485:     page_stats_.reset(new ColumnStats<T>(encoded_value_size_));
Can't we initialise most of these using the initialiser list? We should do that for consistency.


Line 598:     page_stats_.reset(new ColumnStats<bool>());
Can't we initialise most of these using the initialiser list? We should do that for consistency.


Line 628:   scoped_ptr<ColumnStats<bool>> page_stats_;
Do we need scoped_ptr? Can we just store these inline?


http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 214: 
Extra line


Line 281:       # Timestamps are written differently by Hive and Impala so we don't validate
them
I think we need to find a way to test timestamp.


Line 293:         "functional.alltypes" % qualified_table_name)
We should also test the types that aren't in alltypes - char, varchar, decimal.


Line 299:     # clause ensures that the query is executed on the coordinator, resulting in
a single
Does this exercise the string copying code? If the table is too small it seems like it may
not.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message