impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
Date Thu, 19 Jan 2017 00:00:24 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(12 comments)

I need to read through the test code but I thought I'd flush out my comments about the implementation.
I think I understand how it all works now, but it would be good to make it a bit easier to
follow.

http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 206:     // levels data and the encoded values.  If compression is enabled, this is the
> I added a comment to the class header. To me it looks safe to rely on the o
Yeah I think it would be good to stamp out. I'm not sure how likely we are to run into problems
but it's hard to anticipate what new data types we might add.


Line 598: 
> I made the *_stats_ inline members, so their default initialization is suff
Ah ok, didn't realise that was the reason. I'll take a look at the new code.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 355:   // Tracks statistics per column. This gets reset when starting a new file.
I feel like the page/column terminology is confusing here. If page_stats_ is per page, shouldn't
this be per file?

Both of these are stats for the column, right?


Line 974:   for (int j = 0; j < columns_.size(); ++j) columns_[j]->CopyStatsMemory();
Why j?

for (BaseColumnWriter* column : columns_) would be more readable.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 104:   /// them.
Is this per value? I.e. you can have min + max add up to 8kb.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 33: /// Regarding the ordering of values, we follow the parquet-mr reference implementation.
Thanks, this is very helpful.


PS2, Line 35: DECIMAL
We write decimals as a big-endian FIXED_LEN_BYTE_ARRAY type, so the lexicographic order will
match the numeric order. We should call this out though since it's a bit subtle.

We should also mention that we don't support other encodings: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

Now that I think about it, decimal encoded into variable-length binary might be an example
where the logical ordering doesn't match the physical ordering, since 256 => "0x01 0x00"
would be sorted before 1 => "0x01" (if I understood everything correctly).


Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative
to UTC,
I think we convert them to INT96 for parquet. I think the INT96 order is equivalent to our
timestamp order, but we should be clear on this.


PS2, Line 88: initialized_
values_initialized_? 

We use initialized_ elsewhere in the backend to indicate whether an Init() method has been
called, so I initially thought this was the same thing when reading the code.


Line 165:   using TypedColumnStatsBase<T>::initialized_;
I don't understand why these "using"s are necessary.


PS2, Line 173: internal
Remove "internal"? I don't think it's necessary.


PS2, Line 266: && value->len == buffer->len()
I think the len check is redundant, right? It shouldn't point to the buffer unless it's the
right length already I think.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+gerrit@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message