impala-dev mailing list archives

From "Matthew Jacobs (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files
Date Tue, 19 Jul 2016 22:43:33 GMT
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................


Patch Set 5:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 381: Encoding may fail for several reasons - because the current page is not big enough,
             :     // because we've encoded the maximum number of unique dictionary values and
             :     // need to switch to plain encoding, etc. so we may need to try again more
             :     // than once.
I haven't spent a ton of time looking through all the table-writer code, so this may be
a non-issue, but I'm a bit worried that a subtle bug in EncodeValue()/FinalizeCurrentPage()/
NewPage() could lead to an infinite loop here, perhaps in corner cases with weird data. Is
there a clear set of state transitions? This relies on EncodeValue() behaving properly, and
it is hard to read this code and see why it is _obviously correct_. I don't think your change
increases the risk, but it's worth thinking about whether any DCHECKs could help.


http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

PS5, Line 133: We i
Remove we


PS5, Line 146: with our RLE scheme it is not possible to determine how many values
             : //     were actually written if the final run is a literal run, only if the
             : //     final run is a repeated run.
Why can't we determine how many values were written in a literal run?


PS5, Line 149: CheckDataPage
I think the decompression is getting tangled up with the memory management. How about
splitting the decompression out into a separate fn that takes both the compressed data buffer
and a buffer already allocated by the caller (which should be of size
header.uncompressed_page_size)? Then the fn that actually does the work of checking a data
page can just take a const uint8_t* to the uncompressed data.


PS5, Line 149: uint8_t* data
Please have the comment mention that data is decompressed if the header indicates it is compressed,
and that this is an in/out parameter that will return the uncompressed data.


PS5, Line 150: std::vector<uint8_t> decompressed_buffer;
Why is this stack allocated? Isn't this out of scope when this fn returns, even though you
return the pointer?


PS5, Line 171: *reinterpret_cast<int*>(data);
Can you add 1 sentence about the data layout or point to somewhere that does?


PS5, Line 174:  
nit: extra space


-- 
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes
