impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Date Wed, 01 Nov 2017 20:53:08 GMT
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 7:

(2 comments)

I think we'll need to do some more work on testing for the int32/64 patches - we don't have
pre-existing test files from parquet-mr. I think we'll have to generate some more test files
with parquet-mr for the other cases, and we could consider turning that code into a data generator
to generate more test files. From what I could tell Hive doesn't have a knob to generate some
of these alternative output encodings.

 I feel ok with the coverage since we have end-to-end tests then more exhaustive unit tests
for the various ways of encoding it.

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391
PS7, Line 391: fixed_len_size
Looked again. The variable name (and recycling the argument storage) is confusing. Maybe 'encoded_byte_size'?


http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int fixed_len_size, uint8_t*
buffer){
I took another look at the standard and it says that the minimum number of bytes required
to store the unscaled value should be used: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

I think this means that we should not be including any preceding "0" bytes. I.e. we should
not have a fixed_len_size argument and instead determine the size based on the number of leading
zero bytes in the value.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mjacobs@apache.org>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:53:08 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message