impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikramjeet Vig (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Date Thu, 19 Oct 2017 20:34:56 GMT
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................


Patch Set 4:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277
PS4, Line 1277:         switch (node.element->type) {
> This case is only going to get bigger with the follow on patches - it would
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376
PS4, Line 376: inline int ParquetPlainEncoder::Decode<Decimal4Value, parquet::Type::BYTE_ARRAY>(
> I think you can avoid stamping this out if you leave it still parameterized
Unfortunately, the call to Decode is explicit in our code, eg Decode<T,S> therefore
it will always look for a specialization of those template parameters.
Similarly, in the dummy program you wrote, if I explicitly call foo<int, double>(&a,
&x), this will look for a specialization of the double templated  method and return 0;

Since partial specialization is not possible, the only way to reduce redundant code will be
to do what Lars suggested earlier, i.e., create a helper method that is only templatized on
<typename T> and call that for each full specialization.
I will make changes accordingly in the next iteration of this patch.


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@237
PS2, Line 237: 
> I think this particular call here will always return sizeof(int32_t) (line 
should we do that everywhere else in this file wherever we specialize a method(both Decode
and encode)?


http://gerrit.cloudera.org:8080/#/c/7822/2/be/src/exec/parquet-common.h@338
PS2, Line 338:   return fixed_len_size;
> I'm not sure I'm following: it looks like the next three methods are exactl
Oh I see what you mean now. Done!


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66
PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type,
> let's make it static - we don't want to export the symbol to be linked with
Done. enclosed in an anonymous namespace.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70
PS4, Line 70: (
> unnecessary parens
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173
PS4, Line 173:     // TODO: Is this check too strict?
> I don't see why this shouldn't match the file metadata, this seems valid to
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182
PS4, Line 182:  parquet::SchemaElement*
> Why not pass by const ref like above?
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209
PS4, Line 209:         stringstream ss;
> Can you convert to using Substitute() while we're here? We've been very gra
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176
PS4, Line 176:   TestType<Decimal4Value, parquet::Type::BYTE_ARRAY>(Decimal4Value(test_val),
> It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. 
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: int32_t
> Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above
for var len decimals, its encoded size is bytes required to store size + least num of bytes
required to store num.

in this case bytes required to store numeric_limits<int32_t>::max() is sizeof(int32_t)
which is less than the sizeof(Decimal8Value) since Decimal8Value is stored using int64

I will add a comment at the top of these byte_array test to clarify what expected size is.
would you recommend I add more comments before limit tests?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212
PS4, Line 212:   for (int i = 1; i <=16; ++i) {
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@104
PS4, Line 104: PHYSICAL_TYPE
> Isn't this PHYSICAL_TYPE the internal type, which is called LOGICAL_TYPE in
reverted to 'T' to minimize change from previous code


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@199
PS4, Line 199: PHYSICAL_TYPE
> It doesn't seem like the whole decoding class needs to be templated by this
Done


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/util/dict-encoding.h@323
PS4, Line 323: DictDecoder<Decimal16Value, parquet::Type::BYTE_ARRAY>::GetNextValue(
> The logic here is also the same as above - we should find a way to avoid th
removed, as it is no longer needed since the template param is removed and moved to Reset()
only


http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS4: 
> Can you update the README to describe how these files were generated.
this file was generated a long while back and I picked it up from an abandoned patch here
https://gerrit.cloudera.org/#/c/5115/ 

Currently looking for a better way to generate this file using the java parquet client.



-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mjacobs@apache.org>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Oct 2017 20:34:56 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message