impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikramjeet Vig (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types
Date Fri, 01 Sep 2017 22:34:52 GMT
Bikramjeet Vig has uploaded a new patch set (#3).

Change subject: IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types
......................................................................

IMPALA-5628: (Draft) Add support for reading additional Parquet Decimal Types

Added support for reading Parquet decimal Types encoded in :
1) Int32
2) Int64
3) Binary (Variable size byte array)

A few points to look at:

1) the test table I generated to read INT64 encoded decimals shows null
values for the whole row if a decimal column exists with precision less
than or eqaul to 9 and encoded in Int64. If I alter table and change it
to 10 or above, it displays the data correctly. One can infer that there
is some issue with impala doing some internal conversion after reading
it from the parquet reader, since a precision less than 10 should be
stored using a Decimal4 (stores value in int32) but it gets a Decimal8
from the parquet reader. Any idea on what might be going wrong?

2) Currently the test for ParquetPlainEncoder contains templatized test
methods that encode and decode a passed value. Since we had encode and
decode methods for all types that we supported in both writer and reader,
this was fine. now to support the types added in this commit, where
only the reader supports it (hence only decode method exists),
I would either have to write a completely new test method or specialize
the existing test method for every permutation of <Decimal, Parquet::type>.
In any case a lot of new code will be added, which option do you think is better?

3) Waiting on confirmation that column stats should ne stored with the
exact same format (same <Converted Type, Primitive Type> as the column).
If we do get a confirmation that this is the correct. I would have to pass
a SchemaElement object of that column to ColumnStatsBase::ReadFromThrift
for handeling the case where the type is <Decimal, FIXED_LEN_BYTE_ARRAY>.
We would be able to use the type_length attribute to use the correctly
sized DecimalXValue in DecodePlainValue for reading the stat.

4) Will perf regression test be good enough or do you think we need
to do perf tests for the newly added types too?

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A testdata/data/binary_decimal_dictionary.parquet
A testdata/data/binary_decimal_no_dictionary.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
17 files changed, 550 insertions(+), 244 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/7822/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>

Mime
View raw message