drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4184) Drill does not support Parquet DECIMAL values in variable length BINARY fields
Date Tue, 09 Feb 2016 23:07:18 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139970#comment-15139970 ]

ASF GitHub Bot commented on DRILL-4184:
---------------------------------------

GitHub user daveoshinsky opened a pull request:

    https://github.com/apache/drill/pull/372

    DRILL-4184: support variable length decimal fields in parquet

    Support decimal fields in Parquet that are stored as variable length BINARY.  Parquet
files that store decimal values this way are often significantly smaller than ones that store
them as FIXED_LEN_BYTE_ARRAY values sized for the full precision.
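
As a rough illustration of the size difference described above, the following Java sketch
compares the bytes needed per value in each layout. The class and method names are hypothetical
and are not part of this pull request. The sketch assumes both layouts hold the unscaled value
as a big-endian two's-complement integer, and it only counts raw value bytes: actual file sizes
also depend on length prefixes, dictionary encoding, and compression.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical helper for illustration only; not Drill or Parquet API.
final class DecimalSizeSketch {

    // Smallest byte count whose signed two's-complement range can hold
    // any unscaled value with the given number of decimal digits.
    static int fixedWidthFor(int precision) {
        BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        // bitLength() excludes the sign bit, so this is ceil((bitLength + 1) / 8).
        return max.bitLength() / 8 + 1;
    }

    // Bytes needed for one particular value in its minimal two's-complement form.
    static int variableWidthFor(BigDecimal value) {
        return value.unscaledValue().toByteArray().length;
    }

    public static void main(String[] args) {
        int precision = 20;                              // e.g. DECIMAL(20,0)
        BigDecimal sample = new BigDecimal("70000020");  // value used in the JIRA query

        System.out.println("fixed width for precision " + precision + ": "
                + fixedWidthFor(precision) + " bytes");
        System.out.println("variable width for " + sample + ": "
                + variableWidthFor(sample) + " bytes");
    }
}
```

Under these assumptions a DECIMAL(20,0) column needs 9 bytes per value in a fixed-width layout,
while the sample value 70000020 fits in 4 bytes.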

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/daveoshinsky/drill master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #372
    
----
commit 9a47ca52125139d88adf39b5d894a02f870f37d9
Author: U-COMMVAULT-NJ\doshinsky <doshinsky@daveoshinsky-pc.gp.cv.commvault.com>
Date:   2016-02-09T22:37:47Z

    DRILL-4184: support variable length decimal fields in parquet

commit dec00a808c99554f008e23fd21b944b858aa9ae0
Author: daveoshinsky <doshinsky@daveoshinsky-pc.gp.cv.commvault.com>
Date:   2016-02-09T22:56:28Z

    DRILL-4184: changes to support variable length decimal fields in parquet

----


> Drill does not support Parquet DECIMAL values in variable length BINARY fields
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-4184
>                 URL: https://issues.apache.org/jira/browse/DRILL-4184
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.4.0
>         Environment: Windows 7 Professional, Java 1.8.0_66
>            Reporter: Dave Oshinsky
>
> Encoding a DECIMAL logical type in Parquet using the variable length BINARY primitive type is not supported by Drill as of versions 1.3.0 and 1.4.0.  The problem first surfaces with the ClassCastException shown below, but fixing the immediate cause of the exception is not sufficient to support this combination (DECIMAL, BINARY) in a Parquet file.
> In Drill, DECIMAL is currently assumed to be INT32, INT64, INT96, or FIXED_LEN_BYTE_ARRAY.  Are there any plans to support DECIMAL with variable length BINARY?  Avro definitely supports encoding DECIMAL in variable length bytes (see https://avro.apache.org/docs/current/spec.html#Decimal), but this support in Parquet is less clear.
> Selecting on a BINARY DECIMAL field in a parquet file throws an exception as shown below (java.lang.ClassCastException: org.apache.drill.exec.vector.Decimal28SparseVector cannot be cast to org.apache.drill.exec.vector.VariableWidthVector).  The successful query at bottom selected on a string field in the same file.
> 0: jdbc:drill:zk=local> select count(*) from dfs.`c:/dao/DBArchivePredictor/tenrows.parquet` where acct_no=70000020;
> org.apache.drill.common.exceptions.DrillRuntimeException: Error in parquet record reader.
> Message: Failure in setting up reader
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message sbi.acct_mstr {
>   required binary ACCT_NO (DECIMAL(20,0));
>   optional binary SF_NO (UTF8);
>   optional binary LF_NO (UTF8);
>   optional binary BRANCH_NO (DECIMAL(20,0));
>   optional binary INTRO_CUST_NO (DECIMAL(20,0));
>   optional binary INTRO_ACCT_NO (DECIMAL(20,0));
>   optional binary INTRO_SIGN (UTF8);
>   optional binary TYPE (UTF8);
>   optional binary OPR_MODE (UTF8);
>   optional binary CUR_ACCT_TYPE (UTF8);
>   optional binary TITLE (UTF8);
>   optional binary CORP_CUST_NO (DECIMAL(20,0));
>   optional binary APLNDT (UTF8);
>   optional binary OPNDT (UTF8);
>   optional binary VERI_EMP_NO (DECIMAL(20,0));
>   optional binary VERI_SIGN (UTF8);
>   optional binary MANAGER_SIGN (UTF8);
>   optional binary CURBAL (DECIMAL(8,2));
>   optional binary STATUS (UTF8);
> }
> , metadata: {parquet.avro.schema={"type":"record","name":"acct_mstr","namespace"
> :"sbi","fields":[{"name":"ACCT_NO","type":{"type":"bytes","logicalType":"decimal
> ","precision":20,"scale":0,"cv_auto_incr":false,"cv_case_sensitive":false,"cv_co
> lumn_class":"java.math.BigDecimal","cv_connection":"oracle.jdbc.driver.T4CConnec
> tion","cv_currency":true,"cv_def_writable":false,"cv_nullable":0,"cv_precision":
> 20,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_s
> ubscript":1,"cv_type":2,"cv_typename":"NUMBER","cv_writable":true}},{"name":"SF_
> NO","type":["null",{"type":"string","cv_auto_incr":false,"cv_case_sensitive":tru
> e,"cv_column_class":"java.lang.String","cv_currency":false,"cv_def_writable":fal
> se,"cv_nullable":1,"cv_precision":10,"cv_read_only":false,"cv_scale":0,"cv_searc
> hable":true,"cv_signed":true,"cv_subscript":2,"cv_type":12,"cv_typename":"VARCHA
> R2","cv_writable":true}]},{"name":"LF_NO","type":["null",{"type":"string","cv_au
> to_incr":false,"cv_case_sensitive":true,"cv_column_class":"java.lang.String","cv
> _currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":10,"cv_r
> ead_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript
> ":3,"cv_type":12,"cv_typename":"VARCHAR2","cv_writable":true}]},{"name":"BRANCH_
> NO","type":["null",{"type":"bytes","logicalType":"decimal","precision":20,"scale
> ":0,"cv_auto_incr":false,"cv_case_sensitive":false,"cv_column_class":"java.math.
> BigDecimal","cv_currency":true,"cv_def_writable":false,"cv_nullable":1,"cv_preci
> sion":20,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true
> ,"cv_subscript":4,"cv_type":2,"cv_typename":"NUMBER","cv_writable":true}]},{"nam
> e":"INTRO_CUST_NO","type":["null",{"type":"bytes","logicalType":"decimal","preci
> sion":20,"scale":0,"cv_auto_incr":false,"cv_case_sensitive":false,"cv_column_cla
> ss":"java.math.BigDecimal","cv_currency":true,"cv_def_writable":false,"cv_nullab
> le":1,"cv_precision":20,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"
> cv_signed":true,"cv_subscript":5,"cv_type":2,"cv_typename":"NUMBER","cv_writable
> ":true}]},{"name":"INTRO_ACCT_NO","type":["null",{"type":"bytes","logicalType":"
> decimal","precision":20,"scale":0,"cv_auto_incr":false,"cv_case_sensitive":false
> ,"cv_column_class":"java.math.BigDecimal","cv_currency":true,"cv_def_writable":f
> alse,"cv_nullable":1,"cv_precision":20,"cv_read_only":false,"cv_scale":0,"cv_sea
> rchable":true,"cv_signed":true,"cv_subscript":6,"cv_type":2,"cv_typename":"NUMBE
> R","cv_writable":true}]},{"name":"INTRO_SIGN","type":["null",{"type":"string","c
> v_auto_incr":false,"cv_case_sensitive":true,"cv_column_class":"java.lang.String"
> ,"cv_currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":1,"c
> v_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscr
> ipt":7,"cv_type":12,"cv_typename":"VARCHAR2","cv_writable":true}]},{"name":"TYPE
> ","type":["null",{"type":"string","cv_auto_incr":false,"cv_case_sensitive":true,
> "cv_column_class":"java.lang.String","cv_currency":false,"cv_def_writable":false
> ,"cv_nullable":1,"cv_precision":2,"cv_read_only":false,"cv_scale":0,"cv_searchab
> le":true,"cv_signed":true,"cv_subscript":8,"cv_type":12,"cv_typename":"VARCHAR2"
> ,"cv_writable":true}]},{"name":"OPR_MODE","type":["null",{"type":"string","cv_au
> to_incr":false,"cv_case_sensitive":true,"cv_column_class":"java.lang.String","cv
> _currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":2,"cv_re
> ad_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript"
> :9,"cv_type":12,"cv_typename":"VARCHAR2","cv_writable":true}]},{"name":"CUR_ACCT
> _TYPE","type":["null",{"type":"string","cv_auto_incr":false,"cv_case_sensitive":
> true,"cv_column_class":"java.lang.String","cv_currency":false,"cv_def_writable":
> false,"cv_nullable":1,"cv_precision":4,"cv_read_only":false,"cv_scale":0,"cv_sea
> rchable":true,"cv_signed":true,"cv_subscript":10,"cv_type":12,"cv_typename":"VAR
> CHAR2","cv_writable":true}]},{"name":"TITLE","type":["null",{"type":"string","cv
> _auto_incr":false,"cv_case_sensitive":true,"cv_column_class":"java.lang.String",
> "cv_currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":30,"c
> v_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscr
> ipt":11,"cv_type":12,"cv_typename":"VARCHAR2","cv_writable":true}]},{"name":"COR
> P_CUST_NO","type":["null",{"type":"bytes","logicalType":"decimal","precision":20
> ,"scale":0,"cv_auto_incr":false,"cv_case_sensitive":false,"cv_column_class":"jav
> a.math.BigDecimal","cv_currency":true,"cv_def_writable":false,"cv_nullable":1,"c
> v_precision":20,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signe
> d":true,"cv_subscript":12,"cv_type":2,"cv_typename":"NUMBER","cv_writable":true}
> ]},{"name":"APLNDT","type":["null",{"type":"string","cv_auto_incr":false,"cv_cas
> e_sensitive":false,"cv_column_class":"java.sql.Timestamp","cv_currency":false,"c
> v_def_writable":false,"cv_nullable":1,"cv_precision":0,"cv_read_only":false,"cv_
> scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript":13,"cv_type":93,"c
> v_typename":"DATE","cv_writable":true}]},{"name":"OPNDT","type":["null",{"type":
> "string","cv_auto_incr":false,"cv_case_sensitive":false,"cv_column_class":"java.
> sql.Timestamp","cv_currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_p
> recision":0,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":t
> rue,"cv_subscript":14,"cv_type":93,"cv_typename":"DATE","cv_writable":true}]},{"
> name":"VERI_EMP_NO","type":["null",{"type":"bytes","logicalType":"decimal","prec
> ision":20,"scale":0,"cv_auto_incr":false,"cv_case_sensitive":false,"cv_column_cl
> ass":"java.math.BigDecimal","cv_currency":true,"cv_def_writable":false,"cv_nulla
> ble":1,"cv_precision":20,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,
> "cv_signed":true,"cv_subscript":15,"cv_type":2,"cv_typename":"NUMBER","cv_writab
> le":true}]},{"name":"VERI_SIGN","type":["null",{"type":"string","cv_auto_incr":f
> alse,"cv_case_sensitive":true,"cv_column_class":"java.lang.String","cv_currency"
> :false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":1,"cv_read_only":f
> alse,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript":16,"cv_ty
> pe":12,"cv_typename":"VARCHAR2","cv_writable":true}]},{"name":"MANAGER_SIGN","ty
> pe":["null",{"type":"string","cv_auto_incr":false,"cv_case_sensitive":true,"cv_c
> olumn_class":"java.lang.String","cv_currency":false,"cv_def_writable":false,"cv_
> nullable":1,"cv_precision":1,"cv_read_only":false,"cv_scale":0,"cv_searchable":t
> rue,"cv_signed":true,"cv_subscript":17,"cv_type":12,"cv_typename":"VARCHAR2","cv
> _writable":true}]},{"name":"CURBAL","type":["null",{"type":"bytes","logicalType"
> :"decimal","precision":8,"scale":2,"cv_auto_incr":false,"cv_case_sensitive":fals
> e,"cv_column_class":"java.math.BigDecimal","cv_currency":true,"cv_def_writable":
> false,"cv_nullable":1,"cv_precision":8,"cv_read_only":false,"cv_scale":2,"cv_sea
> rchable":true,"cv_signed":true,"cv_subscript":18,"cv_type":2,"cv_typename":"NUMB
> ER","cv_writable":true}]},{"name":"STATUS","type":["null",{"type":"string","cv_a
> uto_incr":false,"cv_case_sensitive":true,"cv_column_class":"java.lang.String","c
> v_currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":1,"cv_r
> ead_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript
> ":19,"cv_type":12,"cv_typename":"VARCHAR2","cv_writable":true}]}]}}}, blocks: [B
> lockMetaData{10, 1281 [ColumnMetaData{SNAPPY [ACCT_NO] BINARY  [BIT_PACKED, PLAI
> N], 4}, ColumnMetaData{SNAPPY [SF_NO] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY
> ], 88}, ColumnMetaData{SNAPPY [LF_NO] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY
> ], 163}, ColumnMetaData{SNAPPY [BRANCH_NO] BINARY  [RLE, BIT_PACKED, PLAIN_DICTI
> ONARY], 241}, ColumnMetaData{SNAPPY [INTRO_CUST_NO] BINARY  [RLE, BIT_PACKED, PL
> AIN_DICTIONARY], 298}, ColumnMetaData{SNAPPY [INTRO_ACCT_NO] BINARY  [RLE, BIT_P
> ACKED, PLAIN_DICTIONARY], 364}, ColumnMetaData{SNAPPY [INTRO_SIGN] BINARY  [RLE,
>  BIT_PACKED, PLAIN_DICTIONARY], 421}, ColumnMetaData{SNAPPY [TYPE] BINARY  [RLE,
>  BIT_PACKED, PLAIN_DICTIONARY], 478}, ColumnMetaData{SNAPPY [OPR_MODE] BINARY  [
> RLE, BIT_PACKED, PLAIN_DICTIONARY], 538}, ColumnMetaData{SNAPPY [CUR_ACCT_TYPE]
> BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 598}, ColumnMetaData{SNAPPY [TITLE]
>  BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 658}, ColumnMetaData{SNAPPY [CORP_
> CUST_NO] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 736}, ColumnMetaData{SNAPP
> Y [APLNDT] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 802}, ColumnMetaData{SNA
> PPY [OPNDT] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 919}, ColumnMetaData{SN
> APPY [VERI_EMP_NO] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 1036}, ColumnMet
> aData{SNAPPY [VERI_SIGN] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 1093}, Col
> umnMetaData{SNAPPY [MANAGER_SIGN] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY], 1
> 150}, ColumnMetaData{SNAPPY [CURBAL] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONARY]
> , 1207}, ColumnMetaData{SNAPPY [STATUS] BINARY  [RLE, BIT_PACKED, PLAIN_DICTIONA
> RY], 1270}]}]}
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise(ParquetRecordReader.java:346)
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.setup(ParquetRecordReader.java:339)
>         at org.apache.drill.exec.physical.impl.ScanBatch.<init>(ScanBatch.java:101)
>         at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:168)
>         at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:56)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:151)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:105)
>         at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:79)
>         at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:230)
>         at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: org.apache.drill.exec.vector.Decimal28SparseVector cannot be cast to org.apache.drill.exec.vector.VariableWidthVector
>         at org.apache.drill.exec.store.parquet.columnreaders.VarLengthValuesColumn.<init>(VarLengthValuesColumn.java:44)
>         at org.apache.drill.exec.store.parquet.columnreaders.VarLengthColumnReaders$Decimal28Column.<init>(VarLengthColumnReaders.java:52)
>         at org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory.getReader(ColumnReaderFactory.java:178)
>         at org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.setup(ParquetRecordReader.java:319)
>         ... 22 more
> Error: SYSTEM ERROR: ClassCastException: org.apache.drill.exec.vector.Decimal28SparseVector cannot be cast to org.apache.drill.exec.vector.VariableWidthVector
> Fragment 0:0
> [Error Id: 22bfa8dd-1129-4300-9449-409e96d6c800 on DaveOshinsky-PC.gp.cv.commvault.com:31010] (state=,code=0)
> 0: jdbc:drill:zk=local> select count(*) from dfs.`c:/dao/DBArchivePredictor/tenrows.parquet` where opr_mode='JO';
> +---------+
> | EXPR$0  |
> +---------+
> | 10      |
> +---------+
> 1 row selected (0.406 seconds)
> 0: jdbc:drill:zk=local>
> The immediate cause of this exception is that Drill, in org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader, assumes that all BINARY values are stored in VariableWidthVectors.  That does not hold for BINARY DECIMAL: Decimal28SparseVector, for example, is a FixedWidthVector, not a VariableWidthVector.  The assumption that DECIMAL is never encoded as variable length BINARY appears in a number of other places in the Drill code, including:
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory only contains logic to handle DECIMAL with INT32, INT64, INT96, or FIXED_LEN_BYTE_ARRAY.  BINARY is not supported with DECIMAL.
> org.apache.drill.exec.store.parquet.columnreaders.NullableFixedByteAlignedReaders does not provide a nullable reader for BINARY in its getNullableColumnReader method.
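
For context on what a reader for the (DECIMAL, BINARY) combination has to do, here is a minimal
Java sketch of decoding one variable length value. It is not Drill code and does not use Drill's
vector classes; the class and method names are hypothetical. It assumes the bytes hold the
unscaled value as a big-endian two's-complement integer, with precision and scale taken from the
column's DECIMAL annotation.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical decoder for illustration only; not part of Drill or Parquet.
final class BinaryDecimalSketch {

    // Turns the raw bytes of one BINARY value into a BigDecimal.
    // `precision` and `scale` come from the column's DECIMAL(precision, scale).
    static BigDecimal decode(byte[] unscaledBigEndian, int precision, int scale) {
        if (unscaledBigEndian.length == 0) {
            throw new IllegalArgumentException("empty decimal value");
        }
        // BigInteger(byte[]) reads big-endian two's-complement bytes of any length,
        // so no padding to a fixed width is required.
        BigDecimal value = new BigDecimal(new BigInteger(unscaledBigEndian), scale);
        if (value.precision() > precision) {
            throw new ArithmeticException("value exceeds declared precision " + precision);
        }
        return value;
    }

    public static void main(String[] args) {
        // A value like ACCT_NO, declared DECIMAL(20,0) but stored in only 4 bytes.
        byte[] acctNo = BigInteger.valueOf(70000020L).toByteArray();
        System.out.println(decode(acctNo, 20, 0));      // prints 70000020

        // A negative value for a DECIMAL(8,2) column: 0xFF85 is -123 unscaled.
        byte[] negative = {(byte) 0xFF, (byte) 0x85};
        System.out.println(decode(negative, 8, 2));     // prints -1.23
    }
}
```

Because each value can have a different length, the decoded result still has to be written into
a fixed width decimal vector, which is where the VariableWidthVector assumption described above
breaks down.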



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
