From: "ASF GitHub Bot (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Date: Sat, 7 Nov 2015 03:29:10 +0000 (UTC)
Subject: [jira] [Commented] (DRILL-4048) Parquet reader corrupts dictionary encoded binary columns


    [ https://issues.apache.org/jira/browse/DRILL-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994996#comment-14994996 ]

ASF GitHub Bot commented on DRILL-4048:
---------------------------------------

GitHub user jaltekruse opened a pull request:

    https://github.com/apache/drill/pull/247

    DRILL-4048: Fix reading required dictionary encoded varbinary data in…

    … parquet files after recent update

    Fix was small, this update is a little larger than necessary because I was hoping to create
    a unit test by modifying the one I had added in the earlier patch with the version upgrade.
    Unfortunately we don't have a good way to generate Parquet files with required columns from
    unit tests right now. So I just added a smaller subset of the binary file that was posted
    on the JIRA issue. The refactoring of the earlier test was still useful for readability,
    so I kept it in.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaltekruse/incubator-drill DRILL-4048

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #247

----
commit e344a1fdf08192d6f3d18b09c1e7c3bcc478f518
Author: Jason Altekruse
Date:   2015-11-07T03:24:28Z

    DRILL-4048: Fix reading required dictionary encoded varbinary data in parquet files after recent update

    Fix was small, this update is a little larger than necessary because I was hoping to create
    a unit test by modifying the one I had added in the earlier patch with the version upgrade.
    Unfortunately we don't have a good way to generate Parquet files with required columns from
    unit tests right now. So I just added a smaller subset of the binary file that was posted
    on the JIRA issue. The refactoring of the earlier test was still useful for readability,
    so I kept it in.

----

> Parquet reader corrupts dictionary encoded binary columns
> ---------------------------------------------------------
>
>                 Key: DRILL-4048
>                 URL: https://issues.apache.org/jira/browse/DRILL-4048
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.3.0
>            Reporter: Rahul Challapalli
>            Assignee: Jason Altekruse
>            Priority: Blocker
>         Attachments: lineitem_dic_enc.parquet
>
>
> git.commit.id.abbrev=04c01bd
> The below query returns corrupted data (not even showing up here) for binary columns
> {code}
> select * from `lineitem_dic_enc.parquet` limit 1;
> +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
> | l_orderkey  | l_partkey  | l_suppkey  | l_linenumber  | l_quantity  | l_extendedprice  | l_discount  | l_tax  | l_returnflag  | l_linestatus  | l_shipdate  | l_commitdate  | l_receiptdate  |   l_shipinstruct   | l_shipmode  |        l_comment         |
> +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
> | 1           | 1552       | 93         | 1             | 17.0        | 24710.35         | 0.04        | 0.02   | \x01          | \x01          | 1996-03-13  | 1996-02-12    | 1996-03-22     | \x11DELIVER IN PE  | T           | egular courts above the  |
> +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
> {code}
> The same query from an older build (git.commit.id.abbrev=839f8da)
> {code}
> select * from `lineitem_dic_enc.parquet` limit 1;
> +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
> | l_orderkey  | l_partkey  | l_suppkey  | l_linenumber  | l_quantity  | l_extendedprice  | l_discount  | l_tax  | l_returnflag  | l_linestatus  | l_shipdate  | l_commitdate  | l_receiptdate  |   l_shipinstruct   | l_shipmode  |        l_comment         |
> +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
> | 1           | 1552       | 93         | 1             | 17.0        | 24710.35         | 0.04        | 0.02   | N             | O             | 1996-03-13  | 1996-02-12    | 1996-03-22     | DELIVER IN PERSON  | TRUCK       | egular courts above the  |
> +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
> {code}
> Below is the output of the parquet-meta command for this dataset
> {code}
> creator:     parquet-mr
> file schema: root
> ----------------------------------------------------------------------------------------------------
> l_orderkey:      REQUIRED INT32 R:0 D:0
> l_partkey:       REQUIRED INT32 R:0 D:0
> l_suppkey:       REQUIRED INT32 R:0 D:0
> l_linenumber:    REQUIRED INT32 R:0 D:0
> l_quantity:      REQUIRED DOUBLE R:0 D:0
> l_extendedprice: REQUIRED DOUBLE R:0 D:0
> l_discount:      REQUIRED DOUBLE R:0 D:0
> l_tax:           REQUIRED DOUBLE R:0 D:0
> l_returnflag:    REQUIRED BINARY O:UTF8 R:0 D:0
> l_linestatus:    REQUIRED BINARY O:UTF8 R:0 D:0
> l_shipdate:      REQUIRED INT32 O:DATE R:0 D:0
> l_commitdate:    REQUIRED INT32 O:DATE R:0 D:0
> l_receiptdate:   REQUIRED INT32 O:DATE R:0 D:0
> l_shipinstruct:  REQUIRED BINARY O:UTF8 R:0 D:0
> l_shipmode:      REQUIRED BINARY O:UTF8 R:0 D:0
> l_comment:       REQUIRED BINARY O:UTF8 R:0 D:0
> row group 1:     RC:60175 TS:3049610
> ----------------------------------------------------------------------------------------------------
> l_orderkey:      INT32  SNAPPY DO:0 FPO:4       SZ:146159/165487/1.13  VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_partkey:       INT32  SNAPPY DO:0 FPO:146163  SZ:90867/90918/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_suppkey:       INT32  SNAPPY DO:0 FPO:237030  SZ:53244/53230/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_linenumber:    INT32  SNAPPY DO:0 FPO:290274  SZ:14909/22767/1.53    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_quantity:      DOUBLE SNAPPY DO:0 FPO:305183  SZ:45536/45715/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_extendedprice: DOUBLE SNAPPY DO:0 FPO:350719  SZ:327454/407907/1.25  VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_discount:      DOUBLE SNAPPY DO:0 FPO:678173  SZ:30349/30359/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_tax:           DOUBLE SNAPPY DO:0 FPO:708522  SZ:30334/30342/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_returnflag:    BINARY SNAPPY DO:0 FPO:738856  SZ:14700/14714/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_linestatus:    BINARY SNAPPY DO:0 FPO:753556  SZ:8964/9506/1.06      VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_shipdate:      INT32  SNAPPY DO:0 FPO:762520  SZ:100537/100514/1.00  VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_commitdate:    INT32  SNAPPY DO:0 FPO:863057  SZ:100314/100282/1.00  VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_receiptdate:   INT32  SNAPPY DO:0 FPO:963371  SZ:100584/100558/1.00  VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_shipinstruct:  BINARY SNAPPY DO:0 FPO:1063955 SZ:15311/15303/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_shipmode:      BINARY SNAPPY DO:0 FPO:1079266 SZ:22800/22797/1.00    VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
> l_comment:       BINARY SNAPPY DO:0 FPO:1102066 SZ:795339/1839211/2.31 VC:60175 ENC:PLAIN,BIT_PACKED
> {code}
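
A note on the corrupted values in the first query result: they look consistent with a reader that starts each dictionary value at its length prefix instead of just past it. Parquet's PLAIN encoding stores a BYTE_ARRAY value as a 4-byte little-endian length followed by the bytes. The stray byte in front of "DELIVER IN PE" is 0x11, which is 17, the length of "DELIVER IN PERSON". This is an inference from the query output only, not a description of the actual change in the pull request. A minimal Python sketch of the idea follows; the helper names and the skip_prefix flag are hypothetical, invented for illustration, and are not Drill or Parquet APIs.

```python
import struct


def encode_plain_byte_array(values):
    """PLAIN-encode BYTE_ARRAY values: 4-byte little-endian length, then the raw bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def read_value(buf, offset, skip_prefix=True):
    """Read one value at `offset`; return (value, offset of the next value).

    skip_prefix=False mimics the suspected bug (an assumption): the length is
    read correctly, but the value bytes are taken starting at the prefix itself.
    """
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4 if skip_prefix else offset
    return buf[start:start + length], offset + 4 + length


def read_all(buf, skip_prefix=True):
    """Decode every value in a PLAIN-encoded buffer."""
    values, offset = [], 0
    while offset < len(buf):
        value, offset = read_value(buf, offset, skip_prefix)
        values.append(value)
    return values


# Stand-in for a dictionary page holding three of the values from the good output.
dictionary_page = encode_plain_byte_array([b"DELIVER IN PERSON", b"N", b"TRUCK"])

correct = read_all(dictionary_page)
misread = read_all(dictionary_page, skip_prefix=False)

for good, bad in zip(correct, misread):
    print(good, "->", bad)
```

If the terminal drops the NUL bytes, the misread values would display as 0x11 followed by "DELIVER IN PE", a lone 0x01, and 0x05 followed by "T". That lines up with the l_shipinstruct, l_returnflag, and l_shipmode cells of the corrupted row, while l_comment, which parquet-meta reports as PLAIN rather than dictionary encoded, is unaffected.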