Date: Mon, 10 Jul 2017 08:41:00 +0000 (UTC)
From: "Volodymyr Vysotskyi (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-4139) Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types

[ https://issues.apache.org/jira/browse/DRILL-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080025#comment-16080025 ]

Volodymyr Vysotskyi commented on DRILL-4139:
--------------------------------------------

Drill serializes the values of binary fields to the parquet metadata cache file using the code {{new String(((Binary) bytes).getBytes())}}. But when the bytes have an encoding that differs from the default one, for example when they use little-endian byte order, {{new String(((Binary) bytes).getBytes()).getBytes()}} returns a byte array that differs from {{bytes}}.

According to [Parquet Logical Type Definitions|https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md], big-endian byte order should be used to store DECIMAL values in a fixed_len_byte_array or binary field. The INTERVAL type uses little-endian byte order to store its value in a fixed_len_byte_array field.
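
To illustrate the lossy round trip described above, here is a minimal sketch. It is not Drill's code: the class and method names are hypothetical, the charset is passed explicitly as UTF-8 (an assumption, so that the result does not depend on the JVM default), and the sample value is a made-up 12-byte little-endian INTERVAL whose first byte is 0xB1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it only mirrors the String round-trip pattern quoted above.
final class StringRoundTripDemo {

    // Decode the raw bytes as text and encode them again.
    // Assumption: UTF-8 stands in for the platform default charset.
    static byte[] roundTripThroughString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Made-up sample: 12 bytes, first byte 0xB1, the rest zero.
    static byte[] sampleInterval() {
        byte[] raw = new byte[12];
        raw[0] = (byte) 0xB1;
        return raw;
    }

    public static void main(String[] args) {
        byte[] raw = sampleInterval();
        byte[] back = roundTripThroughString(raw);
        System.out.println("original:   " + Arrays.toString(raw));
        System.out.println("round trip: " + Arrays.toString(back));
        System.out.println("identical:  " + Arrays.equals(raw, back));
    }
}
```

The byte 0xB1 is not a valid UTF-8 sequence on its own, so the decoder substitutes a replacement character and the re-encoded array no longer matches the input. Plain ASCII input would survive the same round trip, which may be why the problem only shows up for some column types.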

Drill correctly stores only the values of binary fields in the parquet metadata cache file; values of fixed_len_byte_array fields are stored as Binary objects:
{noformat}
{
  "name" : [ "col_intrvl_yr" ],
  "minValue" : {
    "bytesUnsafe" : "sQAAAAAAAAAAAAAA",
    "bytes" : "sQAAAAAAAAAAAAAA",
    "backingBytesReused" : true
  },
  "maxValue" : {
    "bytesUnsafe" : "OgEAAAAAAAAAAAAA",
    "bytes" : "OgEAAAAAAAAAAAAA",
    "backingBytesReused" : true
  },
  "nulls" : 0
}
{noformat}

Since Drill may store some types in either binary or fixed_len_byte_array fields, both field types must be serialized and deserialized in the same way. For example, according to [Parquet Logical Type Definitions|https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md], a DECIMAL field may be stored as a binary or a fixed_len_byte_array field.

The proposal is to serialize byte arrays directly by calling {{((Binary) value.minValue).getBytes()}} and to deserialize them by calling {{Base64.decodeBase64(((String) source).getBytes())}}, so that there is no dependence on the byte order.

Another problem is backward compatibility. When a metadata file created by a version of Drill with these changes is read by an older Drill version, it may lead to errors or wrong results. Updating the metadata version does not help, since old Drill versions just throw an exception when they try to read new metadata cache files:
{noformat}
Error: SYSTEM ERROR: JsonMappingException: Could not resolve type id 'v4' into a subtype of [simple type, class org.apache.drill.exec.store.parquet.Metadata$ParquetTableMetadataBase]: known type ids = [Metadata$ParquetTableMetadataBase, v1, v2, v3]
 at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@7b609ce0; line: 2, column: 24]
{noformat}

Metadata cache files without and with the changes for DRILL-4139 are attached to the Jira. The Drill version with the changes for this Jira can read parquet table metadata cache files with version v3 and older.
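
The Base64 approach proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK's {{java.util.Base64}} rather than the {{Base64.decodeBase64}} call named in the proposal, the class and method names are hypothetical, and the month decoding relies on the Parquet INTERVAL layout (12 bytes, little-endian, months first). The sample strings are the min and max values from the cache excerpt above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of Drill.
final class Base64StatsDemo {

    // Serialize raw statistic bytes as Base64 text, without decoding them as characters.
    static String serialize(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Deserialize Base64 text back into the exact original bytes.
    static byte[] deserialize(String text) {
        return Base64.getDecoder().decode(text.getBytes(StandardCharsets.US_ASCII));
    }

    // Read the first four bytes of an INTERVAL value as a little-endian month count.
    // Note: the field is unsigned in Parquet; a signed int is enough for small samples.
    static int intervalMonths(byte[] interval) {
        return ByteBuffer.wrap(interval).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    public static void main(String[] args) {
        for (String cached : new String[] {"sQAAAAAAAAAAAAAA", "OgEAAAAAAAAAAAAA"}) {
            byte[] raw = deserialize(cached);
            System.out.println(cached + " -> " + Arrays.toString(raw)
                + ", months=" + intervalMonths(raw)
                + ", re-encoded equal=" + serialize(raw).equals(cached));
        }
    }
}
```

Because the bytes never pass through a charset decoder, the byte order of the stored value does not matter to the round trip; interpreting the bytes is left entirely to the reader of the statistics.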
Drill 1.10.0 throws an exception when it tries to read a parquet table metadata cache file with version v4 or greater.

> Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
> -----------------------------------------------------------------
>
> Key: DRILL-4139
> URL: https://issues.apache.org/jira/browse/DRILL-4139
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.3.0
> Environment: 4 node cluster on CentOS
> Reporter: Khurram Faraaz
> Assignee: Volodymyr Vysotskyi
>
> Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> is seen in drillbit.log after Functional run on 4 node cluster.
> Drill 1.3.0 sys.version => d61bb83a8
> {code}
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning class: org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO o.a.d.e.p.l.partition.PruneScanRule - Total elapsed time to build and analyze filter tree: 0 ms
> 2015-11-27 03:12:19,810 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] WARN o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
>     at org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:479) ~[drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96) ~[drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:235) ~[drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>     at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>     at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>     at org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.logicalPlanningVolcanoAndLopt(DefaultSqlHandler.java:545) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:213) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:248) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:164) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:184) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:905) [drill-java-exec-1.3.0.jar:1.3.0]
>     at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:244) [drill-java-exec-1.3.0.jar:1.3.0]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
>     at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)