drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4139) Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
Date Mon, 10 Jul 2017 18:40:02 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080834#comment-16080834 ]

Paul Rogers edited comment on DRILL-4139 at 7/10/17 6:39 PM:
-------------------------------------------------------------

First impression is that we are serializing bytes all wrong. Bytes are not characters. Bytes
span the full range 0-255. Since our JSON is Unicode, encoded as UTF-8, some combinations
of bytes will be interpreted as multi-byte Unicode characters, and others are not valid UTF-8
at all. That is, we are abusing the software.
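
To make the problem concrete, here is a minimal sketch (not Drill's actual serialization code; the class name {{Utf8RoundTrip}} is made up for illustration) that pushes arbitrary bytes through a UTF-8 string and back. A byte that is not valid UTF-8, such as 0xFF, does not survive the round trip:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Drill.
class Utf8RoundTrip {
  // Treats raw bytes as UTF-8 text, then converts the text back to bytes.
  static byte[] roundTrip(byte[] raw) {
    String asText = new String(raw, StandardCharsets.UTF_8);
    return asText.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] raw = {0x00, 0x0A, (byte) 0xFF, 0x13, 0x2D};
    byte[] back = roundTrip(raw);
    // 0xFF is never valid in UTF-8, so the decoder substitutes U+FFFD,
    // which re-encodes as three bytes: the data is silently changed.
    System.out.println(Arrays.equals(raw, back)); // false
    System.out.println(back.length);              // 7
  }
}
```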

The correct format for bytes is a binary encoding. The most basic is a hex string:

{code}
'000AFF132D'
{code}

We interpret the above as two hex digits per byte, in left-to-right (lowest to highest) address
order.
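
As a rough sketch of that hex format (the {{HexCodec}} class and its method names are hypothetical, not existing Drill code), encoding and decoding could look like this:

```java
// Hypothetical helper; not part of Drill.
class HexCodec {
  private static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

  // Encodes each byte as two uppercase hex digits, lowest address first.
  static String encode(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length * 2);
    for (byte b : data) {
      sb.append(DIGITS[(b >> 4) & 0x0F]).append(DIGITS[b & 0x0F]);
    }
    return sb.toString();
  }

  // Decodes a hex string produced by encode() back into bytes.
  static byte[] decode(String hex) {
    if (hex.length() % 2 != 0) {
      throw new IllegalArgumentException("Odd number of hex digits");
    }
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      int hi = Character.digit(hex.charAt(2 * i), 16);
      int lo = Character.digit(hex.charAt(2 * i + 1), 16);
      if (hi < 0 || lo < 0) {
        throw new IllegalArgumentException("Not a hex digit at byte " + i);
      }
      out[i] = (byte) ((hi << 4) | lo);
    }
    return out;
  }
}
```

With this sketch, {{HexCodec.encode(new byte[] {0x00, 0x0A, (byte) 0xFF, 0x13, 0x2D})}} produces the string '000AFF132D', and {{decode}} reverses it.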

The Internet has a number of ways to store binary data in a more compact form. Base64 (RFC 4648)
is popular and has built-in support in Java (the {{Base64.Encoder}} class). For example, here
is a Base64 string:

{code}
'Vm9sb2R5bXlyIFZ5c290c2t5aQ=='
{code}

Base64 has the advantage that it can be broken into lines (as the MIME variant does), which
can be encoded in JSON as an array.
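
A sketch of the array-of-lines idea, using the JDK's {{java.util.Base64}}. The {{Base64Lines}} class, its method names, and the caller-chosen line length are assumptions for illustration only, not a proposed Drill API:

```java
import java.util.Base64;

// Hypothetical helper; not part of Drill.
class Base64Lines {
  // Encodes bytes as Base64 and splits the text into lines of at most
  // lineLength characters, suitable for storing as a JSON array of strings.
  static String[] encode(byte[] data, int lineLength) {
    String text = Base64.getEncoder().encodeToString(data);
    int count = (text.length() + lineLength - 1) / lineLength;
    String[] lines = new String[count];
    for (int i = 0; i < count; i++) {
      int start = i * lineLength;
      lines[i] = text.substring(start, Math.min(text.length(), start + lineLength));
    }
    return lines;
  }

  // Joins the lines back together and decodes the original bytes.
  static byte[] decode(String[] lines) {
    return Base64.getDecoder().decode(String.join("", lines));
  }
}
```

The sample string above is the Base64 form of the 19 ASCII bytes of "Volodymyr Vysotskyi". Split at 16 characters it becomes two lines, and {{decode}} recovers the same bytes.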

Note that, for encoding purposes, we only care about the byte order in the buffer: left-to-right.
The meaning of those bytes is unimportant for serialization. That is, whether the data is
big-endian, little-endian, stream-of-bytes, or stream-of-multi-byte characters is important
to the code that interprets the (decoded) bytes, but not to the serialization format.
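
To illustrate, here is a small sketch (the {{ByteOrderDemo}} class is hypothetical) that writes the same integer in both byte orders. The serializer round-trips each buffer unchanged without knowing which order was used:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of Drill.
class ByteOrderDemo {
  // Writes a 32-bit int into a 4-byte buffer using the given byte order.
  static byte[] toBytes(int value, ByteOrder order) {
    return ByteBuffer.allocate(4).order(order).putInt(value).array();
  }

  // Serializes and deserializes a buffer without looking at its meaning.
  static byte[] roundTrip(byte[] buffer) {
    String text = Base64.getEncoder().encodeToString(buffer);
    return Base64.getDecoder().decode(text);
  }

  public static void main(String[] args) {
    byte[] big = toBytes(0x000AFF13, ByteOrder.BIG_ENDIAN);       // 00 0A FF 13
    byte[] little = toBytes(0x000AFF13, ByteOrder.LITTLE_ENDIAN); // 13 FF 0A 00
    // Each buffer is preserved exactly as given; only the reader of the
    // decoded bytes needs to know which order was used.
    System.out.println(Arrays.equals(roundTrip(big), big));       // true
    System.out.println(Arrays.equals(roundTrip(little), little)); // true
  }
}
```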





> Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
> -----------------------------------------------------------------
>
>                 Key: DRILL-4139
>                 URL: https://issues.apache.org/jira/browse/DRILL-4139
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.3.0
>         Environment: 4 node cluster on CentOS
>            Reporter: Khurram Faraaz
>            Assignee: Volodymyr Vysotskyi
>         Attachments: metadata file v3, metadata file with changes
>
>
> Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> is seen in drillbit.log after Functional run on 4 node cluster.
> Drill 1.3.0 sys.version => d61bb83a8
> {code}
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning class: org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Total elapsed time to build and analyze filter tree: 0 ms
> 2015-11-27 03:12:19,810 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] WARN  o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
>         at org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:479) ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96) ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:235) ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.logicalPlanningVolcanoAndLopt(DefaultSqlHandler.java:545) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:213) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:248) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:164) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:184) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:905) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:244) [drill-java-exec-1.3.0.jar:1.3.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
>         at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
