drill-issues mailing list archives

From "Volodymyr Vysotskyi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4139) Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
Date Mon, 10 Jul 2017 08:41:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080025#comment-16080025

Volodymyr Vysotskyi commented on DRILL-4139:

Drill serializes the values of binary fields to the parquet metadata cache file using the code
{{new String(((Binary) bytes).getBytes())}}.
But when the bytes use an encoding that differs from the default one, for example when the value
is stored in little-endian byte order, {{new String(((Binary) bytes).getBytes()).getBytes()}}
returns a byte array that differs from {{bytes}}.
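
The lossy round trip can be reproduced outside Drill. The following is a minimal sketch, not Drill code: the class and method names are hypothetical, and UTF-8 is passed explicitly (an assumption, standing in for the platform default charset) so that the result does not depend on the machine it runs on.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Drill.
class StringRoundTripDemo {

    // Same shape as new String(bytes).getBytes(), but with an explicit charset
    // (assumption: UTF-8 stands in for the platform default).
    static byte[] roundTripThroughString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The int 177 written in little-endian order; 0xB1 alone is not valid UTF-8.
        byte[] raw = {(byte) 0xB1, 0x00, 0x00, 0x00};
        byte[] back = roundTripThroughString(raw);

        System.out.println("original:     " + Arrays.toString(raw));
        System.out.println("after String: " + Arrays.toString(back));
        System.out.println("identical:    " + Arrays.equals(raw, back));
    }
}
```

The invalid byte is replaced by the charset's replacement character during decoding, so the array that comes back is longer than the one that went in and no longer encodes the original value.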
According to [Parquet Logical Type Definitions|https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md],
big-endian byte order should be used to store DECIMAL values in a fixed_len_byte_array or binary
field, while the INTERVAL type uses little-endian byte order to store its value in a
fixed_len_byte_array field.
Drill stores correctly only the values of binary fields in the parquet metadata cache file;
values of fixed_len_byte_array fields are stored as Binary objects:
        "name" : [ "col_intrvl_yr" ],
        "minValue" : {
          "bytesUnsafe" : "sQAAAAAAAAAAAAAA",
          "bytes" : "sQAAAAAAAAAAAAAA",
          "backingBytesReused" : true
        },
        "maxValue" : {
          "bytesUnsafe" : "OgEAAAAAAAAAAAAA",
          "bytes" : "OgEAAAAAAAAAAAAA",
          "backingBytesReused" : true
        },
        "nulls" : 0
Since Drill may store some types in either binary or fixed_len_byte_array fields, values of both
kinds of fields must be serialized and deserialized in the same way. For example, according to [Parquet
Logical Type Definitions|https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md],
a DECIMAL field may be stored as either a binary or a fixed_len_byte_array field.
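
To make the two byte orders concrete, here is a small sketch that decodes the {{col_intrvl_yr}} min and max values from the metadata excerpt above. It assumes the layout given in the Parquet logical type definitions (INTERVAL as three little-endian unsigned 32-bit integers for months, days and milliseconds; DECIMAL as a big-endian two's-complement unscaled value). The class and method names are hypothetical and the decimal bytes are made up for illustration.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

// Hypothetical demo class; not part of Drill.
class ByteOrderDemo {

    // INTERVAL: 12 bytes = months, days, milliseconds, each a little-endian unsigned int.
    static long[] decodeInterval(byte[] raw) {
        if (raw.length != 12) {
            throw new IllegalArgumentException("INTERVAL must be 12 bytes, got " + raw.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long months = buf.getInt() & 0xFFFFFFFFL;
        long days = buf.getInt() & 0xFFFFFFFFL;
        long millis = buf.getInt() & 0xFFFFFFFFL;
        return new long[] {months, days, millis};
    }

    // DECIMAL: the unscaled value is big-endian two's complement, which BigInteger reads directly.
    static BigInteger decodeDecimalUnscaled(byte[] raw) {
        return new BigInteger(raw);
    }

    public static void main(String[] args) {
        long[] min = decodeInterval(Base64.getDecoder().decode("sQAAAAAAAAAAAAAA"));
        long[] max = decodeInterval(Base64.getDecoder().decode("OgEAAAAAAAAAAAAA"));
        System.out.println("min months = " + min[0]); // 177
        System.out.println("max months = " + max[0]); // 314

        // Illustrative DECIMAL bytes (not taken from the attached files): 0x013A big-endian.
        System.out.println("unscaled   = " + decodeDecimalUnscaled(new byte[] {0x01, 0x3A})); // 314
    }
}
```

The same number, 314, appears as {{3A 01 00 00}} in the interval and as {{01 3A}} in the decimal, which is why the serialized form must keep the raw bytes intact rather than reinterpret them.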

The proposal is to serialize byte arrays directly by calling {{((Binary) value.minValue).getBytes()}}
and to deserialize them by calling {{Base64.decodeBase64(((String) source).getBytes())}}.
This removes any dependence on the byte order.
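
A minimal sketch of that round trip follows. It is an illustration only: it uses the JDK's {{java.util.Base64}} in place of the {{Base64.decodeBase64}} helper named above, the class and method names are hypothetical, and it assumes the raw byte array ends up in the JSON file as a Base64 string, as in the metadata excerpt.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of Drill.
class Base64RoundTripDemo {

    // Writes the raw bytes as Base64 text; no charset decoding of the payload is involved.
    static String serialize(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Reads the Base64 text back into the original bytes.
    static byte[] deserialize(String text) {
        return Base64.getDecoder().decode(text.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // 12-byte INTERVAL value: 177 months, 0 days, 0 milliseconds (little-endian).
        byte[] raw = new byte[12];
        raw[0] = (byte) 0xB1;

        String text = serialize(raw);
        byte[] back = deserialize(text);

        System.out.println("serialized: " + text);
        System.out.println("identical:  " + Arrays.equals(raw, back));
    }
}
```

Because Base64 treats the input as opaque bytes, the result is the same for big-endian DECIMAL values and little-endian INTERVAL values.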

Another problem is backward compatibility. When a metadata file created by a version of Drill
with these changes is read by an older Drill version, it may lead to errors or
wrong results. Updating the metadata version does not help, since old Drill versions just
throw an exception when trying to read new metadata cache files:
Error: SYSTEM ERROR: JsonMappingException: Could not resolve type id 'v4' into a subtype of
[simple type, class org.apache.drill.exec.store.parquet.Metadata$ParquetTableMetadataBase]:
known type ids = [Metadata$ParquetTableMetadataBase, v1, v2, v3]
 at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@7b609ce0; line:
2, column: 24]

Metadata cache files generated without and with the changes for DRILL-4139 are attached to the Jira.

A Drill version with the changes for this Jira can read parquet table metadata cache files of
version v3 and older.
Drill 1.10.0 throws an exception when it tries to read a parquet table metadata cache file of
version v4 or greater.
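
The compatibility matrix described above can be summarized with a small sketch. This is not how Drill resolves metadata versions (the exception above comes from its JSON type resolution); the class, the version sets and the method name are all hypothetical, and the assumption that the changed Drill writes and accepts v4 is taken from the error message and the statements above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class; not part of Drill.
class MetadataVersionGateDemo {

    // Versions known to an older reader such as Drill 1.10.0 (per the error message above).
    static final Set<String> OLD_READER = new LinkedHashSet<>(Arrays.asList("v1", "v2", "v3"));

    // Assumption: a reader with the DRILL-4139 changes also knows v4.
    static final Set<String> NEW_READER = new LinkedHashSet<>(Arrays.asList("v1", "v2", "v3", "v4"));

    // A reader can load a cache file only if it knows the file's version id.
    static boolean canRead(Set<String> knownVersions, String fileVersion) {
        return knownVersions.contains(fileVersion);
    }

    public static void main(String[] args) {
        System.out.println("new reader, v3 file: " + canRead(NEW_READER, "v3")); // true
        System.out.println("old reader, v4 file: " + canRead(OLD_READER, "v4")); // false
    }
}
```

The asymmetry is the point: the newer reader stays compatible with existing cache files, while an older reader fails on new ones rather than silently misreading them.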

> Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
> -----------------------------------------------------------------
>                 Key: DRILL-4139
>                 URL: https://issues.apache.org/jira/browse/DRILL-4139
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.3.0
>         Environment: 4 node cluster on CentOS
>            Reporter: Khurram Faraaz
>            Assignee: Volodymyr Vysotskyi
> Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> is seen in drillbit.log after Functional run on 4 node cluster.
> Drill 1.3.0 sys.version => d61bb83a8
> {code}
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning class: org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Total elapsed time to build and analyze filter tree: 0 ms
> 2015-11-27 03:12:19,810 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] WARN  o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
>         at org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:479)
>         at org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96)
>         at org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:235)
>         at org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87)
>         at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228)
>         at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808)
>         at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.logicalPlanningVolcanoAndLopt(DefaultSqlHandler.java:545)
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:213)
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:248)
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:164)
>         at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:184)
>         at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:905) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:244) [drill-java-exec-1.3.0.jar:1.3.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> {code}

This message was sent by Atlassian JIRA
