drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5797) Use more often the new parquet reader
Date Mon, 09 Oct 2017 08:11:02 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196621#comment-16196621 ]

ASF GitHub Bot commented on DRILL-5797:
---------------------------------------

Github user dprofeta commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143403559
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    +    if (type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive()) {
    --- End diff --
    
    Yes, sure. I first wanted to check it in a loop, but ParquetRecordReader doesn't handle any nested types, so the loop is no longer needed. I just didn't refactor enough.


> Use more often the new parquet reader
> -------------------------------------
>
>                 Key: DRILL-5797
>                 URL: https://issues.apache.org/jira/browse/DRILL-5797
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>             Fix For: 1.12.0
>
>
> The choice between the regular parquet reader and the optimized one is based on what types of columns are in the file; the columns actually read by the query are not taken into account. We can slightly increase the number of cases where the optimized reader is used by checking whether the projected columns are simple or not.
> This is an interim optimization until the fast parquet reader can handle complex structures.
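
The selection rule described in the issue can be illustrated with a minimal, self-contained sketch. It does not use the Parquet or Drill classes from the diff: `Field`, `isStarQuery`, `isComplex`, and `chooseReader` below are hypothetical stand-ins, the schema is reduced to a map of lower-cased top-level field names, and treating a null projection or one containing `*` as a star query is an assumption made only for this example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the Parquet or Drill API.
final class ReaderChoiceSketch {

  /** Simplified top-level field: primitive or group, optionally repeated. */
  static final class Field {
    final boolean primitive;
    final boolean repeated;

    Field(boolean primitive, boolean repeated) {
      this.primitive = primitive;
      this.repeated = repeated;
    }

    /** A field counts as complex if it is repeated or is not a primitive. */
    boolean isComplex() {
      return repeated || !primitive;
    }
  }

  /** Assumption: a null projection or one containing "*" selects every column. */
  static boolean isStarQuery(List<String> columns) {
    return columns == null || columns.contains("*");
  }

  /**
   * Star query: any complex field in the file forces the regular reader.
   * Otherwise only the projected columns are inspected; a column that is
   * absent from the schema is treated as not complex.
   */
  static boolean isComplex(Map<String, Field> schema, List<String> columns) {
    if (isStarQuery(columns)) {
      for (Field field : schema.values()) {
        if (field.isComplex()) {
          return true;
        }
      }
      return false;
    }
    for (String column : columns) {
      // Schema keys are lower-cased, so the lookup is case-insensitive.
      Field field = schema.get(column.toLowerCase(Locale.ROOT));
      if (field != null && field.isComplex()) {
        return true;
      }
    }
    return false;
  }

  /** Returns a label for the reader this sketch would pick. */
  static String chooseReader(Map<String, Field> schema, List<String> columns) {
    return isComplex(schema, columns) ? "regular" : "optimized";
  }

  public static void main(String[] args) {
    Map<String, Field> schema = Map.of(
        "id", new Field(true, false),     // simple primitive column
        "tags", new Field(true, true),    // repeated primitive column
        "address", new Field(false, false)); // nested group

    System.out.println(chooseReader(schema, List.of("id")));
    System.out.println(chooseReader(schema, List.of("id", "tags")));
    System.out.println(chooseReader(schema, List.of("*")));
  }
}
```

With a file that contains both simple and complex columns, a query projecting only the simple ones is no longer pushed to the regular reader, which is the gain the issue describes. The real check in the pull request works on the Parquet `MessageType` and Drill `SchemaPath` types and may differ in detail from this simplification.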



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
