drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From paul-rogers <...@git.apache.org>
Subject [GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns
Date Fri, 06 Oct 2017 18:30:31 GMT
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143264522
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan
rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns)
{
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    --- End diff --
    
    What are the semantics of `getType()`? Does it return null if there is no such type? If
so, then we can retrieve the type and check if it is null. Otherwise, if it throws an exception,
perhaps we can catch that. Either way, we avoid two searches for the same column and two conversions
of the path to lower case.
    
    Note also that a recent change deprecated `getPath()`. The preferred form is `getName()`
so it is clear that we are getting a single name part. If the Parquet column is nested (a.b.c,
say), then we have to explicitly handle the multiple name parts as Drill does support names
with dots and one cannot simply concatenate or split a path to get a string. That is, `["a.b",
"c"]` is two fields, `["a", "b", "c"]` is three, but both are represented as a full path as
"a.b.c", creating an ambiguity.


---

Mime
View raw message