drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file
Date Wed, 20 Sep 2017 17:24:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173534#comment-16173534 ]

ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/949#discussion_r140033471
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -819,63 +827,64 @@ private void init() throws IOException {
               }
             }
             rowGroupInfo.setEndpointByteMap(endpointByteMap);
    +        rowGroupInfo.setColumns(rg.getColumns());
             rgIndex++;
             rowGroupInfos.add(rowGroupInfo);
           }
         }
     
         this.endpointAffinities = AffinityCreator.getAffinityMap(rowGroupInfos);
    +    updatePartitionColTypeMap();
    +  }
     
    +  private void updatePartitionColTypeMap() {
         columnValueCounts = Maps.newHashMap();
         this.rowCount = 0;
         boolean first = true;
    -    for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
    -      for (RowGroupMetadata rowGroup : file.getRowGroups()) {
    -        long rowCount = rowGroup.getRowCount();
    -        for (ColumnMetadata column : rowGroup.getColumns()) {
    -          SchemaPath schemaPath = SchemaPath.getCompoundPath(column.getName());
    -          Long previousCount = columnValueCounts.get(schemaPath);
    -          if (previousCount != null) {
    -            if (previousCount != GroupScan.NO_COLUMN_STATS) {
    -              if (column.getNulls() != null) {
    -                Long newCount = rowCount - column.getNulls();
    -                columnValueCounts.put(schemaPath, columnValueCounts.get(schemaPath) + newCount);
    -              }
    -            }
    -          } else {
    +    for (RowGroupInfo rowGroup : this.rowGroupInfos) {
    --- End diff --
    
    Isn't this doing the same thing as the original code? The rowGroupInfos list is built from the RowGroupMetadata in the files, isn't it?
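
    For reference, the accumulation in the removed lines can be sketched in isolation. This is a minimal standalone sketch, not the Drill code: `Column`, `RowGroup`, `countValues` and the `-1` sentinel are hypothetical stand-ins for `ColumnMetadata`, `RowGroupMetadata` and `GroupScan.NO_COLUMN_STATS`, and the handling of a column seen for the first time is an assumption, because the diff is cut off at that branch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnValueCountSketch {
  // Hypothetical sentinel standing in for GroupScan.NO_COLUMN_STATS.
  static final long NO_COLUMN_STATS = -1L;

  // Hypothetical, simplified column metadata: a name and an optional null count.
  static final class Column {
    final String name;
    final Long nulls; // null when no null count was recorded

    Column(String name, Long nulls) {
      this.name = name;
      this.nulls = nulls;
    }
  }

  // Hypothetical, simplified row group: a row count plus its columns.
  static final class RowGroup {
    final long rowCount;
    final List<Column> columns;

    RowGroup(long rowCount, List<Column> columns) {
      this.rowCount = rowCount;
      this.columns = columns;
    }
  }

  // Sums non-null value counts per column over the given row groups.
  static Map<String, Long> countValues(List<RowGroup> rowGroups) {
    Map<String, Long> counts = new HashMap<>();
    for (RowGroup rowGroup : rowGroups) {
      for (Column column : rowGroup.columns) {
        Long previous = counts.get(column.name);
        if (previous != null) {
          // Same shape as the removed lines: add only when stats are usable.
          if (previous != NO_COLUMN_STATS && column.nulls != null) {
            counts.put(column.name, previous + (rowGroup.rowCount - column.nulls));
          }
        } else if (column.nulls != null) {
          // Assumption: the first sighting of a column seeds its count.
          counts.put(column.name, rowGroup.rowCount - column.nulls);
        } else {
          // Assumption: a missing null count marks the column as having no stats.
          counts.put(column.name, NO_COLUMN_STATS);
        }
      }
    }
    return counts;
  }
}
```

    In this sketch the totals depend only on which row groups are passed in. A list built one-to-one from the file metadata gives the same counts as iterating the files directly, while a reduced list gives smaller counts.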


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for Parquet files, but only for the case of one row group per file. When a file contains multiple row groups, Drill detects that a row group can be pruned but then tells the drillbit to read the whole file, which leads to performance issues.
> Having multiple row groups per file makes it possible to handle partitioned datasets and still read only the relevant subset of data, without ending up with more files than really needed.
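
The per-row-group pruning described above can be illustrated with a minimal sketch. It is not Drill's implementation: `RowGroupStats`, `canPrune` and `rowGroupsToRead` are hypothetical names, and the sketch assumes each row group records min/max statistics for the filtered column and that the filter is a simple equality on a long value.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruningSketch {
  // Hypothetical min/max statistics for one column in one row group.
  static final class RowGroupStats {
    final int index;
    final long min;
    final long max;

    RowGroupStats(int index, long min, long max) {
      this.index = index;
      this.min = min;
      this.max = max;
    }
  }

  // For a filter "col = value", a row group can be skipped when the value
  // lies outside the [min, max] range recorded for that row group.
  static boolean canPrune(RowGroupStats stats, long value) {
    return value < stats.min || value > stats.max;
  }

  // Returns the indexes of the row groups that still have to be read,
  // instead of falling back to reading the whole file.
  static List<Integer> rowGroupsToRead(List<RowGroupStats> file, long value) {
    List<Integer> keep = new ArrayList<>();
    for (RowGroupStats stats : file) {
      if (!canPrune(stats, value)) {
        keep.add(stats.index);
      }
    }
    return keep;
  }
}
```

With three row groups covering the ranges 1 to 10, 11 to 20 and 21 to 30, a filter on the value 15 leaves only the second row group to read.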



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
