drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file
Date Mon, 09 Oct 2017 17:24:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197343#comment-16197343 ]

ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/949
  
    This change causes one of our functional tests to fail. We will have to track down the
    issue and either update the test or post the problem here.


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.11.0
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.12.0
>
>         Attachments: multirowgroup_overlap.parquet
>
>
> DRILL-1950 implemented filter pushdown for Parquet files, but only for the case of one
> row group per file. When a file contains multiple row groups, Drill detects that a
> row group can be pruned but then tells the drillbit to read the whole file, which leads
> to a performance issue.
> Having multiple row groups per file helps to handle partitioned datasets and still read
> only the relevant subset of data, without ending up with more files than really needed.
> For instance, say you have a Parquet file composed of RG1 and RG2 with only one
> column a. Min/max in RG1 are 1-2 and min/max in RG2 are 2-3.
> If I run "select a from file where a=3", today it will read the whole file; with the
> patch it will read only RG2.
> *For documentation*
> Support / Other section in https://drill.apache.org/docs/parquet-filter-pushdown/ should
> be updated.
> After the fix, files with multiple row groups will be supported.
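
The RG1/RG2 example in the description can be sketched as a min/max check per row group. This is a minimal illustration of the pruning idea only, not Drill's implementation: the RowGroupStats class and the row_groups_to_read function are hypothetical names, and the statistics are the ones given in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics of one row group."""
    name: str
    min_value: int
    max_value: int


def row_groups_to_read(row_groups, value):
    """Return the names of row groups whose [min, max] range can contain `value`.

    A row group is pruned when `value` lies outside its range, so an
    equality filter such as a = value cannot match any row in it.
    """
    return [rg.name for rg in row_groups if rg.min_value <= value <= rg.max_value]


# Statistics from the description: RG1 covers 1-2, RG2 covers 2-3.
file_row_groups = [RowGroupStats("RG1", 1, 2), RowGroupStats("RG2", 2, 3)]

# "select a from file where a=3" only needs RG2.
print(row_groups_to_read(file_row_groups, 3))
```

With a=2 both row groups are kept, because the two ranges overlap at 2; with a value outside both ranges no row group needs to be read.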



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
