drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file
Date Tue, 19 Sep 2017 17:17:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172040#comment-16172040 ]

ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

GitHub user dprofeta opened a pull request:

    https://github.com/apache/drill/pull/949

    DRILL-5795: Parquet Filter push down at rowgroup level

    Before this commit, the filter could only prune complete files. When a file
    is composed of multiple rowgroups, it could not prune an individual
    rowgroup from the file. Now, when the filter finds that a rowgroup
    doesn't match, that rowgroup is removed from the scan.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dprofeta/drill drill-5795

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/949.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #949
    
----
commit eed3395647b10d06edf86ba4378995e9fd8da83d
Author: Damien Profeta <damien.profeta@amadeus.com>
Date:   2017-09-15T18:01:58Z

    Parquet Filter push down now works at rowgroup level
    
    Before this commit, the filter could only prune complete files. When a file
    is composed of multiple rowgroups, it could not prune an individual
    rowgroup from the file. Now, when the filter finds that a rowgroup
    doesn't match, that rowgroup is removed from the scan.

----


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the case of one
> rowgroup per parquet file. With multiple rowgroups per file, it detects that a
> rowgroup can be pruned but then tells the drillbit to read the whole file, which leads to
> performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets while still reading
> only the relevant subset of data, without ending up with more files than really needed.
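
The idea in the description above can be illustrated with a minimal sketch. This is not Drill's code and not the change in the pull request: the class RowGroupPruningSketch, the RowGroup type and the canPrune/prune methods are hypothetical names, and the sketch assumes an equality filter on a single long column whose per-rowgroup min/max statistics are already known. It only shows the difference between keeping a whole file and keeping just the rowgroups that may match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Drill.
final class RowGroupPruningSketch {

    /** Stand-in for one rowgroup's min/max statistics on a single column. */
    static final class RowGroup {
        final String file;
        final int index;
        final long min;
        final long max;

        RowGroup(String file, int index, long min, long max) {
            this.file = file;
            this.index = index;
            this.min = min;
            this.max = max;
        }
    }

    /** True when the range [min, max] proves that no row can equal target. */
    static boolean canPrune(RowGroup rg, long target) {
        return target < rg.min || target > rg.max;
    }

    /** Keeps only the rowgroups that may contain target, even within one file. */
    static List<RowGroup> prune(List<RowGroup> rowGroups, long target) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup rg : rowGroups) {
            if (!canPrune(rg, target)) {
                kept.add(rg);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // One file with three rowgroups covering disjoint value ranges.
        List<RowGroup> scan = Arrays.asList(
                new RowGroup("part-0.parquet", 0, 1, 100),
                new RowGroup("part-0.parquet", 1, 101, 200),
                new RowGroup("part-0.parquet", 2, 201, 300));
        for (RowGroup rg : prune(scan, 150)) {
            System.out.println(rg.file + " rowgroup " + rg.index);
        }
    }
}
```

With file-level pruning, the file in this example would be read in full because one of its rowgroups may match; pruning per rowgroup leaves a single rowgroup in the scan.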



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
