drill-issues mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-1950) Implement filter pushdown for Parquet
Date Tue, 05 May 2015 13:36:37 GMT

     [ https://issues.apache.org/jira/browse/DRILL-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jacques Nadeau updated DRILL-1950:
    Fix Version/s:     (was: 1.0.0)

> Implement filter pushdown for Parquet
> -------------------------------------
>                 Key: DRILL-1950
>                 URL: https://issues.apache.org/jira/browse/DRILL-1950
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Jason Altekruse
>            Assignee: Jacques Nadeau
>             Fix For: 1.2.0
>         Attachments: DRILL-1950.1.patch.txt
> The Parquet reader currently supports project pushdown, limiting the number of columns
> read, but it does not support filter pushdown to read only a subset of the requested records.
> Filter pushdown is particularly useful with Parquet files that contain statistics, most
> importantly per-page min and max values. Evaluating predicates against these values could
> save substantial reading and decoding time.
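The statistics-based skipping described above can be sketched as follows. This is a hypothetical illustration, not Drill's or parquet-mr's actual API: the `PageStats` holder and `canSkipForGreaterThan` method are invented names standing in for the page-level metadata a Parquet reader exposes.

```java
// Hypothetical sketch: skipping a Parquet page when its min/max statistics
// prove that no row in the page can satisfy the filter predicate.
public class PageFilterSketch {
    // Assumed per-page statistics holder; real Parquet readers expose
    // similar min/max metadata for each page.
    static final class PageStats {
        final long min, max;
        PageStats(long min, long max) { this.min = min; this.max = max; }
    }

    // For a predicate "col > threshold", the whole page can be skipped
    // when its maximum value does not exceed the threshold.
    static boolean canSkipForGreaterThan(PageStats stats, long threshold) {
        return stats.max <= threshold;
    }

    public static void main(String[] args) {
        PageStats page = new PageStats(10, 100);
        // WHERE col > 200: max is 100, so no row matches; skip the page.
        System.out.println(canSkipForGreaterThan(page, 200)); // true
        // WHERE col > 50: some rows may match; the page must be read.
        System.out.println(canSkipForGreaterThan(page, 50));  // false
    }
}
```

The saving comes from never decoding the skipped page's values at all, which is why min/max statistics matter most on data that is at least roughly sorted on the filtered column.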
> The largest barrier to implementing this is the current design of the reader. First, we
> currently have two separate Parquet readers: one for reading flat files very quickly and
> another for reading complex data. There are enhancements we can make to the flat reader to
> support nested data much more efficiently. However, the speed of the flat reader currently
> comes from making vectorized copies out of the Parquet file. This design is somewhat at odds
> with filter pushdown, as we can only make useful vectorized copies when the filter matches a
> long run of values within the file. That might not be too rare a case, since files are often
> roughly sorted on a primary field such as a date or a numeric key, and these are often the
> fields used to limit a query to a subset of the data. However, for cases where we are
> filtering out a few records here and there, we should just make individual copies.
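The run-versus-individual-copy tradeoff described above can be sketched as below. This is an illustrative standalone example, not Drill's value-vector API: it detects contiguous runs of rows passing the filter and copies each run with a single bulk `System.arraycopy` instead of one copy per matching row.

```java
// Hypothetical sketch: bulk-copying contiguous runs of filter-passing rows.
// When matches cluster into long runs (e.g. data sorted on the filter column),
// one arraycopy per run is much cheaper than one copy per matching row.
public class RunCopySketch {
    // Copy all values satisfying "v >= lo" from in to out, one bulk copy
    // per contiguous run of matches. Returns the number of rows copied.
    static int copyMatchingRuns(long[] in, long[] out, long lo) {
        int outPos = 0, i = 0;
        while (i < in.length) {
            if (in[i] < lo) { i++; continue; }          // skip non-matching row
            int runStart = i;
            while (i < in.length && in[i] >= lo) i++;   // extend the matching run
            int runLen = i - runStart;
            System.arraycopy(in, runStart, out, outPos, runLen); // one bulk copy
            outPos += runLen;
        }
        return outPos;
    }

    public static void main(String[] args) {
        long[] in = {1, 5, 6, 7, 2, 8, 9};
        long[] out = new long[in.length];
        // Two runs match "v >= 5": [5, 6, 7] and [8, 9], copied in two bulk moves.
        System.out.println(copyMatchingRuns(in, out, 5)); // 5
    }
}
```

When matches are scattered, the runs degenerate to length one and this approach collapses to individual copies, which is the balance-of-use-cases question the issue raises.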
> We need to do more design work to find the best way to balance performance across these
> use cases.

This message was sent by Atlassian JIRA
