crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-299) Support predicate pushdown for Parquet sources
Date Fri, 22 Nov 2013 17:24:35 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830127#comment-13830127 ]

Gabriel Reid commented on CRUNCH-299:
-------------------------------------

FWIW, option A (giving a ColumnRecordFilter to Parquet Source) is in line with what we do
in HBase right now, i.e. you can provide a Scan object that does whatever kind of filtering
you want.

I can imagine that if we had some kind of common FieldValuePredicateFilterFn class, we could
potentially push it down (or up) to the source, which could choose to attempt to use it, something
like this:

{code}
    filteredCollection = collection.filter(new FieldValuePredicateFilterFn("make", eq("Volkswagen")));
{code}

This could then be pushed up to the source and be interpreted by an HBaseSource as "add an equality
filter to the scan on values of the 'make' column family", and interpreted by the Parquet
Source as "create a ColumnRecordFilter on the 'make' column". Obviously in the case of other
sources (e.g. text) it would just be ignored, and the filter would be executed as usual (which
I guess would mean using reflection to extract the field value). There are cases that I can
think of where that wouldn't work well, and probably a lot more that I can't think of. Stuff
like this is probably a lot easier in frameworks like Pig, Hive, and Cascading, where you know that
the values passing through the pipeline are all tuples.
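
To make the idea above concrete, here is a minimal, dependency-free sketch of the "offer the predicate to the source, fall back to reflection" flow. Everything in it is an assumption made for illustration: FieldValuePredicate, Source, tryPushDown, PlainSource, FilteringSource, and Car are hypothetical names, not existing Crunch, HBase, or Parquet APIs. The two in-memory sources merely stand in for a text-like source that ignores the predicate and a columnar source that applies it while reading.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class PushdownSketch {

    /** Hypothetical predicate on one named field; not an existing Crunch class. */
    static final class FieldValuePredicate {
        final String fieldName;
        final Predicate<Object> valueTest;

        FieldValuePredicate(String fieldName, Predicate<Object> valueTest) {
            this.fieldName = fieldName;
            this.valueTest = valueTest;
        }

        /** Fallback path: read the field reflectively, then apply the test. */
        boolean accept(Object record) {
            try {
                Field field = record.getClass().getDeclaredField(fieldName);
                field.setAccessible(true);
                return valueTest.test(field.get(record));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("No readable field: " + fieldName, e);
            }
        }
    }

    /** Hypothetical source contract: returns true if the source will apply the predicate itself. */
    interface Source<T> {
        boolean tryPushDown(FieldValuePredicate predicate);
        List<T> read();
    }

    /** Stand-in for a source (e.g. text) that cannot use the predicate. */
    static final class PlainSource<T> implements Source<T> {
        private final List<T> records;
        PlainSource(List<T> records) { this.records = records; }
        public boolean tryPushDown(FieldValuePredicate predicate) { return false; }
        public List<T> read() { return records; }
    }

    /** Stand-in for a source that filters while reading, as a columnar format might. */
    static final class FilteringSource<T> implements Source<T> {
        private final List<T> records;
        private FieldValuePredicate pushed;
        FilteringSource(List<T> records) { this.records = records; }
        public boolean tryPushDown(FieldValuePredicate predicate) {
            this.pushed = predicate;
            return true;
        }
        public List<T> read() {
            if (pushed == null) {
                return records;
            }
            List<T> out = new ArrayList<>();
            for (T record : records) {
                if (pushed.accept(record)) {
                    out.add(record);
                }
            }
            return out;
        }
    }

    /** Offers the predicate to the source first; filters in the pipeline only if it declines. */
    static <T> List<T> filter(Source<T> source, FieldValuePredicate predicate) {
        if (source.tryPushDown(predicate)) {
            return source.read();
        }
        List<T> out = new ArrayList<>();
        for (T record : source.read()) {
            if (predicate.accept(record)) {
                out.add(record);
            }
        }
        return out;
    }

    /** Equality helper mirroring the eq(...) used in the snippet above. */
    static Predicate<Object> eq(Object expected) {
        return expected::equals;
    }

    /** Example record type, used only for this sketch. */
    static final class Car {
        final String make;
        final String model;
        Car(String make, String model) { this.make = make; this.model = model; }
    }

    public static void main(String[] args) {
        List<Car> cars = Arrays.asList(new Car("Volkswagen", "Golf"), new Car("Toyota", "Corolla"));
        FieldValuePredicate isVw = new FieldValuePredicate("make", eq("Volkswagen"));
        System.out.println(filter(new PlainSource<>(cars), isVw).size());
        System.out.println(filter(new FilteringSource<>(cars), isVw).size());
    }
}
```

Both paths return the same records; the only difference is where the filtering happens. The reflective fallback also shows the weak spot mentioned above: it only works when the record is an object that actually has a field with the given name.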

That being said, I think that this also opens the discussion of how "smart" we want Crunch
to be, or how much we want to leave optimizations like that up to the user. A similar
discussion is the idea of letting Crunch automatically choose a join strategy based on observations
about the data.



> Support predicate pushdown for Parquet sources
> ----------------------------------------------
>
>                 Key: CRUNCH-299
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-299
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Tom White
>            Assignee: Josh Wills
>
> We should be able to push Crunch FilterFn down to a Parquet ColumnRecordFilter. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
