spark-dev mailing list archives

From Michael Armbrust
Subject Re: Dataframes: PrunedFilteredScan without Spark Side Filtering
Date Sun, 27 Sep 2015 21:08:31 GMT
We have to try to maintain binary compatibility here, so probably the
easiest thing to do would be to add a method to the class.  Perhaps
something like:

def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

By default, this would return all filters, so behavior would remain the
same, but specific implementations could override it to return only the
filters they cannot fully apply themselves.  There is still a chance that
this would conflict with methods already defined in existing
implementations, but hopefully that would not be a problem in practice.
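
As a rough illustration of the idea above, here is a minimal sketch of how
a source might override such a method. It does not use the real Spark
classes: Filter, EqualTo, GreaterThan and BaseRelation below are simplified
stand-ins, and SolrRelation and Demo are hypothetical names invented for
this example. It only assumes the signature proposed above.

```scala
// Simplified stand-ins for the data source filter types (assumptions,
// not the real org.apache.spark.sql.sources classes).
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// Stand-in for the relation base class carrying the proposed method.
abstract class BaseRelation {
  // Default: report every filter as unhandled, so the engine keeps
  // re-applying all of them and existing behavior is unchanged.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// Hypothetical source that fully evaluates equality on "solr_query" itself.
class SolrRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filter {
      case EqualTo("solr_query", _) => false // handled by the source
      case _                        => true  // still needs re-applying
    }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val pushed: Array[Filter] =
      Array(EqualTo("solr_query", "*:*"), GreaterThan("age", 21))
    // Only the filters returned here would be evaluated again by the engine.
    val remaining = new SolrRelation().unhandledFilters(pushed)
    println(remaining.mkString(", "))
  }
}
```

With something along these lines, the engine would re-apply only the
filters the source hands back, so the predicate on the nonexistent
"solr_query" column would never be evaluated against the returned rows.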



On Fri, Sep 25, 2015 at 10:02 PM, Russell Spitzer wrote:

> Hi! First time poster, long time reader.
> I'm wondering if there is a way to let Catalyst know that it doesn't need
> to repeat a filter on the Spark side after the filter has been applied by
> the source implementing PrunedFilteredScan.
> This is for a use case in which we accept a filter on a nonexistent column
> that serves as an entry point for our integration with a different system.
> While the source can correctly deal with this, the secondary filter done on
> the RDD itself wipes out the results because the column being filtered does
> not exist.
> In particular, this is with our integration with Solr, where we allow users
> to pass in a predicate based on "solr_query", à la (where solr_query='*:*').
> There is no column "solr_query", so the rdd.filter(row.solr_query == "*:*")
> filters out all of the data, since no rows will have that column.
> I'm thinking about a few solutions to this, but they all seem a little hacky:
> 1) Try to manually remove the filter step from the query plan after our
> source handles the filter
> 2) Populate the solr_query field being returned so they all automatically
> pass
> But I think the real solution is to add a way to create a PrunedFilteredScan
> which does not reapply filters if the source doesn't want it to, i.e. giving
> PrunedFilteredScan the ability to trust the underlying source that the filter
> will be accurately applied. Maybe changing the API to
> PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter],
> reapply: Boolean = true)
> where Catalyst can check the reapply value and not add an RDD.filter if it
> is false.
> Thoughts?
> Thanks for your time,
> Russ
