spark-issues mailing list archives

From Tomasz Gawęda (JIRA) <>
Subject [jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
Date Tue, 24 Jul 2018 11:19:00 GMT


Tomasz Gawęda commented on SPARK-24288:

[~smilegator] Yes, you are right. If we don't want to use barriers (as mentioned by [~rxin]
in mail), we can add an option to disable predicate pushdown for the JDBC source. I've also thought
about adding a custom optimizer rule, but I'm probably not familiar enough with Spark internals yet.

> Enable preventing predicate pushdown
> ------------------------------------
>                 Key: SPARK-24288
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Priority: Major
>         Attachments: SPARK-24288.simple.patch
> Issue discussed on Mailing List: []
> While working with the JDBC data source I saw that many "or" clauses with
> non-equality operators cause a huge performance degradation of the SQL query
> sent to the database (DB2). For example:
> val df = spark.read.format("jdbc").(other options to parallelize load).load()
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application the pushed predicates were many, many lines long, with many ANDs and ORs
> If I use cache() before where, there is no predicate pushdown of this
> "where" clause. However, in a production system caching many sources is a
> waste of memory (especially if the pipeline is long and I must cache many
> times). There are also a few more workarounds, but it would be great if Spark
> supported preventing predicate pushdown by the user.
> For example: df.withAnalysisBarrier().where(...) ?
> Note that this should not be a global configuration option. If I read 2 DataFrames,
> df1 and df2, I would like to specify that df1 should not have some predicates pushed down
> (while others may be), but df2 should have all predicates pushed down, even if the target query joins
> df1 and df2. As far as I understand the Spark optimizer, if we use functions like `withAnalysisBarrier`
> and put an AnalysisBarrier explicitly in the logical plan, then predicates won't be pushed down on
> this particular DataFrame, and predicate pushdown will still be possible on the second one.
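
For illustration, the per-source (rather than global) control discussed above could look roughly like the sketch below. It does not use the `withAnalysisBarrier` API proposed in the issue. Instead it assumes a Spark release (2.4.0 or later) whose JDBC source accepts a `pushDownPredicate` option. The connection URL, table names, column names, and the `readTable` helper are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PushdownPerSource {
  // Hypothetical connection string, for illustration only.
  val jdbcUrl = "jdbc:db2://dbhost:50000/SAMPLE"

  // Hypothetical helper: reads one table over JDBC and toggles predicate
  // pushdown for that source only. Assumes the JDBC source understands
  // the "pushDownPredicate" option (Spark 2.4.0+).
  def readTable(spark: SparkSession, table: String, pushDown: Boolean): DataFrame =
    spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", table)
      .option("pushDownPredicate", pushDown.toString)
      .load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-per-source").getOrCreate()

    // df1: filters are evaluated by Spark after the rows are fetched.
    val df1 = readTable(spark, "TABLE1", pushDown = false)
    // df2: filters may still be sent to the database.
    val df2 = readTable(spark, "TABLE2", pushDown = true)

    val filtered1 = df1.where(
      "(date1 > '2018-01-01' and (date1 < '2018-07-01' or date1 is null) or x > 100)")
    val filtered2 = df2.where("id > 10")

    // Print the physical plan to check which filters reached each JDBC scan.
    filtered1.join(filtered2, "id").explain()

    spark.stop()
  }
}
```

Note that this switch is all-or-nothing per source: it cannot keep some predicates on df1 pushed down while blocking others, which is part of what the issue asks for.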
