spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-9814) EqualNullSafe not passing to data sources
Date Sat, 15 Aug 2015 04:07:45 GMT

     [ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-9814:
--------------------------------
    Summary: EqualNullSafe not passing to data sources  (was: EqualNotNull not passing to
data sources)

> EqualNullSafe not passing to data sources
> -----------------------------------------
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> When a data source (such as Parquet) filters data while reading from HDFS (not
in memory), the physical planning phase passes it the filter objects in {{org.apache.spark.sql.sources}},
which are built and selected by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
> On the other hand, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}},
even though it seems possible to pass it to data sources such as Parquet and JSON. In
more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}}
and {{PrunedScan}},
> {code}
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
> {code}
> even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747).
> I understand that {{CatalystScan}} can take all the raw expressions from the
query planner. However, it is experimental, requires a different interface, and is
unstable for reasons such as binary compatibility.
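
To make the request concrete, here is a minimal sketch of how a source could evaluate a pushed-down null-safe equality filter once it receives one. The `Filter`, `EqualTo`, and `EqualNullSafe` types below are simplified, hypothetical stand-ins defined only for this example (they are not the real classes in {{org.apache.spark.sql.sources}}), and `PushdownSketch`, `matches`, and `scan` are illustrative names, not Spark APIs.

```scala
// Hypothetical, simplified stand-ins for the filter classes in
// org.apache.spark.sql.sources; these are NOT the real Spark types.
sealed trait Filter
final case class EqualTo(attribute: String, value: Any) extends Filter
final case class EqualNullSafe(attribute: String, value: Any) extends Filter

object PushdownSketch {
  // A row is modelled as a map from column name to a possibly-null value.
  type Row = Map[String, Any]

  // Evaluate one pushed-down filter against one row, as a source might do.
  def matches(filter: Filter, row: Row): Boolean = filter match {
    case EqualTo(attr, value) =>
      // Plain equality never matches when either side is null.
      val v = row.getOrElse(attr, null)
      v != null && value != null && v == value
    case EqualNullSafe(attr, value) =>
      // Null-safe equality treats two nulls as equal.
      val v = row.getOrElse(attr, null)
      if (v == null || value == null) v == null && value == null
      else v == value
  }

  // Keep only the rows that satisfy every pushed-down filter.
  def scan(rows: Seq[Row], filters: Seq[Filter]): Seq[Row] =
    rows.filter(row => filters.forall(f => matches(f, row)))
}
```

With a filter like this available at the source, rows that cannot match are dropped before they leave the source instead of being filtered afterwards by Spark.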
> In general, the problem below can happen.
> 1.
> {code:sql}
> SELECT * FROM table WHERE field = 1;
> {code}
>  
> 2. 
> {code:sql}
> SELECT * FROM table WHERE field <=> 1;
> {code}
> The second query can be much slower even though the functionality is almost identical, because
the unfiltered data from the source RDD can cause heavy network traffic (among other costs).
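
The two operators differ only in how they treat NULL, which is why the queries above are nearly equivalent for a non-null literal such as 1. The following sketch models that semantics with `Option[Int]` standing in for a nullable integer column; `NullSafeEquality` and its methods are names made up for this illustration and are not part of Spark.

```scala
object NullSafeEquality {
  // SQL `=`: the result is unknown (None) when either operand is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: never unknown; two NULLs compare as equal.
  def sqlNullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

  // WHERE keeps a row only when the predicate is definitely true.
  def whereEq(column: Seq[Option[Int]], literal: Option[Int]): Seq[Option[Int]] =
    column.filter(v => sqlEq(v, literal).contains(true))

  def whereNullSafeEq(column: Seq[Option[Int]], literal: Option[Int]): Seq[Option[Int]] =
    column.filter(v => sqlNullSafeEq(v, literal))
}
```

For a literal of 1 both functions keep exactly the same rows; they diverge only when the literal is NULL, where `whereEq` keeps nothing and `whereNullSafeEq` keeps the NULL rows.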



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


