spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eyal Farago (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
Date Sat, 03 Nov 2018 07:50:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673957#comment-16673957
] 

Eyal Farago commented on SPARK-25548:
-------------------------------------

[~eaton], I think there are two possible approaches to handle this:

first would be extracting the partitions predicate and _And_ing it with the original predicate:
{code:java}
select * from src_par where 
(P_d in (2,3)) and 
((p_d=2 and key=2) or (p_d=3 and key=3))
{code}
second approach would be transforming this into a union:
{code:java}
select * from src_par where (p_d=2 and key=2) 
UNION ALL
select * from src_par where (p_d=3 and key=3)
{code}
I think second approach is easier to implement but it'd require additional rules to make sure
partitioned are not scanned multiple times, ie. consider what'd happen if your predicate looked
like this:
{code:java}
 (p_d=2 and key=2) or (p_d=3 and key=3) or (p_d=2 and key=33)
{code}
a naive approach would scan partition #2 twice while it's pretty obvious this can be avoided
by _OR_ing the first and third conditions.

The first approach seems a bit more complicated by I think it somewhat resembles what you've
started implementing in your pr, [~cloud_fan] your thoughts?

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true
in the And(partitionOps, nonPartitionOps) to make the partition can be pruned
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25548
>                 URL: https://issues.apache.org/jira/browse/SPARK-25548
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: eaton
>            Assignee: Apache Spark
>            Priority: Critical
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be pruned if
we use partition filter and non partition filter together, for example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned by(p_d int)
stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as value")
> The sql below will scan all the partition files, in which, the partition **p_d=4** should
be pruned.
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message