spark-issues mailing list archives

From "Manan Bakshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts even rows that satisfy the filtering condition
Date Wed, 21 Feb 2018 04:05:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370917#comment-16370917 ]

Manan Bakshi commented on SPARK-23463:
--------------------------------------

Hi Marco,

That makes sense. However, this same code worked fine in Spark 2.1.1 regardless of
whether the comparison was against 0 or 0.0. Can you help me understand what changed?

> Filter operation fails to handle blank values and evicts even rows that satisfy the filtering condition
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23463
>                 URL: https://issues.apache.org/jira/browse/SPARK-23463
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>            Reporter: Manan Bakshi
>            Priority: Critical
>         Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost Based Optimizer was introduced to
look at table stats and decide filter selectivity. However, since then, filter has behaved
unexpectedly for blank values. The operation not only drops rows with blank values but also
filters out rows that actually meet the filter criteria.
> Steps to reproduce:
> Consider a simple dataframe with some blank values, as below (a construction sketch follows the table):
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
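>
> A minimal PySpark sketch for constructing this dataframe (an assumption here: the blank cell is an empty string, so Spark infers val as a string column):
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
> # val mixes numeric text with one blank entry, so it stays StringType
> df = spark.createDataFrame(
>     [("ALL", "0.01"), ("ALL", "0.02"), ("ALL", "0.004"),
>      ("ALL", ""), ("ALL", "2.5"), ("ALL", "4.5"), ("ALL", "45")],
>     ["dev", "val"])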
> Running a simple filter operation over the val column of this dataframe yields unexpected
results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
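> One way to see what the predicate evaluates to row by row (a sketch; the gt_int alias is illustrative):
>
> df.select(df["val"], (df["val"] > 0).alias("gt_int")).show()
>
> If the comparison involves an implicit cast of the string values (as the int-vs-float difference below suggests), entries that do not survive that cast come back null and are dropped by the filter.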
> However, the filter operation works as expected if the 0 in the filter condition is replaced
by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
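>
> A workaround sketch (again assuming val is a string column): cast the column explicitly, so the comparison type no longer depends on the literal:
>
> df.filter(df["val"].cast("double") > 0).show()
>
> Here the blank entry casts to null and is dropped, while the numeric strings compare as doubles whether the literal is written as 0 or 0.0.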
>  
> Note that this bug only exists in Spark 2.2.0 and later. Previous versions filter
as expected for both int (0) and float (0.0) values in the filter condition.
> Also, if there are no blank values, the filter operation works as expected in all versions.




