From: "peay (JIRA)"
To: issues@spark.apache.org
Date: Thu, 17 Nov 2016 00:55:58 +0000 (UTC)
Subject: [jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions

[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672245#comment-15672245 ]

peay edited comment on SPARK-18473 at 11/17/16 12:55 AM:
---------------------------------------------------------

Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?
(edit: nevermind, the fix appears to be in 2.0.2 indeed).

was (Author: peay):
Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> ------------------------------------------------------------
>
>                 Key: SPARK-18473
>                 URL: https://issues.apache.org/jira/browse/SPARK-18473
>             Project: Spark
>          Issue Type: Bug
>      Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.0.1
>        Reporter: peay
>        Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong.
>
> This can be reproduced as follows: joining
> {code}
> +-----+---------+------+--------+--------+----------+------+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-----+---------+------+--------+--------+----------+------+
> |    1|        1|     1|       0|       1|         0|     1|
> |    2|        2|     0|       0|       1|         0|     1|
> |    1|        3|     1|       0|       2|         0|     2|
> +-----+---------+------+--------+--------+----------+------+
> {code}
> with
> {code}
> +------+
> |sessId|
> +------+
> |     1|
> |     2|
> +------+
> {code}
> The result is
> {code}
> +------+-----+---------+------+--------+--------+----------+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +------+-----+---------+------+--------+--------+----------+
> |     1|    2|        2|     0|       0|       1|         0|
> |     2|    1|        1|     1|       0|       1|        -1|
> |     2|    1|        3|     1|       0|       2|         0|
> +------+-----+---------+------+--------+--------+----------+
> {code}
> Note how two rows have a sessId of 2 (instead of one row, as expected), and how `fiftyCount` can now be negative, while it is always zero in the original dataframe.
>
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
>
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }){code} after the other call to `fillna` -- although this has no visible effect on the dataframe as shown by `show`, as far as I can tell.
> - I use a LEFT OUTER join instead of an INNER join.
> - I write both dataframes to Parquet, read them back, and join those.
>
> This can be reproduced in pyspark using:
> {code}
> import pandas as pd
>
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
>
> df1 = sql_context.createDataFrame(
>     pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
>
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
>
> df2 = (
>     df1
>     .withColumn("hasOne", (col("index") == 1).cast("int"))
>     .withColumn("hasFifty", (col("index") == 50).cast("int"))
>     .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
>     .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
>     .fillna({ 'numFiftyStrictlyBefore': 0 })
>     .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
> )
>
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
>
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
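For reference, the expected behaviour the report describes -- the INNER join acting as the identity on the first dataframe -- can be checked without Spark in a minimal pure-Python sketch. The rows below are transcribed from the tables in the report; this illustrates the expected result only, not the Spark code path that exhibits the bug:

```python
# Rows of the first dataframe, transcribed from the tables in the report.
df2_rows = [
    {"index": 1, "timeStamp": 1, "hasOne": 1, "hasFifty": 0, "oneCount": 1, "fiftyCount": 0, "sessId": 1},
    {"index": 2, "timeStamp": 2, "hasOne": 0, "hasFifty": 0, "oneCount": 1, "fiftyCount": 0, "sessId": 1},
    {"index": 1, "timeStamp": 3, "hasOne": 1, "hasFifty": 0, "oneCount": 2, "fiftyCount": 0, "sessId": 2},
]

# The selector dataframe: sessId values 1 and 2.
selector = {1, 2}

# An INNER join on sessId keeps exactly the rows whose key appears in the
# selector. Every sessId in df2_rows (1, 1, 2) is in {1, 2}, so the join
# should act as the identity on df2_rows.
joined = [row for row in df2_rows if row["sessId"] in selector]

assert joined == df2_rows
assert sum(1 for row in joined if row["sessId"] == 2) == 1  # exactly one sessId=2 row
assert all(row["fiftyCount"] == 0 for row in joined)        # fiftyCount never negative
```

This matches the correct results the reporter obtains with a LEFT OUTER join or after a Parquet round-trip, and differs from the incorrect INNER join output shown above (two sessId=2 rows and a negative `fiftyCount`).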