From: "peay (JIRA)"
To: issues@spark.apache.org
Date: Thu, 17 Nov 2016 00:55:58 +0000 (UTC)
Subject: [jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions

[ https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15672245#comment-15672245 ]

peay edited comment on SPARK-18473 at 11/17/16 12:55 AM:
---------------------------------------------------------

Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?
(edit: nevermind, the fix appears to be in 2.0.2 indeed).

was (Author: peay):
Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> ------------------------------------------------------------
>
>                 Key: SPARK-18473
>                 URL: https://issues.apache.org/jira/browse/SPARK-18473
>             Project: Spark
>          Issue Type: Bug
>      Components: PySpark, Spark Core, SQL
>    Affects Versions: 2.0.1
>        Reporter: peay
>        Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return incorrect results. I believe the join should behave as the identity, but instead, some values are shuffled around, and some are just plain wrong.
>
> This can be reproduced as follows: joining
> {code}
> +-----+---------+------+--------+--------+----------+------+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-----+---------+------+--------+--------+----------+------+
> |    1|        1|     1|       0|       1|         0|     1|
> |    2|        2|     0|       0|       1|         0|     1|
> |    1|        3|     1|       0|       2|         0|     2|
> +-----+---------+------+--------+--------+----------+------+
> {code}
> with
> {code}
> +------+
> |sessId|
> +------+
> |     1|
> |     2|
> +------+
> {code}
> The result is
> {code}
> +------+-----+---------+------+--------+--------+----------+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +------+-----+---------+------+--------+--------+----------+
> |     1|    2|        2|     0|       0|       1|         0|
> |     2|    1|        1|     1|       0|       1|        -1|
> |     2|    1|        3|     1|       0|       2|         0|
> +------+-----+---------+------+--------+--------+----------+
> {code}
> Note how two rows have a sessId of 2 (instead of one row, as expected), and how `fiftyCount` can now be negative, while it is always zero in the original dataframe.
>
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
>
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }){code} after the other call to `fillna` -- although this has no visible effect on the dataframe as shown by `show`, as far as I can tell.
> - I use a LEFT OUTER join instead of an INNER join.
> - I write both dataframes to Parquet, read them back, and join those.
>
> This can be reproduced in pyspark using:
> {code}
> import pandas as pd
>
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
>
> df1 = sql_context.createDataFrame(
>     pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
>
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
>
> df2 = (
>     df1
>     .withColumn("hasOne", (col("index") == 1).cast("int"))
>     .withColumn("hasFifty", (col("index") == 50).cast("int"))
>     .withColumn("numOnesBefore", F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
>     .withColumn("numFiftyStrictlyBefore", F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
>     .fillna({ 'numFiftyStrictlyBefore': 0 })
>     .withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
> )
>
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
>
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
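For reference, the expected behaviour the report describes -- the INNER join acting as the identity on the first dataframe -- can be checked without Spark in a minimal pure-Python sketch. The rows below are transcribed from the tables in the report; this illustrates the expected result only, not the Spark code path that exhibits the bug:

```python
# Rows of the first dataframe, transcribed from the tables in the report.
df2_rows = [
    {"index": 1, "timeStamp": 1, "hasOne": 1, "hasFifty": 0, "oneCount": 1, "fiftyCount": 0, "sessId": 1},
    {"index": 2, "timeStamp": 2, "hasOne": 0, "hasFifty": 0, "oneCount": 1, "fiftyCount": 0, "sessId": 1},
    {"index": 1, "timeStamp": 3, "hasOne": 1, "hasFifty": 0, "oneCount": 2, "fiftyCount": 0, "sessId": 2},
]

# The selector dataframe: sessId values 1 and 2.
selector = {1, 2}

# An INNER join on sessId keeps exactly the rows whose key appears in the
# selector. Every sessId in df2_rows (1, 1, 2) is in {1, 2}, so the join
# should act as the identity on df2_rows.
joined = [row for row in df2_rows if row["sessId"] in selector]

assert joined == df2_rows
assert sum(1 for row in joined if row["sessId"] == 2) == 1  # exactly one sessId=2 row
assert all(row["fiftyCount"] == 0 for row in joined)        # fiftyCount never negative
```

This matches the correct results the reporter obtains with a LEFT OUTER join or after a Parquet round-trip, and differs from the incorrect INNER join output shown above (two sessId=2 rows and a negative `fiftyCount`).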