Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 21 Mar 2017 23:28:41 +0000 (UTC)
From: "Hyukjin Kwon (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13057210.1489801656000.83042.1490138921550@Atlassian.JIRA>
In-Reply-To: <JIRA.13057210.1489801656000@Atlassian.JIRA>
References: <JIRA.13057210.1489801656000@Atlassian.JIRA> <JIRA.13057210.1489801656005@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-20008)
 hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
 returns 1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Tue, 21 Mar 2017 23:28:46 -0000


    [ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935540#comment-15935540 ]

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

[~smilegator], if I understood correctly, that discussion seems to be about duplicates in the result.

The problem here is that {{Set() - Set()}} should return an empty {{Set()}}, which is what it did previously.
However, it now seems to return {{Set(Row())}} for empty DataFrames.

In the current master,

{code}
scala> spark.emptyDataFrame.except(spark.emptyDataFrame).collect()
res0: Array[org.apache.spark.sql.Row] = Array([])

scala> spark.emptyDataFrame.collect()
res1: Array[org.apache.spark.sql.Row] = Array()
{code}

I thought S ∖ S = ∅, as below:

{code}
scala> spark.range(1).except(spark.range(1)).collect()
res14: Array[Long] = Array()
{code}
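
To make the expectation concrete, here is a minimal sketch in plain Scala (no Spark involved). The {{ExceptModel}} object and its {{Row}} alias are hypothetical stand-ins introduced only for illustration, assuming a row can be modeled as a sequence of column values; this is not how Spark implements {{except}}.

```scala
object ExceptModel {
  // Hypothetical stand-in for a row: an ordered sequence of column values.
  // A zero-column row is the empty sequence.
  type Row = Seq[Any]

  // Set difference with de-duplication: the distinct rows of `left`
  // that do not appear in `right`.
  def except(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val excluded = right.toSet
    left.distinct.filterNot(excluded.contains)
  }
}
```

Under this model, {{ExceptModel.except(Seq.empty, Seq.empty)}} is empty, which matches S ∖ S = ∅. The {{Array([])}} shown above would instead correspond to a result holding one zero-column row.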


> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-20008
>                 URL: https://issues.apache.org/jira/browse/SPARK-20008
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.2.0
>            Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 1 against the expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage point of view, and hence I consider it a bug. It may be a boundary case; I am not sure.
> Workaround: for now I check that the counts != 0 before this operation. This is not good for performance, hence I am creating a JIRA to track it.
> As Young Zhang explained in reply to my mail:
> Starting from Spark 2, these kinds of operations are implemented as a left anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is like the logic of comparing NULL = NULL, where it is unclear whether the result should be true or false, causing this kind of confusion.
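
The workaround mentioned in the quoted description (checking for emptiness before calling {{except}}) could be sketched as follows. {{ExceptWorkaround.safeExcept}} is a hypothetical helper name, not a Spark API, and using {{head(1)}} instead of a full {{count()}} is an assumption made here to reduce the cost of the check.

```scala
import org.apache.spark.sql.DataFrame

object ExceptWorkaround {
  // Hypothetical helper, not part of Spark.
  // If the left side has no rows, the difference must be empty, so return
  // the left side as-is and skip except() entirely.
  def safeExcept(left: DataFrame, right: DataFrame): DataFrame =
    if (left.head(1).isEmpty) left // head(1) fetches at most one row
    else left.except(right)
}
```

With this guard, {{ExceptWorkaround.safeExcept(spark.emptyDataFrame, spark.emptyDataFrame).count()}} returns 0 because {{except}} is never invoked on the empty input. It still costs an extra job per call, so it is a stopgap rather than a fix.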


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
