Date: Tue, 1 Dec 2015 17:33:11 +0000 (UTC)
From: "Apache Spark (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-12030) Incorrect results when aggregate
 joined data


    [ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034141#comment-15034141 ]

Apache Spark commented on SPARK-12030:
--------------------------------------

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10068

> Incorrect results when aggregate joined data
> --------------------------------------------
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Blocker
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has a foreign key fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins: I counted the distinct id1 values in the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> The results vary *(they are different on every run)* between 5899000 and 5900000, but are never equal to 5900729.
> In addition, I ran more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
> {code}
> This gives some results, but the following query returns *1*:
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?
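
For reference, the invariant that the report relies on can be sketched without Spark. The snippet below is a minimal pure-Python illustration, not Spark's join implementation; the helper `left_outer_join` and the toy rows are hypothetical. It assumes id1 is unique in t1 (as the reported counts indicate) and id2 is unique in t2 (an assumption, consistent with joined.count() equalling t1.count()). Under those assumptions a left outer join should preserve both the row count and the number of distinct id1 values, and no id1 should appear more than once.

```python
# Pure-Python sketch (no Spark): toy rows stand in for the cached t1 and t2.
from collections import Counter


def left_outer_join(left, right, left_key, right_key):
    """Join each left row to its matching right rows; keep unmatched left rows."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            joined.extend({**row, **match} for match in matches)
        else:
            # No match on the right side: keep the left row with a null key.
            joined.append({**row, right_key: None})
    return joined


# Hypothetical toy data: id1 is unique in t1, id2 is unique in t2.
t1 = [{"id1": i, "fk1": i % 4} for i in range(10)]
t2 = [{"id2": k} for k in range(3)]  # fk1 == 3 has no match in t2

joined = left_outer_join(t1, t2, "fk1", "id2")

# The three checks from the report, expressed on the toy data.
row_count = len(joined)
distinct_id1 = len({row["id1"] for row in joined})
id1_counts = Counter(row["id1"] for row in joined)
duplicated = [key for key, n in id1_counts.items() if n > 1]
```

With a correct join, `row_count` and `distinct_id1` both equal `len(t1)` and `duplicated` is empty. The reported behaviour (a distinct count below 5900729 that changes between runs, and a group-by that reports duplicates the row-level query cannot find) contradicts this invariant.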

