Date: Tue, 1 Dec 2015 17:33:11 +0000 (UTC)
From: "Apache Spark (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data


    [ https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034141#comment-15034141 ]

Apache Spark commented on SPARK-12030:
--------------------------------------

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10068

> Incorrect results when aggregate joined data
> --------------------------------------------
>
>                 Key: SPARK-12030
>                 URL: https://issues.apache.org/jira/browse/SPARK-12030
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>            Priority: Blocker
>         Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", "t1", "id1", 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", "t2").cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins: I counted distinct id1 from the joined table.
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> The results vary *(they are different on every run)* between 5899000 and
> 5900000 but are never equal to 5900729.
> In addition, I ran more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect()
> {code}
> This gives some results, but this query returns *1*:
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?
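
The expectation behind the report can be sketched in plain Python (not Spark), assuming id1 is unique in t1 and id2 is unique in t2, as the counts in the report suggest. The toy rows and the helper `left_outer_join` are hypothetical and only illustrate the invariant: a left outer join against a unique key should neither drop nor duplicate t1 rows, so the distinct id1 count should stay equal to the t1 row count.

```python
# Toy stand-ins for the cached tables; all names and values are hypothetical.
t1 = [{"id1": i, "fk1": i % 4} for i in range(10)]  # id1 is unique
t2 = [{"id2": k} for k in range(3)]                  # id2 is unique; fk1 == 3 has no match


def left_outer_join(left, right, left_key, right_key):
    """Hash-based left outer join: every left row appears at least once."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            joined.extend({**row, **m} for m in matches)
        else:
            # No match: keep the left row and null-fill the right-hand key.
            joined.append({**row, right_key: None})
    return joined


joined = left_outer_join(t1, t2, "fk1", "id2")
distinct_id1 = {row["id1"] for row in joined}

# With a unique right-hand key the join neither drops nor duplicates t1 rows.
assert len(joined) == len(t1)
assert len(distinct_id1) == len(t1)
```

A distinct count that falls below the row count of t1, or that changes between runs over cached data, breaks this invariant, which is what the report describes.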