Date: Wed, 15 Mar 2017 22:32:41 +0000 (UTC)
From: "Hyukjin Kwon (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Resolved] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

     [ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19954.
----------------------------------
    Resolution: Duplicate

> Joining to a unioned DataFrame does not produce expected result.
> ----------------------------------------------------------------
>
>                 Key: SPARK-19954
>                 URL: https://issues.apache.org/jira/browse/SPARK-19954
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Arun Allamsetty
>            Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. When we join two DataFrames, one of which is the result of a union operation, the join returns data as if the other table had been joined only to the first table in the union. This issue is not present in Spark 2.0.0, 2.0.1, or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> // Each branch fills the columns it lacks with null so the three schemas line up for the union.
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), lit(null).as("colB"), $"colC"))
> // Join on both name and id; every row of xDf has a match in unionDf.
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result is:
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Forcing computation of {{unionDf}} with {{count}} does not change the result of the join. Writing the data to disk and reading it back does give the correct result, but that is definitely not ideal. Interestingly, caching {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}
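
The caching workaround mentioned in the report could look roughly like the sketch below. It is only an illustration under a few assumptions: the local SparkSession settings and the name cachedUnionDf are hypothetical, and tuples stand in for the report's case classes so the snippet can run as a plain script. It does not explain or fix the underlying bug; it only marks the unioned DataFrame as cached before the join, which the reporter says returned all three rows on 2.1.0.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a throwaway local session; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SPARK-19954-cache-workaround")
  .getOrCreate()
import spark.implicits._

// Same data as in the report, built from tuples instead of case classes.
val aDf = Seq((1L, true)).toDF("id", "colA")
val bDf = Seq((2L, 10)).toDF("id", "colB")
val cDf = Seq((3L, 9.73D)).toDF("id", "colC")
val xDf = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "name")

// Same union as in the report: missing columns are filled with null.
val unionDf =
  aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), lit(null).as("colC"))
    .union(bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", lit(null).as("colC")))
    .union(cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), lit(null).as("colB"), $"colC"))

// Workaround described in the report: cache the unioned DataFrame before joining.
// cachedUnionDf is a hypothetical name introduced for this sketch.
val cachedUnionDf = unionDf.cache()

val result = xDf.join(
  cachedUnionDf,
  cachedUnionDf("name") === xDf("name") && cachedUnionDf("id") === xDf("id"))
result.show()
```

Caching keeps the unioned data in memory or on local disk, so for a large union it may be worth calling unpersist() on the cached DataFrame once the join result has been consumed.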