From: Isabelle Phan
Date: Tue, 20 Oct 2015 17:23:08 -0700
Subject: How to distinguish columns when joining DataFrames with shared parent?
To: user

Hello,

When joining 2 DataFrames which originate from the same initial DataFrame, why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read?

Let me illustrate this question with a simple example (run on Spark 1.5.1):

//my initial DataFrame
scala> df
res39: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> df.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|   10|
|  2|    3|
|  3|   20|
|  3|    5|
|  4|   10|
+---+-----+

//2 children DataFrames
scala> val smallValues = df.filter('value < 10)
smallValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> smallValues.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  2|    3|
|  3|    5|
+---+-----+

scala> val largeValues = df.filter('value >= 10)
largeValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> largeValues.show
+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  3|   20|
|  4|   10|
+---+-----+

//Joining the children
scala> smallValues
  .join(largeValues, smallValues("key") === largeValues("key"))
  .withColumn("diff", smallValues("value") - largeValues("value"))
  .show
15/10/20 16:59:59 WARN Column: Constructing trivially true equals predicate, 'key#41 = key#41'. Perhaps you need to use aliases.
+---+-----+---+-----+----+
|key|value|key|value|diff|
+---+-----+---+-----+----+
|  1|    1|  1|   10|   0|
|  3|    5|  3|   20|   0|
+---+-----+---+-----+----+

This last command issued a warning, but still executed the join correctly (rows with keys 2 and 4 don't appear in the result set). However, the "diff" column is incorrect: it is 0 in both rows, where smallValues("value") - largeValues("value") should give -9 and -15.

Is this a bug or am I missing something here?

Thanks a lot for any input,

Isabelle
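
For reference, the hint in the warning ("Perhaps you need to use aliases") can be sketched as below. This is a minimal sketch under assumptions, not a confirmed answer about whether the behaviour above is a bug: it assumes the Spark 1.5-style SQLContext API, and the object name AliasJoinSketch, the helper joinWithAliases, the alias names "small" and "large", and the output column names are all made up for illustration. The idea is to refer to columns through the alias-qualified names instead of through smallValues("value") and largeValues("value"), which appear to resolve to the same underlying column of the shared parent.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical object name; assumes Spark SQL (1.5-style SQLContext) on the classpath.
object AliasJoinSketch {

  // Joins the two filtered children of `df` through aliases, so that every
  // column reference is qualified by the side of the join it belongs to.
  def joinWithAliases(sqlContext: SQLContext, df: DataFrame): DataFrame = {
    import sqlContext.implicits._

    // "small" and "large" are alias names chosen for this sketch.
    val smallValues = df.filter($"value" < 10).as("small")
    val largeValues = df.filter($"value" >= 10).as("large")

    smallValues
      .join(largeValues, $"small.key" === $"large.key")
      .select(
        $"small.key".as("key"),
        $"small.value".as("small_value"),
        $"large.value".as("large_value"),
        ($"small.value" - $"large.value").as("diff"))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("alias-join-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Same rows as the initial DataFrame in the example above.
    val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")

    joinWithAliases(sqlContext, df).show()
    sc.stop()
  }
}
```

If the aliases are resolved the way the warning suggests, the diff column should contain the small value minus the large value for keys 1 and 3 rather than 0. The sketch uses select with renamed columns instead of withColumn only to avoid ending up with two columns called "key" and two called "value" in the result.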