Mailing-List: contact user-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
From: Isabelle Phan <nliphan@gmail.com>
Date: Tue, 20 Oct 2015 17:23:08 -0700
Message-ID: 
 <CAFQ3t_zgNka1fOZQZNUqyO-6F9VqF7TLHOCqDFfMAzckX1hoFA@mail.gmail.com>
Subject: How to distinguish columns when joining DataFrames with shared
 parent?
To: user <user@spark.apache.org>
Content-Type: multipart/alternative; boundary=001a11c26714254455052292639d

--001a11c26714254455052292639d
Content-Type: text/plain; charset=UTF-8

Hello,

When joining two DataFrames that originate from the same initial DataFrame,
why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method
distinguish which column to read?

Let me illustrate this question with a simple example (run on Spark 1.5.1):

//my initial DataFrame
scala> df
res39: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> df.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|   10|
|  2|    3|
|  3|   20|
|  3|    5|
|  4|   10|
+---+-----+
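
For anyone who wants to reproduce this, a DataFrame with the same contents
could be built roughly as follows. This is only a sketch: the original
construction of df is not shown here, and the code assumes the spark-shell's
predefined sc and sqlContext.

```scala
// Assumes a Spark 1.5 spark-shell, where sc and sqlContext are predefined.
import sqlContext.implicits._  // enables toDF on RDDs of tuples

// The same six (key, value) rows that df.show prints above.
val df = sc.parallelize(Seq(
  (1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)
)).toDF("key", "value")
```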


//two child DataFrames
scala> val smallValues = df.filter('value < 10)
smallValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> smallValues.show
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  2|    3|
|  3|    5|
+---+-----+


scala> val largeValues = df.filter('value >= 10)
largeValues: org.apache.spark.sql.DataFrame = [key: int, value: int]

scala> largeValues.show
+---+-----+
|key|value|
+---+-----+
|  1|   10|
|  3|   20|
|  4|   10|
+---+-----+


//Joining the children
scala> smallValues
  .join(largeValues, smallValues("key") === largeValues("key"))
  .withColumn("diff", smallValues("value") - largeValues("value"))
  .show
15/10/20 16:59:59 WARN Column: Constructing trivially true equals
predicate, 'key#41 = key#41'. Perhaps you need to use aliases.
+---+-----+---+-----+----+
|key|value|key|value|diff|
+---+-----+---+-----+----+
|  1|    1|  1|   10|   0|
|  3|    5|  3|   20|   0|
+---+-----+---+-----+----+


This last command issued a warning but still executed the join correctly
(rows with keys 2 and 4 don't appear in the result set). However, the "diff"
column is incorrect: it is 0 for both rows, where I expected 1 - 10 = -9 and
5 - 20 = -15.
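
Following the hint in the warning, a minimal sketch of an aliased version
might look like the following. The alias names "small" and "large" and the
val names are my own choice, and I am assuming that alias-qualified names
passed to functions.col resolve against the joined DataFrame.

```scala
import org.apache.spark.sql.functions.col

// Give each child DataFrame its own alias (names are illustrative).
val small = smallValues.as("small")
val large = largeValues.as("large")

// Refer to columns by alias-qualified name instead of smallValues("...")
// and largeValues("..."). The parentheses keep the chained calls together
// when pasted into the REPL.
val withDiff = (small
  .join(large, col("small.key") === col("large.key"))
  .withColumn("diff", col("small.value") - col("large.value")))

withDiff.show()
```

Here each column is looked up by its qualified name rather than through
smallValues(...) or largeValues(...), which both appear to resolve to the
same underlying attribute (key#41 in the warning above).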

Is this a bug or am I missing something here?


Thanks a lot for any input,

Isabelle

--001a11c26714254455052292639d--