spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yin Huai <yh...@databricks.com>
Subject Re: dataframe left joins are not working as expected in pyspark
Date Sat, 27 Jun 2015 20:56:43 GMT
Axel,

Can you file a jira and attach your code in the description of the jira?
This looks like a bug.

For the third row of df1, the name is "alice" instead of "carol", right?
Otherwise, "carol" should appear in the expected output.

Btw, to get rid of those columns with the same name after the join, you can
use select to pick columns you want to include in the results.

Thanks,

Yin

On Sat, Jun 27, 2015 at 11:29 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> I would test it against 1.3 to be sure, because it could -- though
> unlikely -- be a regression. For example, I recently stumbled upon this
> issue <https://issues.apache.org/jira/browse/SPARK-8670> which was
> specific to 1.4.
>
> On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl <axel@whisperstream.com> wrote:
>
>> I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
>> code would be failing right now.
>>
>> On Saturday, June 27, 2015, Nicholas Chammas <nicholas.chammas@gmail.com>
>> wrote:
>>
>>> Yeah, you shouldn't have to rename the columns before joining them.
>>>
>>> Do you see the same behavior on 1.3 vs 1.4?
>>>
>>> Nick
>>> 2015년 6월 27일 (토) 오전 2:51, Axel Dahl <axel@whisperstream.com>님이
작성:
>>>
>>>> still feels like a bug to have to create unique names before a join.
>>>>
>>>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>>
>>>>> You can declare the schema with unique names before creation of df.
>>>>> On 27 Jun 2015 13:01, "Axel Dahl" <axel@whisperstream.com> wrote:
>>>>>
>>>>>>
>>>>>> I have the following code:
>>>>>>
>>>>>> from pyspark import SQLContext
>>>>>>
>>>>>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
>>>>>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age':
3}]
>>>>>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
>>>>>> {'name':'alice', 'country': 'ire', 'colour':'green'}]
>>>>>>
>>>>>> r1 = sc.parallelize(d1)
>>>>>> r2 = sc.parallelize(d2)
>>>>>>
>>>>>> sqlContext = SQLContext(sc)
>>>>>> df1 = sqlContext.createDataFrame(d1)
>>>>>> df2 = sqlContext.createDataFrame(d2)
>>>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>>>>> 'left_outer').collect()
>>>>>>
>>>>>>
>>>>>> When I run it I get the following, (notice in the first row, all
join
>>>>>> keys are take from the right-side and so are blanked out):
>>>>>>
>>>>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>>>>> name=None),
>>>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>>>>>> country=u'usa', name=u'bob'),
>>>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>>> country=u'ire', name=u'alice')]
>>>>>>
>>>>>> I would expect to get (though ideally without duplicate columns):
>>>>>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
>>>>>> name=None),
>>>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>>>>>> country=u'usa', name=u'bob'),
>>>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>>> country=u'ire', name=u'alice')]
>>>>>>
>>>>>> The workaround for now is this rather clunky piece of code:
>>>>>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
>>>>>> 'name2').withColumnRenamed('country', 'country2')
>>>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>>>>> 'left_outer').collect()
>>>>>>
>>>>>> So to me it looks like a bug, but am I doing something wrong?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Axel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>

Mime
View raw message