spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Join on DataFrames from the same source (Pyspark)
Date Fri, 24 Apr 2015 18:56:34 GMT
fixed in master:
https://github.com/apache/spark/commit/2d010f7afe6ac8e67e07da6bea700e9e8c9e6cc2

On Wed, Apr 22, 2015 at 12:19 AM, Karlson <ksonspark@siberie.de> wrote:

> DataFrames do not have the attributes 'alias' or 'as' in the Python API.
>
>
> On 2015-04-21 20:41, Michael Armbrust wrote:
>
>> This is https://issues.apache.org/jira/browse/SPARK-6231
>>
>> Unfortunately this is pretty hard to fix, as it's hard for us to
>> differentiate these without aliases.  However, you can add an alias as
>> follows:
>>
>> from pyspark.sql.functions import col
>> df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
>>
>> On Tue, Apr 21, 2015 at 8:10 AM, Karlson <ksonspark@siberie.de> wrote:
>>
>>  Sorry, my code actually was
>>>
>>>     df_one = df.select('col1', 'col2')
>>>     df_two = df.select('col1', 'col3')
>>>
>>> But in Spark 1.4.0 this does not seem to make any difference anyway and
>>> the problem is the same with both versions.
>>>
>>>
>>>
>>> On 2015-04-21 17:04, ayan guha wrote:
>>>
>>>  Your code should be
>>>>
>>>>  df_one = df.select('col1', 'col2')
>>>>  df_two = df.select('col1', 'col3')
>>>>
>>>> Your current code is generating a tuple, and of course df_1 and df_2
>>>> are different, so the join is yielding a Cartesian product.
>>>>
>>>> Best
>>>> Ayan
>>>>
>>>> On Wed, Apr 22, 2015 at 12:42 AM, Karlson <ksonspark@siberie.de> wrote:
>>>>
>>>>  Hi,
>>>>
>>>>>
>>>>> can anyone confirm (and if so elaborate on) the following problem?
>>>>>
>>>>> When I join two DataFrames that originate from the same source
>>>>> DataFrame,
>>>>> the resulting DF will explode to a huge number of rows. A quick
>>>>> example:
>>>>>
>>>>> I load a DataFrame with n rows from disk:
>>>>>
>>>>>     df = sql_context.parquetFile('data.parquet')
>>>>>
>>>>> Then I create two DataFrames from that source.
>>>>>
>>>>>     df_one = df.select(['col1', 'col2'])
>>>>>     df_two = df.select(['col1', 'col3'])
>>>>>
>>>>> Finally I want to (inner) join them back together:
>>>>>
>>>>>     df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'],
>>>>>                             'inner')
>>>>>
>>>>> The key in col1 is unique. The resulting DataFrame should have n rows,
>>>>> but it has n*n rows.
>>>>>
>>>>> That does not happen when I load df_one and df_two from disk
>>>>> directly. I am on Spark 1.3.0, but this also happens on the current
>>>>> 1.4.0 snapshot.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>
>>>>>
>>>>>
>
>
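
The n*n result described in the thread can be modelled without Spark. The
following is a plain-Python sketch, not Spark's implementation: it assumes
that columns are identified internally by attribute ids, and that two
DataFrames selected from the same source share the id for col1. The ids
("#1", "#4", ...) and the join() helper are hypothetical, chosen only for
illustration.

```python
from itertools import product

N = 4
# Rows keyed by hypothetical attribute ids rather than column names.
df_one = [{"#1": k, "#2": k * 10} for k in range(N)]   # col1, col2
df_two = [{"#1": k, "#3": k * 100} for k in range(N)]  # col1, col3


def join(left, right, left_id, right_id):
    """Inner join keeping pairs where the two referenced attributes match."""
    out = []
    for l, r in product(left, right):
        merged = {**l, **r}  # one lookup table for the joined row
        if merged[left_id] == merged[right_id]:
            out.append(merged)
    return out


# Same id on both sides: the condition reads "#1 == #1" and always holds,
# so every pair of rows survives the join.
ambiguous = join(df_one, df_two, "#1", "#1")

# Aliasing is modelled here as giving the right side fresh ids (#4 for col1).
df_two_aliased = [{"#4": r["#1"], "#5": r["#3"]} for r in df_two]
disambiguated = join(df_one, df_two_aliased, "#1", "#4")

print(len(ambiguous), len(disambiguated))
```

In this model the unaliased join keeps all N*N pairs, while the aliased join
keeps one pair per key, which matches the row counts reported in the
original question and the reason an alias on each side works around it.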
