spark-issues mailing list archives

From "holdenk (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24780) DataFrame.column_name should take into account DataFrame alias for future joins
Date Wed, 11 Jul 2018 01:48:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

holdenk updated SPARK-24780:
----------------------------
    Description: 
If we join a DataFrame with another DataFrame that has the same column names as those used in
the join condition (e.g. because of shared lineage on one of the columns), then even when the
join condition is written with the full DataFrame name, the columns returned don't carry the
DataFrame alias, and as such the join turns into a cross join.

For example, this currently works even if posts_by_sampled_authors and mailing_list_posts_in_reply_to
both contain in_reply_to and message_id fields:

 
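{code:python}
from pyspark.sql import functions as F  # assumed import; F is not defined in the original snippet
# Works: both sides of the condition are referenced by qualified name.
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [F.col("mailing_list_posts_in_reply_to.in_reply_to") == F.col("posts_by_sampled_authors.message_id")],
    "inner")
{code}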
 

But a similarly written expression:
{code:python}
# Fails: same condition, but written with DataFrame attribute access.
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [mailing_list_posts_in_reply_to.in_reply_to == posts_by_sampled_authors.message_id],
    "inner")
{code}
will fail.

 

I'm not super sure what's going on inside the resolution that's causing it to get confused.

  was:
If we join a DataFrame with another DataFrame that has the same column names as those used in
the join condition (e.g. because of shared lineage on one of the columns), then even when the
join condition is written with the full DataFrame name, the columns returned don't carry the
DataFrame alias, and as such the join turns into a cross join.

For example, this currently works even if posts_by_sampled_authors and mailing_list_posts_in_reply_to
both contain in_reply_to and message_id fields:

 
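{code:python}
from pyspark.sql import functions as F  # assumed import; F is not defined in the original snippet
# Works: both sides of the condition are referenced by qualified name.
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [F.col("mailing_list_posts_in_reply_to.in_reply_to") == F.col("posts_by_sampled_authors.message_id")],
    "inner")
{code}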
 

But a similarly written expression:
{code:python}
# Fails: same condition, but written with DataFrame attribute access.
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [mailing_list_posts_in_reply_to.in_reply_to == posts_by_sampled_authors.message_id],
    "inner")
{code}
will fail.

 

We could fix this by having dataframe.column in PySpark return the fully qualified column
reference when the DataFrame has an alias.
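
The proposal above can be illustrated with a small, self-contained toy model. This is only a
sketch: ToyDataFrame and ColumnRef are hypothetical stand-ins written for this example, they are
not PySpark classes, and the code makes no claim about how PySpark actually resolves columns. It
shows just the intended behavior, namely that attribute access on an aliased DataFrame would
yield a reference qualified by that alias.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical column reference, optionally qualified by a DataFrame alias."""
    name: str
    qualifier: Optional[str] = None

    def __str__(self) -> str:
        # Render "alias.column" when qualified, plain "column" otherwise.
        return f"{self.qualifier}.{self.name}" if self.qualifier else self.name


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame: column names plus an optional alias."""

    def __init__(self, columns: Tuple[str, ...], alias_name: Optional[str] = None):
        self._columns = tuple(columns)
        self._alias = alias_name

    def alias(self, name: str) -> "ToyDataFrame":
        # Return a new frame with the same columns under the given alias.
        return ToyDataFrame(self._columns, name)

    def __getattr__(self, name: str) -> ColumnRef:
        # Called only when normal attribute lookup fails, i.e. for column names.
        if name.startswith("_") or name not in self._columns:
            raise AttributeError(name)
        # The proposed behavior: qualify the reference with the alias, if any.
        return ColumnRef(name, self._alias)


# Two frames that expose the same column names, as in the report.
columns = ("message_id", "in_reply_to")
posts = ToyDataFrame(columns).alias("posts_by_sampled_authors")
replies = ToyDataFrame(columns).alias("mailing_list_posts_in_reply_to")

print(replies.in_reply_to, "==", posts.message_id)
```

With qualified references, the two sides of the join condition stay distinguishable by name even
when both DataFrames expose the same column names.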


> DataFrame.column_name should take into account DataFrame alias for future joins
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-24780
>                 URL: https://issues.apache.org/jira/browse/SPARK-24780
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.4.0
>            Reporter: holdenk
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


