spark-issues mailing list archives

From "Takeshi Yamamuro (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-13801) DataFrame.col should return unresolved attribute
Date Thu, 14 Apr 2016 08:16:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240787#comment-15240787 ]

Takeshi Yamamuro commented on SPARK-13801:
------------------------------------------

Your example is not related to this ticket.
Actually, the current master returns the correct answer:
{code}
+--------------+--------------+--------------+                                  
|coalesce(b, b)|coalesce(c, c)|coalesce(d, d)|
+--------------+--------------+--------------+
|             0|             0|             0|
|             1|             1|             1|
+--------------+--------------+--------------+
{code}

> DataFrame.col should return unresolved attribute
> ------------------------------------------------
>
>                 Key: SPARK-13801
>                 URL: https://issues.apache.org/jira/browse/SPARK-13801
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>
> Recently I saw some JIRAs complaining about wrong results when using the DataFrame API. After checking their queries, I found the wrong results were caused by indirect self-joins, where users build wrong join conditions. For example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. However, a resolved attribute is not a real column, because a logical plan's output may change. For example, we generate new output for the right child in a self-join.
> My proposal is: `DataFrame.col` should always return an unresolved attribute. We can still do the resolution to make sure the given column name is resolvable, but instead of returning the resolved attribute, we just take the name and wrap it in UnresolvedAttribute.
> Now if users run the example query I gave at the beginning, they will get an analysis exception, and they will understand that they need to alias df and df2 before the join.
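
For illustration, the aliasing mentioned at the end of the description might look roughly like the sketch below. It is not taken from the ticket: it assumes a Spark 2.x or later build with a local SparkSession, and the sample data, the filter predicate, and the alias names `l` and `r` are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AliasedSelfJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("aliased-self-join")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the elided `df` and `df.filter(...)`.
    val df  = Seq(1, 2, 3).toDF("key")
    val df2 = df.filter($"key" > 1)

    // Give each side of the indirect self-join its own alias, then
    // refer to columns by qualified name instead of df("key") / df2("key").
    val left  = df.as("l")
    val right = df2.as("r")
    val joined = left.join(right, (col("l.key") + 1) === col("r.key"))

    joined.show()
    spark.stop()
  }
}
```

Because `col("l.key")` and `col("r.key")` are resolved by qualified name during analysis, the join condition no longer depends on which DataFrame object a column reference was taken from.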




