spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matt Pollock (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-11231) join returns schema with duplicated and ambiguous join columns
Date Wed, 21 Oct 2015 15:28:27 GMT

     [ https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Matt Pollock updated SPARK-11231:
---------------------------------
    Description: 
In the case where the key column of two data frames are named the same thing, join returns
a data frame where that column is duplicated. Since the content of the columns is guaranteed
to be the same by row consolidating the identical columns into a single column would replicate
standard R behavior[1] and help prevent ambiguous names.

Example:
{code}
> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> sdf1 <- createDataFrame(sqlContext, df1)
> sdf2 <- createDataFrame(sqlContext, df2)
> sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> schema(sjdf)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value1", type = "DoubleType", nullable = TRUE
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value2", type = "DoubleType", nullable = TRUE
{code}

The duplicated key columns cause things like:
{code}
> library(magrittr)
> sjdf %>% select("key")
15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125,
key#127.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
	at org.apache.spark.sql.catalyst.tree
{code}

[1] In base R a similar function merge is provided which a "by" argument identifies the shared
key column in the two data frames. In the case where the key column names differ "by.x" and
"by.y" arguments can be used. In the case of same-named key columns the consolidation behavior
requested above is observed. In the case of differing names they "by.x" name is retained and
consolidated with the "by.y" column which is dropped.
{code}
> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> merge(df1, df2, by="key")
  key value1 value2
1   A      1      4
2   B      2      5
3   C      3      6

df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> merge(df2, df3, by.x="key", by.y="akey")
  key value2 value1
1   A      4      1
2   B      5      2
3   C      6      3
> merge(df3, df2, by.x="akey", by.y="key")
  akey value1 value2
1    A      1      4
2    B      2      5
3    C      3      6
{code}

  was:
In the case where the key column of two data frames are named the same thing, join returns
a data frame where that column is duplicated. Since the content of the columns is guaranteed
to be the same by row consolidating the identical columns into a single column would replicate
standard R behavior* and help prevent ambiguous names.

Example:
{code}
> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> sdf1 <- createDataFrame(sqlContext, df1)
> sdf2 <- createDataFrame(sqlContext, df2)
> sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> schema(sjdf)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value1", type = "DoubleType", nullable = TRUE
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value2", type = "DoubleType", nullable = TRUE
{code}

The duplicated key columns cause things like:
{code}
> library(magrittr)
> sjdf %>% select("key")
15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125,
key#127.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
	at org.apache.spark.sql.catalyst.tree
{code}

* In base R a similar function merge is provided which a "by" argument identifies the shared
key column in the two data frames. In the case where the key column names differ "by.x" and
"by.y" arguments can be used. In the case of same-named key columns the consolidation behavior
requested above is observed. In the case of differing names they "by.x" name is retained and
consolidated with the "by.y" column which is dropped.
{code}
> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> merge(df1, df2, by="key")
  key value1 value2
1   A      1      4
2   B      2      5
3   C      3      6

df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> merge(df2, df3, by.x="key", by.y="akey")
  key value2 value1
1   A      4      1
2   B      5      2
3   C      6      3
> merge(df3, df2, by.x="akey", by.y="key")
  akey value1 value2
1    A      1      4
2    B      2      5
3    C      3      6
{code}


> join returns schema with duplicated and ambiguous join columns
> --------------------------------------------------------------
>
>                 Key: SPARK-11231
>                 URL: https://issues.apache.org/jira/browse/SPARK-11231
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.1
>         Environment: R
>            Reporter: Matt Pollock
>
> In the case where the key column of two data frames are named the same thing, join returns
a data frame where that column is duplicated. Since the content of the columns is guaranteed
to be the same by row consolidating the identical columns into a single column would replicate
standard R behavior[1] and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125,
key#127.;
> 	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
> 	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
> 	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
> 	at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R a similar function merge is provided which a "by" argument identifies the
shared key column in the two data frames. In the case where the key column names differ "by.x"
and "by.y" arguments can be used. In the case of same-named key columns the consolidation
behavior requested above is observed. In the case of differing names they "by.x" name is retained
and consolidated with the "by.y" column which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A      1      4
> 2   B      2      5
> 3   C      3      6
> df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A      4      1
> 2   B      5      2
> 3   C      6      3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message