spark-issues mailing list archives

From "Liang-Chi Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-9323) DataFrame does not properly resolve nested columns
Date Sat, 25 Jul 2015 10:16:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641496#comment-14641496 ]

Liang-Chi Hsieh edited comment on SPARK-9323 at 7/25/15 10:15 AM:
------------------------------------------------------------------

Currently, ResolveAliases resolves "a.b" to the alias Alias("a.b" AS "b"), so subsequent
plans cannot refer to an attribute named "a.b".

sql("SELECT a.b FROM nestedOrder ORDER BY a.b") works because ResolveSortReferences has
special handling for Sort.

Without such handling, sql("SELECT a.b FROM nestedOrder HAVING a.b = 1") throws the same
error, but sql("SELECT a.b FROM nestedOrder HAVING b = 1") works.

The following code works too.
{code}
sqlContext.read.json(sqlContext.sparkContext.makeRDD(
    """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
  .registerTempTable("nestedOrder")
checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("b"), Row(1))

val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
checkAnswer(df.select("a.b").filter("b = b"), Row(1))
{code}
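
To illustrate the behavior described above, here is a toy model in plain Scala. It is only a sketch under assumptions: the object, case class, and function names are hypothetical and are not Catalyst classes or Spark APIs. It mimics two things only: the projection exposes the alias "b", and a later operator can resolve only the names in that output.

```scala
// Toy model only: these names are hypothetical and are not Spark's Catalyst classes.
object AliasResolutionSketch {
  // A child plan exposes a flat list of output attribute names.
  final case class Output(names: Seq[String])

  // Projecting the nested field "a.b" yields an output column named after
  // the last path segment ("b"), mirroring Alias("a.b" AS "b") above.
  def project(path: String): Output =
    Output(Seq(path.split('.').last))

  // A later operator can only resolve names present in its child's output.
  def resolve(name: String, child: Output): Either[String, String] =
    if (child.names.contains(name)) Right(name)
    else Left(s"""Cannot resolve column name "$name" among (${child.names.mkString(", ")})""")
}
```

In this model, resolve("b", project("a.b")) succeeds with Right("b"), while resolve("a.b", project("a.b")) fails with a Left whose message lists only (b), which is the shape of the error in the report.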







> DataFrame does not properly resolve nested columns
> --------------------------------------------------
>
>                 Key: SPARK-9323
>                 URL: https://issues.apache.org/jira/browse/SPARK-9323
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1, 1.4.1, 1.5.0
>            Reporter: Josh Rosen
>
> The following two queries should be equivalent, but the second crashes:
> {code}
> sqlContext.read.json(sqlContext.sparkContext.makeRDD(
>     """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
>   .registerTempTable("nestedOrder")
> checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
> checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
> {code}
> Here's the stacktrace:
> {code}
> Cannot resolve column name "a.b" among (b);
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
> 	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
> 	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
> 	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
> 	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
> 	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
> 	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
> 	at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
> 	at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
> {code}
> Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}.
> UPDATE: here's a shorter one-liner reproduction:
> {code}
> val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
> checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

