spark-issues mailing list archives

From "Michael Armbrust (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
Date Fri, 14 Aug 2015 03:06:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696378#comment-14696378 ]

Michael Armbrust commented on SPARK-8670:
-----------------------------------------

I think that all DataFrame methods that take identifiers as strings should behave consistently.
 We changed Scala so that they are "mostly quoted": spaces and other characters behave as
though the identifier were in backticks in SQL, but dots are an exception and need to be double
 escaped (i.e. {{df["structColumn.`field.with.dots`"]}}). The rationale is that using dots to go into
a struct or to qualify an attribute is more common than column names with dots in them.
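
For illustration, a minimal sketch of that convention in PySpark, assuming a DataFrame with a struct column named {{structColumn}} whose fields include one literally named {{field.with.dots}} (all names here are hypothetical):

{code}
# Plain dots descend into a struct, the common case:
df["structColumn.someField"]

# A field whose name itself contains dots must be wrapped in backticks
# ("double escaped"), just as it would be in a SQL statement:
df["structColumn.`field.with.dots`"]
{code}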

> Nested columns can't be referenced (but they can be selected)
> -------------------------------------------------------------
>
>                 Key: SPARK-8670
>                 URL: https://issues.apache.org/jira/browse/SPARK-8670
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0
>            Reporter: Nicholas Chammas
>            Assignee: Wenchen Fan
>            Priority: Blocker
>
> This is strange and looks like a regression from 1.3.
> {code}
> import json
> daterz = [
>   {
>     'name': 'Nick',
>     'stats': {
>       'age': 28
>     }
>   },
>   {
>     'name': 'George',
>     'stats': {
>       'age': 31
>     }
>   }
> ]
> df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
> df.select('stats.age').show()
> df['stats.age']  # 1.4 fails on this line
> {code}
> On 1.3 this works and yields:
> {code}
> age
> 28 
> 31 
> Out[1]: Column<stats.age AS age#2958L>
> {code}
> On 1.4, however, this gives an error on the last line:
> {code}
> +---+
> |age|
> +---+
> | 28|
> | 31|
> +---+
> ---------------------------------------------------------------------------
> IndexError                                Traceback (most recent call last)
> <ipython-input-1-04bd990e94c6> in <module>()
>      19 
>      20 df.select('stats.age').show()
> ---> 21 df['stats.age']
> /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
>     678         if isinstance(item, basestring):
>     679             if item not in self.columns:
> --> 680                 raise IndexError("no such column: %s" % item)
>     681             jc = self._jdf.apply(item)
>     682             return Column(jc)
> IndexError: no such column: stats.age
> {code}
> This means, among other things, that you can't join DataFrames on nested columns.
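
As a concrete illustration of that last point, a minimal sketch built on the {{df}} from the snippet above, assuming a second DataFrame {{ages}} with a flat {{age}} column (names hypothetical):

{code}
# df['stats.age'] raises IndexError in 1.4 (see traceback above), so a join
# keyed on the nested column fails the same way:
df.join(ages, df['stats.age'] == ages['age'])

# Possible workaround: index into the struct Column instead of passing a
# dotted string to DataFrame.__getitem__:
df.join(ages, df['stats']['age'] == ages['age'])
# equivalently: df['stats'].getField('age') == ages['age']
{code}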




