spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Szymon Matejczyk (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
Date Sat, 09 Apr 2016 17:28:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233648#comment-15233648
] 

Szymon Matejczyk edited comment on SPARK-13802 at 4/9/16 5:27 PM:
------------------------------------------------------------------

"This shows that the schema names don't have to correspond to the row's names." IIUC, in Row
only the order of fields matters. And Row should be treated like Tuple, not dict. Then, the
Row constructor variant that uses **kwargs is pointless as it sorts fields by names.

```
  def __new__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if args:
            # create row class or objects
            return tuple.__new__(self, args)

        elif kwargs:
            # create row objects
            names = sorted(kwargs.keys())
            row = tuple.__new__(self, [kwargs[n] for n in names])
            row.__fields__ = names
            return row
```

Please correct me if I'm wrong.


was (Author: szymonm):
"This shows that the schema names don't have to correspond to the row's names." IIUC, in Row
only the order of fields matters. And Row should be treated like Tuple, not dict. Then, the
Row constructor variant that uses **kwargs is pointless as it sorts fields by names.

```
  def __new__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if args:
            # create row class or objects
            return tuple.__new__(self, args)

        elif kwargs:
            # create row objects
            names = sorted(kwargs.keys())
            row = tuple.__new__(self, [kwargs[n] for n in names])
            row.__fields__ = names
            return row
```

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13802
>                 URL: https://issues.apache.org/jira/browse/SPARK-13802
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are sorted by
name. When Schema is reading the row, it is not using the fields in this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("id", StringType()),
>     StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message