spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-21107) Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8
Date Fri, 07 Jul 2017 04:00:03 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-21107.
----------------------------------
    Resolution: Invalid

I can't follow what this report describes.

{code}
>>> u'L\xc3\xa0' == u"Là"
False
{code}

These are different.
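Concretely, u'L\xc3\xa0' is three code points ('L', 'Ã', and a no-break space), while u"Là" is two ('L' and 'à'), so the equality check above must fail. A quick demonstration (written for Python 3, where the escapes mean the same thing inside a text literal):

```python
# u'L\xc3\xa0' is three code points: 'L', U+00C3 ('Ã'), U+00A0 (no-break space).
# u'L\xe0' (i.e. u"Là") is two code points: 'L', U+00E0 ('à').
a = u'L\xc3\xa0'
b = u'L\xe0'  # same string as u"Là"
print([hex(ord(c)) for c in a])  # ['0x4c', '0xc3', '0xa0']
print([hex(ord(c)) for c in b])  # ['0x4c', '0xe0']
print(a == b)                    # False
```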

It should be:

{code}
>>> df = sc.parallelize([[1,2],[1,4],[2,5],[2,6]]).toDF([u"L\xe0",u"Here"])
>>> df.select('L\xc3\xa0').show()
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+

>>> df.select(u"Là").show()
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+

>>> df.select(u"L\xe0").show()
+---+
| Là|
+---+
|  1|
|  1|
|  2|
|  2|
+---+
{code}
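The reason the plain byte string 'L\xc3\xa0' also resolves in Python 2 is, presumably, that it is exactly the UTF-8 encoding of u'L\xe0' ("Là"), so it names the same column once decoded. A sketch of that relationship (using an explicit bytes literal, since Python 3's str is already unicode):

```python
# b'L\xc3\xa0' is the UTF-8 encoding of u'L\xe0' ("Là"):
# 'à' (U+00E0) encodes to the two bytes 0xC3 0xA0.
name = u'L\xe0'
encoded = name.encode('utf-8')
print(encoded)                          # b'L\xc3\xa0'
print(encoded.decode('utf-8') == name)  # True
```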

> Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8
> ------------------------------------------------------------------
>
>                 Key: SPARK-21107
>                 URL: https://issues.apache.org/jira/browse/SPARK-21107
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>         Environment: Windows 7 standalone
>            Reporter: Tavis Barr
>            Priority: Minor
>
> When I create a column name with ISO-8859-1 (or possibly, I suspect, other non-UTF-8) characters in it, they are sometimes converted to UTF-8, sometimes not.
> Examples:
> >>> df = sc.parallelize([[1,2],[1,4],[2,5],[2,6]]).toDF([u"L\xe0",u"Here"])
> >>> df.show()
> +---+----+
> | Là|Here|
> +---+----+
> |  1|   2|
> |  1|   4|
> |  2|   5|
> |  2|   6|
> +---+----+
> >>> df.columns
> ['L\xc3\xa0', 'Here']
> >>> df.select(u'L\xc3\xa0').show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 992, in select
>     jdf = self._jdf.select(self._jcols(*cols))
>   File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
>   File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"cannot resolve '`L\xc3\xa0`' given input columns: [L\xe0, Here];;\n'Project ['L\xc3\xa0]\n+- LogicalRDD [L\xe0#14L, Here#15L]\n"
> >>> df.select(u'L\xe0').show()
> +---+
> | Là|
> +---+
> |  1|
> |  1|
> |  2|
> |  2|
> +---+
> >>> df.select(u'L\xe0').collect()[0].asDict()
> {'L\xc3\xa0': 1}
> This does not seem to affect the Scala version:
> scala> val df = sc.parallelize(Seq((1,2),(1,4),(2,5),(2,6))).toDF("L\u00e0","Here")
> df: org.apache.spark.sql.DataFrame = [Là: int, Here: int]
> scala> df.select("L\u00e0").show()
> [...output elided..]
> +---+
> | Là|
> +---+
> |  1|
> |  1|
> |  2|
> |  2|
> +---+
> scala> df.columns(0).map(c => c.toInt )
> res8: scala.collection.immutable.IndexedSeq[Int] = Vector(76, 224)
> [Note that 224 is \u00e0, i.e., the original value]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

