spark-issues mailing list archives

From "HanCheol Cho (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
Date Fri, 14 Apr 2017 07:05:41 GMT
HanCheol Cho created SPARK-20336:
------------------------------------

             Summary: spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
                 Key: SPARK-20336
                 URL: https://issues.apache.org/jira/browse/SPARK-20336
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
         Environment: Spark 2.2.0 (master branch downloaded from GitHub), PySpark
            Reporter: HanCheol Cho


I used the spark.read.csv() method with the wholeFile=True option to load data that contains
multi-line records.
However, non-ASCII characters are not loaded correctly.

The following sample data was used for the test:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트
5,e,last
{code}
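
To reproduce the report, the sample file has to exist on disk as UTF-8. The following is a minimal sketch, not part of the original report: the helper names write_sample and is_valid_utf8 are hypothetical, and the rows are copied exactly as shown above. It writes the file and checks that its bytes decode strictly as UTF-8, which helps rule out a badly encoded input file before looking at Spark itself.

```python
import os
import tempfile

# Rows copied verbatim from the sample data in the report.
SAMPLE_CSV = (
    "col1,col2,col3\n"
    "1,a,text\n"
    "2,b,テキスト\n"
    "3,c,텍스트\n"
    '4,d,"text\n'
    "テキスト\n"
    "텍스트\n"
    "5,e,last\n"
)


def write_sample(path):
    """Write the sample rows to `path` as UTF-8 (hypothetical helper)."""
    # newline="" keeps the "\n" line endings unchanged on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(SAMPLE_CSV)


def is_valid_utf8(path):
    """Return True if every byte of the file decodes as strict UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # "test.encoding.csv" is the file name used in the report.
    target = os.path.join(tempfile.mkdtemp(), "test.encoding.csv")
    write_sample(target)
    print(target, "valid UTF-8:", is_valid_utf8(target))
```

If is_valid_utf8 returns True for the file that Spark reads, the broken characters are more likely introduced during reading than present in the input.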

When the file is loaded without the wholeFile=True option, non-ASCII characters are displayed
correctly, but the multi-line record is split into separate rows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트|null|null|
|   5|   e|last|
+----+----+----+
{code}

When the wholeFile=True option is used, non-ASCII characters are garbled:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
testdf_wholefile.show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|        ������������|
|   3|   c|           ���������|
|   4|   d|text
������������...|
+----+----+--------------------+
{code}

The result is the same even when encoding="UTF-8" is specified together with wholeFile=True.
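
One observation that may help triage: in the wholeFile=True output, the 4-character value テキスト appears as 12 replacement characters and the 3-character value 텍스트 as 9, which matches their UTF-8 byte lengths. The sketch below is plain Python, not Spark code, and the function name decode_as_ascii is hypothetical. It shows that decoding UTF-8 bytes as ASCII with replacement gives exactly one U+FFFD per byte. This is only a pattern consistent with the output; the report does not confirm it as the cause.

```python
def decode_as_ascii(text):
    """Encode `text` as UTF-8, then decode the bytes as ASCII.

    Every byte outside the ASCII range becomes U+FFFD, the replacement
    character. Hypothetical helper for illustration only.
    """
    raw = text.encode("utf-8")
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    for value in ("text", "テキスト", "텍스트"):
        garbled = decode_as_ascii(value)
        # Print the UTF-8 byte length next to the garbled length.
        print(len(value.encode("utf-8")), len(garbled), garbled)
```

If this pattern holds, checking the default charset of the JVM that runs the read would be a reasonable next step.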
