spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-23723) New charset option for json datasource
Date Sat, 17 Mar 2018 13:35:00 GMT


Apache Spark reassigned SPARK-23723:

    Assignee: Apache Spark

> New charset option for json datasource
> --------------------------------------
>                 Key: SPARK-23723
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maxim Gekk
>            Assignee: Apache Spark
>            Priority: Major
> Currently, the JSON reader can read JSON files in different charsets/encodings. It relies
on the jackson-json library to automatically detect the charset of the input text/stream.
The method that detects the encoding can be seen here: []
> The detectEncoding method checks for a BOM ([]) at the beginning of the text. A BOM may
be present in the file, but it is not mandatory. If it is absent, the auto-detection
mechanism can select the wrong charset, and as a consequence the user cannot read the JSON
file. *The proposed option will allow users to bypass the auto-detection mechanism and set
the charset explicitly.*
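
The failure mode described above can be illustrated with a small standalone sketch. This is a simplified model of BOM-based detection written with the Python standard library only; it is not the Jackson implementation and not Spark code, and the names `detect_bom_charset` and `decode_json_bytes` are hypothetical helpers introduced for illustration. It assumes detection falls back to UTF-8 when no BOM is found, which is an assumption of this sketch.

```python
import codecs
from typing import Optional

# BOM prefixes mapped to Python codec names. The UTF-32 entries come first
# because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom_charset(data: bytes) -> Optional[str]:
    """Return the charset implied by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def decode_json_bytes(data: bytes, charset: Optional[str] = None,
                      fallback: str = "utf-8") -> str:
    """Decode raw JSON bytes.

    If `charset` is given, detection is bypassed entirely (the behaviour the
    proposed option is meant to provide). Otherwise the BOM decides, and the
    assumed fallback is used when no BOM is present.
    """
    if charset is None:
        charset = detect_bom_charset(data) or fallback
    # Drop a decoded BOM character if the codec left one in the text.
    return data.decode(charset).lstrip("\ufeff")


if __name__ == "__main__":
    payload = '{"city": "Zürich"}'
    no_bom = payload.encode("utf-16-le")  # valid UTF-16 LE, but no BOM

    print(detect_bom_charset(no_bom))  # None: nothing to detect
    try:
        decode_json_bytes(no_bom)  # falls back to UTF-8
    except UnicodeDecodeError as exc:
        print("auto-detection path failed:", exc.reason)
    print(decode_json_bytes(no_bom, charset="utf-16-le"))  # explicit charset works
```

With a BOM-less UTF-16 file, the detection path has nothing to go on, while passing the charset explicitly decodes the same bytes correctly.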
> The charset option is already exposed as a CSV option: [] .
I propose adding the same option for JSON.
> Regarding the JSON writer, *the charset option will give the user the opportunity* to
read JSON files in a charset other than UTF-8, modify the dataset, and *write the results
back to JSON files in the original encoding.* At the moment this is not possible because
the result can be saved in UTF-8 only.
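
The read, modify, and write-back round trip that the writer-side option is meant to enable can be sketched outside Spark as follows. This is a minimal standard-library illustration under stated assumptions, not the proposed Spark API: `rewrite_json_lines` is a hypothetical helper, the input is assumed to be line-delimited JSON with one object per line, and the charset is supplied by the caller rather than detected.

```python
import json
from typing import Callable


def rewrite_json_lines(data: bytes, charset: str,
                       transform: Callable[[dict], dict]) -> bytes:
    """Decode line-delimited JSON, transform each record, re-encode in the same charset."""
    text = data.decode(charset)
    out_lines = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines, including the trailing one
        record = transform(json.loads(line))
        # ensure_ascii=False keeps non-ASCII characters as-is instead of \u escapes,
        # so the output really is written in the target charset.
        out_lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in out_lines).encode(charset)


if __name__ == "__main__":
    original = '{"name": "Zoë", "visits": 1}\n'.encode("utf-16-le")
    updated = rewrite_json_lines(
        original, "utf-16-le",
        lambda rec: {**rec, "visits": rec["visits"] + 1},
    )
    print(updated.decode("utf-16-le"), end="")
```

The point of the sketch is only that the same charset is used on both the decode and the encode side, so the output stays in the file's original encoding rather than being forced to UTF-8.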

This message was sent by Atlassian JIRA
