spark-issues mailing list archives

From "Harry Brundage (JIRA)" <>
Subject [jira] [Commented] (SPARK-1849) sc.textFile does not support non UTF-8 encodings
Date Thu, 30 Oct 2014 19:12:35 GMT


Harry Brundage commented on SPARK-1849:

What do you guys think about adding an `sc.bytesFile` or `sc.charFile` or something like that,
which returns byte arrays that the user can then manipulate to try to recover the data? I
think it's also worth documenting the fact that `textFile` uses `io.Text` underneath,
which assumes UTF-8 and makes decisions for you about how to recover from badly encoded data.
I really think that this is an important issue to alert users about: Hadoop is first and foremost
used for churning through web logs from users visiting web pages, and users and web pages
produce dirtily encoded data.
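
As a rough illustration of the kind of recovery a user could attempt once they have the raw bytes, here is a minimal sketch using only the JDK. The object and method names (`RecoverEncoding`, `decodeWithFallback`) are hypothetical, and the choice of ISO-8859-1 as the fallback is an assumption for the example, not something Spark or Hadoop provides.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object RecoverEncoding {
  // Hypothetical helper: decode as strict UTF-8, and only if the bytes are
  // malformed fall back to another charset (ISO-8859-1 is an assumed default).
  def decodeWithFallback(bytes: Array[Byte],
                         fallback: Charset = StandardCharsets.ISO_8859_1): String = {
    // REPORT makes the decoder throw instead of silently inserting U+FFFD.
    val strictUtf8 = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try strictUtf8.decode(ByteBuffer.wrap(bytes)).toString
    catch {
      case _: CharacterCodingException => new String(bytes, fallback)
    }
  }
}
```

Valid UTF-8 input passes through unchanged, while a line that is not valid UTF-8 is reinterpreted in the fallback charset rather than losing its original bytes.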

And Mridul, it's Spark which chooses to call `toString` on the returned `Text` objects in `sc.textFile`,
not the `TextInputFormat`. It is `Text.toString()` which assumes UTF-8, but Spark-land deals
with the `Text` objects, which still have the underlying bytes available, so it's completely within
Spark's control to return broken strings. I do get wanting to follow the Hadoop precedent, however,
which is to unburden the user from needing to care about encodings. I also bet that
upgrading Spark to find that it all of a sudden breaks on a bunch of your data would really
anger a lot of people, so I figure changing `textFile`'s behaviour is outside the realm of
possibility for version 1, and I assume little old me wouldn't be able to change the whole
Hadoop community's thoughts on the matter anyway.
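
Since the `Text` objects still carry their bytes, one way to sidestep `toString` would be to decode those bytes with an explicitly chosen charset. The sketch below is JDK-only so it can run on its own; `LineDecoding` and `decodeLine` are hypothetical names, and the Spark wiring in the trailing comment is an untested assumption about how it might be used. It relies on `Text.getBytes` returning a backing array that is only valid up to `Text.getLength`.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object LineDecoding {
  // Hypothetical helper: decode only the first `length` bytes of a buffer,
  // mirroring how a Text's backing array is valid only up to getLength.
  def decodeLine(buffer: Array[Byte], length: Int, charset: Charset): String =
    new String(buffer, 0, length, charset)
}

// Assumed Spark usage (untested sketch, needs Spark and Hadoop on the classpath):
//   sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
//     .map { case (_, text) =>
//       LineDecoding.decodeLine(text.getBytes, text.getLength, StandardCharsets.ISO_8859_1)
//     }
```

Decoding inside the `map` also means each line is turned into an immutable `String` before Hadoop has a chance to reuse the underlying `Text` buffer for the next record.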

> sc.textFile does not support non UTF-8 encodings
> ------------------------------------------------
>                 Key: SPARK-1849
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Harry Brundage
>         Attachments: encoding_test
> I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via
{{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks
like {{HadoopRDD}} uses {{}} on all the data it ever reads,
which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character,
\uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
> {code}
> scala> sc.textFile(path).collect()(0)
> res8: String = ?pple
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
> res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
> res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
> {code}
> In the above example, the first two snippets show the string representation and byte
representation of the example line of text. The string shows a question mark for the replacement
character and the bytes reveal the replacement character has been swapped in by {{Text.toString}}.
The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which
comes back from hadoop land: we get the real bytes in the file out.
> Now, I think this is a bug, though you may disagree. The text inside my file is perfectly
valid ISO-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into
UTF-8, because I want my application to be smart like that. I think Spark should give me the
raw broken string so I can re-encode, but I can't get at the original bytes in order to guess
at what the source encoding might be, as they have already been replaced. I'm dealing with
data from some CDN access logs which are, to put it nicely, diversely encoded, but I think this
is a use case Spark should fully support. So, my suggested fix, on which I'd like some guidance, is
to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding.
> Further compounding this issue is that my application is actually in PySpark, but we
can talk about how bytes fly through to Scala land after this if we agree that this is an
issue at all. 
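
The byte values quoted in the issue can be reproduced outside Spark with plain JVM strings. This is only a sketch of the same effect using `java.lang.String` with lenient UTF-8 decoding; it assumes, based on the output shown in the issue, that `Text.toString` substitutes U+FFFD in the same way. `ReplacementDemo` is a hypothetical name.

```scala
import java.nio.charset.StandardCharsets

object ReplacementDemo {
  // Raw bytes from the issue's third snippet: 0xC4 followed by "pple".
  val raw: Array[Byte] = Array[Byte](-60, 112, 112, 108, 101)

  // Lenient UTF-8 decoding replaces the malformed byte 0xC4 with U+FFFD.
  val lossy: String = new String(raw, StandardCharsets.UTF_8)

  // Re-encoding the lossy string yields EF BF BD for U+FFFD; the original
  // 0xC4 byte can no longer be recovered from it.
  val reencoded: Array[Byte] = lossy.getBytes(StandardCharsets.UTF_8)

  // Decoding the same raw bytes as ISO-8859-1 keeps the intended character,
  // which can then be re-encoded as valid UTF-8 (C3 84 for U+00C4).
  val latin1: String = new String(raw, StandardCharsets.ISO_8859_1)
  val asUtf8: Array[Byte] = latin1.getBytes(StandardCharsets.UTF_8)
}
```

The first three bytes of `reencoded` are the -17, -65, -67 seen in the second snippet of the issue, which is why the source encoding cannot be guessed after the string has been built.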
