spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: sc.textFile can't recognize '\004'
Date Sat, 21 Jun 2014 06:22:02 GMT
These are actually Scala / Java questions.

On Sat, Jun 21, 2014 at 1:08 AM, anny9699 <anny9699@gmail.com> wrote:
> 1) One of the separators is '\004', which is recognized by Python, R,
> and Hive. Spark, however, doesn't seem to recognize it and returns a symbol
> that looks like '?'. This symbol is not a question mark, and I don't know
> how to parse it.

(The \004 octal escape syntax appears to be deprecated, but it works;
'\u0004' is the same character.) It's not turned into ?. That is just
how the shell displays non-printing characters.

scala> val c = '\004'
warning: there were 1 deprecation warning(s); re-run with -deprecation
for details
c: Char = ?

scala> c.toInt
res2: Int = 4

This is all correct. Is it actually causing a problem?
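
As a rough sketch of why the character itself is fine as a separator,
the snippet below splits a line on it. The sample line and the names
sep, line and fields are made up for illustration; only the character
code 4 comes from the question.

```scala
// Hypothetical sample record; the field values are made up for illustration.
val sep = '\u0004' // same character as the deprecated '\004' literal
val line = Seq("a", "b", "c").mkString(sep.toString)

// split(Char) splits on the literal character.
val fields = line.split(sep)

// Print the code point rather than the character, so the result does not
// depend on how the shell renders non-printing characters.
println(sep.toInt)
println(fields.mkString(","))
```

The same split should work inside a map over the lines returned by
sc.textFile, assuming the file really uses that character between fields.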

> 2) Some of the separators are composed of several Chars, like "} =>". If I
> use str.split(Array('}', '=>')), it will separate the string but with many
> white spaces included in the middle. Is there a good way to
> separate by String instead of by Array of Chars?

Your example doesn't compile, but I assume the argument should be an
array of the 3 chars. String.split returns an empty string wherever two
separators are adjacent. If you don't want those, you can filter them out:
str.split(...).filterNot(_.isEmpty)
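
To split on the whole String rather than on each Char, one option is to
pass the separator to String.split as a quoted regex. This is only a
sketch: the input string and the names str, byString and byChars are
invented for the example, and it assumes the separator is exactly "} =>".

```scala
import java.util.regex.Pattern

// Hypothetical input; the field values are made up for illustration.
val str = "a} =>b} =>c"

// String.split takes a regex, so quote the separator to match it literally.
val byString = str.split(Pattern.quote("} =>"))

// Splitting on the individual chars instead leaves blank and empty tokens,
// which have to be filtered out afterwards.
val byChars = str.split(Array('}', '=', '>')).filterNot(_.trim.isEmpty)

println(byString.mkString(","))
println(byChars.mkString(","))
```

For this input both approaches end up with the tokens a, b and c, but the
quoted-regex version does not need the extra filtering step.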
