spark-user mailing list archives

From Enno Shioji <eshi...@gmail.com>
Subject Writing and reading sequence file results in trailing extra data
Date Tue, 30 Dec 2014 10:58:30 GMT
Hi, I'm facing a weird issue. Any help appreciated.

When I execute the code below and compare "input" and "output", each record
in the output has some extra trailing data appended to it and is therefore
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.

I'm using spark-core 1.2.0_2.10 and the Hadoop bundled with it
(hadoop-common: 2.2.0, hadoop-core: 1.2.1). I also confirmed that the binary
is fine at the time it's passed to the Hadoop classes, and that it already
has the extra data while in the Hadoop classes (I guess this makes it more of
a Hadoop question...).

Code:
=====
  import javax.xml.bind.DatatypeConverter
  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._ // implicit conversions used by saveAsSequenceFile in Spark 1.2

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("Simple Application")

    val sc = new SparkContext(conf)

    // input.txt is a text file with some Base64-encoded binaries stored as lines

    val src = sc
      .textFile("input.txt")
      .map(DatatypeConverter.parseBase64Binary)
      .map(x => (NullWritable.get(), new BytesWritable(x)))
      .saveAsSequenceFile("s3n://fake-test/stored")

    val file = "s3n://fake-test/stored"
    val logData = sc.sequenceFile(file, classOf[NullWritable],
      classOf[BytesWritable])

    val count = logData
      .map { case (k, v) => v }
      .map(x => DatatypeConverter.printBase64Binary(x.getBytes))
      .saveAsTextFile("/tmp/output")

  }
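
One thing that may be relevant here, offered as an assumption rather than a confirmed diagnosis: BytesWritable.getBytes returns the whole backing buffer, which can be longer than the record, and only the first getLength bytes are valid. Hadoop grows that buffer with some headroom when reading, so x.getBytes in the read path above could pick up padding. The sketch below is a minimal illustration of that behaviour outside Spark. The object name BytesWritablePadding and the helper validBytes are hypothetical names made up for this example.

```scala
import java.util.Arrays
import org.apache.hadoop.io.BytesWritable

object BytesWritablePadding {
  // Hypothetical helper: copy only the valid bytes out of the
  // (possibly larger) backing buffer.
  def validBytes(w: BytesWritable): Array[Byte] =
    Arrays.copyOfRange(w.getBytes, 0, w.getLength)

  def main(args: Array[String]): Unit = {
    val w = new BytesWritable(Array[Byte](1, 2, 3, 4, 5))

    // Reuse the same object for a shorter record, similar to what a
    // record reader may do between records.
    w.set(Array[Byte](9, 8), 0, 2)

    // getLength is the record size; getBytes.length is the buffer capacity.
    println(s"getLength = ${w.getLength}, backing array length = ${w.getBytes.length}")

    // Only the first getLength bytes belong to the current record.
    println(validBytes(w).mkString(","))
  }
}
```

If this is what is happening, replacing x.getBytes with a copy limited to x.getLength in the second map (for example via a helper like validBytes) would be the thing to try.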
