crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Barretta <mike.barre...@gmail.com>
Subject pipeline writeTextFile spitting out encoded files?
Date Tue, 12 Feb 2013 16:33:15 GMT
I'm running some simple parallelDos which emit Strings.  When I write the
resulting PCollection out using pipeline.writeTextFile(), I see garbled
garbage like:

7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
                     v)S??.4$???

 3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O

  ??z]JWt?q?B?e2?

The code (Groovy - function() is a passed in closure that does the
emitter.emit()) looks like:

collection.parallelDo(this.class.name + ":" + table, new
DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
  @Override
  void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
Emitter<String> emitter) {
    input.second().toArray().each {
      def obj = assembler.assemble([PetalUtils.toThrift(input.first(), it)])
      function(obj, emitter)
    }
  }
}, Writables.strings())
crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")

It's worth noting I saw the same output when running plain word count.

Is this something that's my fault? Or the cluster, cluster compression, etc?

Mime
View raw message