crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: pipeline writeTextFile spitting out encoded files?
Date Tue, 12 Feb 2013 16:46:44 GMT
I haven't seen that one before. I'm assuming there's some code in
function() (or in the assembler?) that forces obj (which is a Thrift
record at some point?) to be a string before it gets emitted.
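Something along these lines is what I'd expect to see somewhere in that
closure. This is only a sketch, and the toString() call is a stand-in for
however your assembler actually renders the record:

// Sketch only: whatever function() does, the value handed to
// emitter.emit() needs to already be a readable String.
def function = { obj, emitter ->
    // obj is (I'm guessing) a Thrift record here; render it explicitly
    // before emitting, e.g. via toString() or a Thrift serializer.
    emitter.emit(obj.toString())
}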

To make sure it's not a Crunch bug, would you mind writing:

crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))

in place of writeTextFile? We do some additional checks in writeTextFile
and I want to be sure we didn't screw something up.
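Spelled out, that's something like the following (just a sketch; it assumes
the same collection and path variables you already have in scope):

import org.apache.crunch.io.To

// Sketch: same output path, but going through write() + To.textFile()
// instead of the writeTextFile() convenience method.
crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
crunchPipeline.done()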

I'm also curious about your Hadoop cluster version and the compression
settings for your job/cluster (
https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation has references
to the parameters). On a different note, how do you like Groovy as a
language for writing Crunch pipelines? I haven't used it since the Grails
days, but I have some friends at SAS who love it.
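If it helps, a quick way to dump the compression-related properties your
job actually resolves is something like this (a sketch; the property names
are the MRv1-era ones referenced in that doc, and it assumes Hadoop's
Configuration is on the classpath):

import org.apache.hadoop.conf.Configuration

// Sketch: print the compression settings the job configuration picks up.
def conf = new Configuration()
['mapred.output.compress',
 'mapred.output.compression.codec',
 'mapred.compress.map.output',
 'mapred.map.output.compression.codec'].each { key ->
    println "$key = ${conf.get(key)}"
}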

Josh


On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <mike.barretta@gmail.com> wrote:

> I'm running some simple parallelDos which emit Strings.  When I write the
> resulting PCollection out using pipeline.writeTextFile(), I see garbled
> garbage like:
>
>
> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>                      v)S??. 4$???
>
>  3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>
>     ??z]JWt?q?B?e2?
>
> The code (Groovy; function() is a passed-in closure that does the
> emitter.emit()) looks like:
>
> collection.parallelDo(this.class.name + ":" + table,
>     new DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>       @Override
>       void process(Pair<ColumnKey, ColumnDataArrayWritable> input, Emitter<String> emitter) {
>         input.second().toArray().each {
>           def obj = assembler.assemble([PetalUtils.toThrift(input.first(), it)])
>           function(obj, emitter)
>         }
>       }
>     }, Writables.strings())
> crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>
> It's worth noting I saw the same output when running plain word count.
>
> Is this something that's my fault? Or the cluster, cluster compression,
> etc?
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
