crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Barretta <mike.barre...@gmail.com>
Subject Re: pipeline writeTextFile spitting out encoded files?
Date Tue, 12 Feb 2013 17:17:43 GMT
Josh,

Did the swap, but got the same result.

Inside that function() is something like:

emitter.emit([
    it.id
    it.name,
    it.value
].join("\t"))

If it isn't obvious, I'm trying to output some HDFS tables containing
serialized objects to TSV files.  Log statements at that emit line show
that list.join spitting out a clear-text string.

Thanks,
Mike



On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <jwills@cloudera.com> wrote:

> I haven't seen that one before-- I'm assuming there's some code in the
> function() (or in the assembler?) to force the obj (which is a Thrift
> record at some point?) to be a string before it gets emitted.
>
> To make sure it's not a Crunch bug, would you mind writing:
>
> crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>
> in place of writeTextFile? We do some additional checks in writeTextFile
> and I want to be sure we didn't screw something up.
>
> I'm curious about Hadoop cluster version, and the settings for compression
> on your job/cluster (
> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
> references to the parameters) as well. Also, how do you like Groovy as a
> language for writing Crunch pipelines? I haven't used it since the Grails
> days, but I have some friends at SAS who love it.
>
> Josh
>
>
> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <mike.barretta@gmail.com>wrote:
>
>> I'm running some simple parallelDos which emit Strings.  When I write the
>> resulting PCollection out using pipeline.writeTextFile(), I see garbled
>> garbage like:
>>
>>
>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>>                      v)S??. 4$???
>>
>>  3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>
>>     ??z]JWt?q?B?e2?
>>
>> The code (Groovy - function() is a passed in closure that does the
>> emitter.emit()) looks like:
>>
>> collection.parallelDo(this.class.name + ":" + table, new
>> DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>>   @Override
>>   void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>> Emitter<String> emitter) {
>>     input.second().toArray().each {
>>       def obj = assembler.assemble([PetalUtils.toThrift(input.first(),
>> it)])
>>       function(obj, emitter)
>>     }
>>   }
>> }, Writables.strings())
>> crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>
>> It's worth noting I saw the same output when running plain word count.
>>
>> Is this something that's my fault? Or the cluster, cluster compression,
>> etc?
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

Mime
View raw message