incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Barretta <mike.barre...@gmail.com>
Subject Re: pipeline writeTextFile spitting out encoded files?
Date Tue, 12 Feb 2013 19:27:27 GMT
Still debugging this on my end, but to answer your questions:

Compression:
mapred.map.output.compression.codec =
org.apache.hadoop.io.compress.DefaultCodec
mapred.compress.map.output = true
mapred.output.compress = true
mapred.output.compression.type = BLOCK

Groovy:
Rocks! One thing I'd actually like to see/do would be to translate GPars
parallelization functions (has functions like map, reduce, filter, size,
sum, min/max, sort, groupBy, combine:
http://gpars.org/guide/guide/3.%20Data%20Parallelism.html) into Crunch.


On Tue, Feb 12, 2013 at 12:30 PM, Josh Wills <jwills@cloudera.com> wrote:

> Okay-- how about the cluster compression settings?
>
>
> On Tue, Feb 12, 2013 at 9:17 AM, Mike Barretta <mike.barretta@gmail.com>wrote:
>
>> Josh,
>>
>> Did the swap, but got the same result.
>>
>> Inside that function() is something like:
>>
>> emitter.emit([
>>     it.id
>>     it.name,
>>     it.value
>> ].join("\t"))
>>
>> If it isn't obvious, I'm trying to output some HDFS tables containing
>> serialized objects to TSV files.  Log statements at that emit line show
>> that list.join spitting out a clear-text string.
>>
>> Thanks,
>> Mike
>>
>>
>>
>> On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> I haven't seen that one before-- I'm assuming there's some code in the
>>> function() (or in the assembler?) to force the obj (which is a Thrift
>>> record at some point?) to be a string before it gets emitted.
>>>
>>> To make sure it's not a Crunch bug, would you mind writing:
>>>
>>> crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>>>
>>> in place of writeTextFile? We do some additional checks in writeTextFile
>>> and I want to be sure we didn't screw something up.
>>>
>>> I'm curious about Hadoop cluster version, and the settings for
>>> compression on your job/cluster (
>>> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
>>> references to the parameters) as well. Also, how do you like Groovy as a
>>> language for writing Crunch pipelines? I haven't used it since the Grails
>>> days, but I have some friends at SAS who love it.
>>>
>>> Josh
>>>
>>>
>>> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <mike.barretta@gmail.com>wrote:
>>>
>>>> I'm running some simple parallelDos which emit Strings.  When I write
>>>> the resulting PCollection out using pipeline.writeTextFile(), I see garbled
>>>> garbage like:
>>>>
>>>>
>>>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>>>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>>>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>>>>                      v)S??. 4$???
>>>>
>>>>  3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>>>
>>>>       ??z]JWt?q?B?e2?
>>>>
>>>> The code (Groovy - function() is a passed in closure that does the
>>>> emitter.emit()) looks like:
>>>>
>>>> collection.parallelDo(this.class.name + ":" + table, new
>>>> DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>() {
>>>>   @Override
>>>>   void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>>>> Emitter<String> emitter) {
>>>>     input.second().toArray().each {
>>>>       def obj = assembler.assemble([PetalUtils.toThrift(input.first(),
>>>> it)])
>>>>       function(obj, emitter)
>>>>     }
>>>>   }
>>>> }, Writables.strings())
>>>> crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>>>
>>>> It's worth noting I saw the same output when running plain word count.
>>>>
>>>> Is this something that's my fault? Or the cluster, cluster compression,
>>>> etc?
>>>>
>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

Mime
View raw message