incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: pipeline writeTextFile spitting out encoded files?
Date Tue, 12 Feb 2013 19:33:49 GMT
On Tue, Feb 12, 2013 at 11:27 AM, Mike Barretta <mike.barretta@gmail.com>wrote:

> Still debugging this on my end, but to answer your questions:
>
> Compression:
> mapred.map.output.compression.codec =
> org.apache.hadoop.io.compress.DefaultCodec
> mapred.compress.map.output = true
> mapred.output.compress = true
> mapred.output.compression.type = BLOCK
>

Yeah, that looks suspect-- let's try with mapred.output.compress = false.


>
> Groovy:
> Rocks! One thing I'd actually like to see/do would be to translate GPars
> parallelization functions (has functions like map, reduce, filter, size,
> sum, min/max, sort, groupBy, combine:
> http://gpars.org/guide/guide/3.%20Data%20Parallelism.html) into Crunch.
>

Very cool-- Scala's Collections API was the inspiration for a lot of
Scrunch's APIs, same sort of principle applies here.


>
>
> On Tue, Feb 12, 2013 at 12:30 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Okay-- how about the cluster compression settings?
>>
>>
>> On Tue, Feb 12, 2013 at 9:17 AM, Mike Barretta <mike.barretta@gmail.com>wrote:
>>
>>> Josh,
>>>
>>> Did the swap, but got the same result.
>>>
>>> Inside that function() is something like:
>>>
>>> emitter.emit([
>>>     it.id
>>>     it.name,
>>>     it.value
>>> ].join("\t"))
>>>
>>> If it isn't obvious, I'm trying to output some HDFS tables containing
>>> serialized objects to TSV files.  Log statements at that emit line show
>>> that list.join spitting out a clear-text string.
>>>
>>> Thanks,
>>> Mike
>>>
>>>
>>>
>>> On Tue, Feb 12, 2013 at 11:46 AM, Josh Wills <jwills@cloudera.com>wrote:
>>>
>>>> I haven't seen that one before-- I'm assuming there's some code in the
>>>> function() (or in the assembler?) to force the obj (which is a Thrift
>>>> record at some point?) to be a string before it gets emitted.
>>>>
>>>> To make sure it's not a Crunch bug, would you mind writing:
>>>>
>>>> crunchPipeline.write(collection, To.textFile("$outputPath/$outputDir"))
>>>>
>>>> in place of writeTextFile? We do some additional checks in
>>>> writeTextFile and I want to be sure we didn't screw something up.
>>>>
>>>> I'm curious about Hadoop cluster version, and the settings for
>>>> compression on your job/cluster (
>>>> https://ccp.cloudera.com/display/CDHDOC/Snappy+Installation for
>>>> references to the parameters) as well. Also, how do you like Groovy as a
>>>> language for writing Crunch pipelines? I haven't used it since the Grails
>>>> days, but I have some friends at SAS who love it.
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Tue, Feb 12, 2013 at 8:33 AM, Mike Barretta <mike.barretta@gmail.com
>>>> > wrote:
>>>>
>>>>> I'm running some simple parallelDos which emit Strings.  When I write
>>>>> the resulting PCollection out using pipeline.writeTextFile(), I see garbled
>>>>> garbage like:
>>>>>
>>>>>
>>>>> 7?%?Ȳx?B?_L?v(ԭ?,??%?o;;??b-s?aaPXI???O??E;u?%k?????Z7??oD?r???e̼rX??/????)??Ƥ?r3l?R-}?+?!*??@!??Q?6?N??=????????v*B?=H??!0?ve??b?d?uZ7??4?H?i??uw??‹)Pxy
>>>>> ?n%?kۣ???v??xaI?wæ??v^?2i?<93\?G?“???N?
>>>>> ??}?/?EG??mK??*?9;vG??Sb?_L??XD?U?M?ݤo?U??c???qwa?q?ԫ.?9??(????H?o?3i|?7Į????B??n?%?\?uxw??Μ???̢??)-?S?su??Ҁ:?????ݹ??#)??V?7?!???????R?>???EZ}v??8ɿ?????ަ%?~?W?pi?|}?/d#??nr?\a?FUh?Yߠ?|sf%M
>>>>>                      v)S??. 4$???
>>>>>
>>>>>  3T???^?*?#I????bҀࡑ???x??%?f?Ў??U???h??,~H?T=O
>>>>>
>>>>>         ??z]JWt?q?B?e2?
>>>>>
>>>>> The code (Groovy - function() is a passed in closure that does the
>>>>> emitter.emit()) looks like:
>>>>>
>>>>> collection.parallelDo(this.class.name + ":" + table, new
>>>>> DoFn<Pair<ColumnKey, ColumnDataArrayWritable>, String>()
{
>>>>>   @Override
>>>>>   void process(Pair<ColumnKey, ColumnDataArrayWritable> input,
>>>>> Emitter<String> emitter) {
>>>>>     input.second().toArray().each {
>>>>>       def obj = assembler.assemble([PetalUtils.toThrift(input.first(),
>>>>> it)])
>>>>>       function(obj, emitter)
>>>>>     }
>>>>>   }
>>>>> }, Writables.strings())
>>>>> crunchPipeline.writeTextFile(collection, "$outputPath/$outputDir")
>>>>>
>>>>> It's worth noting I saw the same output when running plain word count.
>>>>>
>>>>> Is this something that's my fault? Or the cluster, cluster
>>>>> compression, etc?
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
View raw message