crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Writing compressed sequence files
Date Sat, 03 Aug 2013 01:24:43 GMT
Hey Som,

Something seems amiss-- I use this trick in Cloudera ML to handle output
compression, viz.:

https://github.com/cloudera/ml/blob/master/client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java

Can you send me a gist of what you're trying if that doesn't work?
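
In case it helps to compare against what you have, here is a rough sketch of the settings I would expect to need for Snappy block compression. The class and method names are made up for illustration. The property names are the older mapred.* ones that Hadoop's FileOutputFormat and SequenceFileOutputFormat read, and I am assuming the Snappy native libraries are installed on the cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Crunch or Hadoop: it only collects the
// output-compression properties in one place so they can be applied to a
// Configuration or rendered as -D flags.
public class OutputCompressionSettings {

    // Older (mapred.*) property names for compressed job output.
    public static Map<String, String> snappyBlockSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        // Turn output compression on.
        settings.put("mapred.output.compress", "true");
        // Parsed as a SequenceFile.CompressionType enum name: NONE, RECORD, or BLOCK.
        settings.put("mapred.output.compression.type", "BLOCK");
        // Fully qualified codec class name.
        settings.put("mapred.output.compression.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        return settings;
    }

    // Applies each property through any (key, value) setter,
    // for example a Hadoop Configuration's set(String, String) method.
    public static void applyTo(BiConsumer<String, String> setter) {
        snappyBlockSettings().forEach(setter);
    }

    // Renders the same properties as generic -D options for the command line.
    public static String asCommandLineFlags() {
        return snappyBlockSettings().entrySet().stream()
                .map(e -> "-D " + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(asCommandLineFlags());
    }
}
```

With Crunch on the classpath, something like OutputCompressionSettings.applyTo(pipeline.getConfiguration()::set) before the pipeline.write(...) call should have the same effect as passing the flags on the command line, provided the job goes through ToolRunner so the -D options are actually parsed.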

J



On Fri, Aug 2, 2013 at 5:33 PM, Som Satpathy <somsatpathy@gmail.com> wrote:

> Thanks Josh. I tried setting compression parameters via the Configuration
> object and also via command line, but the output sequence file never seems
> to get compressed. I'm trying to Snappy compress it.
>
> If I try creating a sequence file outside of Crunch using
> SequenceFile.createWriter, I see the file getting compressed with my
> compression type (i.e. Snappy).
>
> I was wondering if this is a known issue with Crunch.
>
> Thanks,
> Som
>
>
> On Fri, Aug 2, 2013 at 4:56 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Som,
>>
>> The Pipeline object that coordinates the flow has a getConfiguration()
>> method where you can set any options you might like and they will propagate
>> to all of your jars.
>>
>> I usually implement Hadoop's Tool interface and then specify these
>> configuration options on the command line so I can play with them
>> independent of the logic of my runtime, and I end up w/something like:
>>
>> hadoop jar <crunch-job.jar> -D mapred.output.compress=true -D
>> mapred.output.compression.type=BLOCK etc.
>>
>> I think that having some syntactic sugar for compressing Target objects
>> (like To.sequenceFile or To.avroFile) would be a nice JIRA.
>>
>> J
>>
>>
>> On Fri, Aug 2, 2013 at 3:58 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to write compressed sequence files at the end of my crunch
>>> pipeline. I'm doing a pipeline.write(mycollection, To.sequenceFile(path))
>>> for that.
>>> However, Crunch is writing an uncompressed sequence file by default. How
>>> do I pass the codec that I want to use to Crunch?
>>>
>>> Looking forward to your input.
>>>
>>> Thanks,
>>> Som
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
