crunch-user mailing list archives

From Som Satpathy <somsatpa...@gmail.com>
Subject Re: Writing compressed sequence files
Date Sat, 03 Aug 2013 02:47:42 GMT
    It worked! Thanks Josh, appreciate it.

All I had to do was:

    conf.setBoolean("mapred.output.compress", true);

    conf.set("mapred.output.compression.type", "BLOCK");

    conf.setClass("mapred.output.compression.codec", SnappyCodec.class,
        CompressionCodec.class);


instead of:

    conf.set("mapred.compress.output", "true");

    conf.set("mapred.output.compression.type", "BLOCK");

    conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
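
For anyone who would rather pass these settings on the command line, as Josh suggests further down the thread, here is a minimal sketch that assembles the same three properties as -D options. It uses only the JDK. The class name CompressionArgs and both method names are hypothetical, made up for illustration. It also assumes the job is launched through Hadoop's ToolRunner (or otherwise parses generic options); as far as I know the -D flags are not picked up without that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Crunch or Hadoop.
public class CompressionArgs {

    // Collects the three property names that worked in this thread.
    // Note the key is "mapred.output.compress", not "mapred.compress.output".
    public static Map<String, String> outputCompressionProps(String codecClassName) {
        Map<String, String> props = new LinkedHashMap<>(); // keeps insertion order
        props.put("mapred.output.compress", "true");
        props.put("mapred.output.compression.type", "BLOCK");
        props.put("mapred.output.compression.codec", codecClassName);
        return props;
    }

    // Turns each property into a "-D key=value" pair of arguments.
    public static List<String> toGenericOptions(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("-D");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Map<String, String> props =
            outputCompressionProps("org.apache.hadoop.io.compress.SnappyCodec");
        // Prints the options on one line, ready to paste after "hadoop jar <crunch-job.jar>".
        System.out.println(String.join(" ", toGenericOptions(props)));
    }
}
```

The codec is passed by its fully qualified class name here because a -D option can only carry a string; conf.setClass stores the class name as a string as well, so the two forms should be equivalent.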



On Fri, Aug 2, 2013 at 6:24 PM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Som,
>
> Something seems amiss-- I use this trick in Cloudera ML to handle output
> compression, viz.:
>
>
> https://github.com/cloudera/ml/blob/master/client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java
>
> Can you send me a gist of what you're trying if that doesn't work?
>
> J
>
>
>
> On Fri, Aug 2, 2013 at 5:33 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
>
>> Thanks Josh. I tried setting compression parameters via the Configuration
>> object and also via command line, but the output sequence file never seems
>> to get compressed. I'm trying to Snappy compress it.
>>
>> If I try creating a sequence file outside of Crunch using
>> SequenceFile.createWriter, I see the file getting compressed with my
>> compression type (i.e., Snappy).
>>
>> I was wondering if this is a known issue with Crunch.
>>
>> Thanks,
>> Som
>>
>>
>> On Fri, Aug 2, 2013 at 4:56 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Hey Som,
>>>
>>> The Pipeline object that coordinates the flow has a getConfiguration()
>>> method where you can set any options you might like and they will propagate
>>> to all of your jars.
>>>
>>> I usually implement Hadoop's Tool interface and then specify these
>>> configuration options on the command line so I can play with them
>>> independent of the logic of my runtime, and I end up w/something like:
>>>
>>> hadoop jar <crunch-job.jar> -D mapred.compress.output=true -D
>>> mapred.output.compression.type=block etc.
>>>
>>> I think that having some syntactic sugar for compressing Target objects
>>> (like To.sequenceFile or To.avroFile) would be a nice JIRA.
>>>
>>> J
>>>
>>>
>>> On Fri, Aug 2, 2013 at 3:58 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to write compressed sequence files at the end of my crunch
>>>> pipeline. I'm doing a pipeline.write(mycollection, To.sequenceFile(path))
>>>> for that.
>>>> However, Crunch is writing an uncompressed sequence file by default.
>>>> How do I pass the codec that I want to use to Crunch?
>>>>
>>>> Looking forward to your input.
>>>>
>>>> Thanks,
>>>> Som
>>>>
>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
