crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Compress and output formats
Date Mon, 14 Sep 2015 01:03:03 GMT
On Sun, Sep 13, 2015 at 10:36 AM, Everett Anderson <everett@nuna.com> wrote:

> Hi!
>
> On Sat, Sep 12, 2015 at 11:15 PM, Josh Wills <josh.wills@gmail.com> wrote:
>
>>
>>
>> On Sat, Sep 12, 2015 at 2:35 PM, Everett Anderson <everett@nuna.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've got two basic questions about org.apache.crunch.io.Compress
>>> <https://crunch.apache.org/apidocs/0.12.0/index.html?overview-summary.html>.
>>>
>>> 1) It seems like it should only be used to wrap Targets that are
>>> themselves binary file output formats, but org.apache.crunch.io.To only
>>> has text, avro, and sequence, none of which seem appropriate. How do people
>>> tend to use this? Is there a Hadoop FileOutputFormat that they give to
>>> To.formattedFile?
>>>
>>
>> I don't quite follow the question-- the Compress methods can be used with
>> any output format that extends FileOutputFormat; it doesn't matter
>> whether it's text/sequence/avro or a custom thing.
>>
>
> I think I may just not understand how it's meant to be used.
>
> For example, if you do something like this:
>
> PCollection<String> data = ...
>
> Target baseTarget = To.textFile("out1");
> Target compressedTarget = Compress.gzip(baseTarget);
>
> data.write(compressedTarget);
>
> What is the output file supposed to be? Is it a UTF-8 encoded text file of
> Strings, each of which has been passed through gzip?
>
> I'm actually looking for a way to compress each part-* output file itself,
> such that they'd be gzip (or lzo) files containing text. Does that make
> sense? Is there an easy wrapper to do that?
>

I think what it does now is exactly what you want-- each part-* file is
gzipped (or snappied, or whatever). Is that not what you're seeing when
you run it?

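To make that concrete, here's a rough sketch of the kind of pipeline I have in
mind (untested; the class name and the "in"/"out1" paths are just placeholders):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.Compress;
import org.apache.crunch.io.To;

public class GzipTextOutput {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(GzipTextOutput.class);
    PCollection<String> lines = pipeline.readTextFile("in");

    // Wrapping the text target tells the underlying FileOutputFormat to run
    // each part-* file through the gzip codec, so "out1" should end up with
    // part-* files that are gzip-compressed text.
    Target target = Compress.gzip(To.textFile("out1"));
    lines.write(target);

    pipeline.done();
  }
}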

>
>
>
>>
>>> 2) The implementation of Compress.gzip is
>>>
>>>   public static <T extends Target> T gzip(T target) {
>>>     return (T) compress(target, GzipCodec.class)
>>>         .outputConf(AvroJob.OUTPUT_CODEC, DataFileConstants.DEFLATE_CODEC);
>>>   }
>>>
>>> Does this mean it can only work with Avro?
>>>
>>
>> No, it's just that Avro has its own built-in support for gzip/snappy
>> serialization and it requires some extra conf to enable it. Any other
>> output format will just ignore that configuration parameter.
>>
>
> Cool!
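
For the Avro question specifically, the wrapping looks the same; the extra conf
is the only Avro-specific piece. A rough sketch (again untested; it assumes the
same Pipeline object as above, and "in"/"out-avro" are placeholder paths):

import org.apache.crunch.PCollection;
import org.apache.crunch.io.Compress;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

// Read the lines with an Avro PType so the collection can be written to an
// Avro target.
PCollection<String> records = pipeline.read(From.textFile("in", Avros.strings()));

// On an Avro target, Compress.gzip also sets AvroJob.OUTPUT_CODEC (to
// deflate), which the Avro container-file writer honors; non-Avro targets
// just ignore that extra key.
records.write(Compress.gzip(To.avroFile("out-avro")));
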
>
>
>>
>>
>>> Thanks!
>>>
>>
>>
>>
>
>
