crunch-user mailing list archives

From Everett Anderson <ever...@nuna.com>
Subject Re: Compress and output formats
Date Mon, 14 Sep 2015 02:17:40 GMT
On Sun, Sep 13, 2015 at 6:03 PM, Josh Wills <josh.wills@gmail.com> wrote:

>
>
> On Sun, Sep 13, 2015 at 10:36 AM, Everett Anderson <everett@nuna.com>
> wrote:
>
>> Hi!
>>
>> On Sat, Sep 12, 2015 at 11:15 PM, Josh Wills <josh.wills@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Sat, Sep 12, 2015 at 2:35 PM, Everett Anderson <everett@nuna.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got two basic questions about org.apache.crunch.io.Compress
>>>> <https://crunch.apache.org/apidocs/0.12.0/index.html?overview-summary.html>
>>>> .
>>>>
>>>> 1) It seems like it should only be used to wrap Targets that are
>>>> themselves binary file output formats, but org.apache.crunch.io.To
>>>> only has text, avro, and sequence, none of which seem appropriate. How do
>>>> people tend to use this? Is there a Hadoop FileOutputFormat that they give
>>>> to To.formattedFile?
>>>>
>>>
>>> I don't understand the question-- the Compress methods can be used for
>>> any sort of output format that extends FileOutputFormat, it doesn't matter
>>> whether it's text/sequence/avro or a custom thing.
>>>
>>
>> I think I may just not understand how it's to be used.
>>
>> For example, if you do something like this:
>>
>> PCollection<String> data = ...
>>
>> Target baseTarget = To.textFile("out1");
>> Target compressedTarget = Compress.gzip(baseTarget);
>>
>> data.write(compressedTarget);
>>
>> What is the output file supposed to be? Is it a UTF-8 encoded text file
>> of Strings, each of which has been passed through gzip?
>>
>> I'm actually looking for a way to compress each of the part-* output
>> files itself, such that they'd be gzip (or lzo) files that contain text.
>> Does that make sense? Is there an easy wrapper to do that?
>>
>
> I think that what it does now is what you want-- each part-* file is
> gzipped (or snappied, or whatever). Is that not what seems to be happening
> when you run it?
>

Oh! It looks like it does create .gz part files with the MRPipeline, but
with the MemPipeline, which was what I was using to play around with, it
just creates a text file.

Example:

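    import java.util.ArrayList;
    import java.util.List;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.Target;
    import org.apache.crunch.impl.mem.MemPipeline;
    import org.apache.crunch.io.Compress;
    import org.apache.crunch.io.To;
    import org.apache.crunch.types.writable.Writables;

    // In-memory pipeline, which is what I was experimenting with.
    Pipeline pipeline = MemPipeline.getInstance();
    List<String> dataElements = new ArrayList<>(100);
    for (int i = 0; i < 100; i++) {
      dataElements.add("Test data element");
    }

    PCollection<String> data = pipeline.create(dataElements, Writables.strings());

    // Wrap the text target so its output should be gzip-compressed.
    Target baseTarget = To.textFile("out1");
    Target compressedTarget = Compress.gzip(baseTarget);
    data.write(compressedTarget, Target.WriteMode.OVERWRITE);

    pipeline.done();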

This results in an out1/out1.txt file, which is just plain text.

Switching to the MRPipeline results in an out1/part-m-00000.gz file, which
is indeed a gzip file.

I'm not sure if this counts as a bug, given that the MemPipeline is likely
only meant to be used for unit tests?
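
In case it helps when comparing the two pipelines, one way to check a part
file without trusting its extension is to look for the gzip magic number
(0x1f 0x8b) at the start of the file. Below is a minimal sketch using only the
JDK. The class name GzipCheck and the temp files are hypothetical, made up for
illustration, and are not part of Crunch:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not a Crunch class.
final class GzipCheck {
  // Every gzip stream starts with these two bytes (RFC 1952).
  private static final int MAGIC_0 = 0x1f;
  private static final int MAGIC_1 = 0x8b;

  /** Returns true if the file begins with the gzip magic number. */
  static boolean isGzip(Path file) throws IOException {
    try (InputStream in = Files.newInputStream(file)) {
      // read() returns -1 at end of file, so short files yield false.
      return in.read() == MAGIC_0 && in.read() == MAGIC_1;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] line = "Test data element\n".getBytes(StandardCharsets.UTF_8);
    Path plain = Files.createTempFile("part", ".txt");
    Path gzipped = Files.createTempFile("part", ".gz");
    try {
      // One plain-text file and one gzip file with the same content.
      Files.write(plain, line);
      try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzipped))) {
        out.write(line);
      }
      System.out.println("plain is gzip:   " + isGzip(plain));
      System.out.println("gzipped is gzip: " + isGzip(gzipped));
    } finally {
      Files.deleteIfExists(plain);
      Files.deleteIfExists(gzipped);
    }
  }
}
```

Pointing isGzip at out1/out1.txt and out1/part-m-00000.gz should show the
same difference I described above, assuming both are on the local filesystem.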




>
>
>>
>>
>>
>>>
>>>> 2) The implementation of Compress.gzip is
>>>>
>>>>   public static <T extends Target> T gzip(T target) {
>>>>     return (T) compress(target, GzipCodec.class)
>>>>         .outputConf(AvroJob.OUTPUT_CODEC, DataFileConstants.DEFLATE_CODEC);
>>>>   }
>>>>
>>>> Does this mean it can only work with Avro?
>>>>
>>>
>>> No, it's just that Avro has its own built-in support for gzip/snappy
>>> serialization and it requires some extra conf to enable it. Any other
>>> output format will just ignore that configuration parameter.
>>>
>>
>> Cool!
>>
>>
>>>
>>>
>>>> Thanks!
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>>
>>>
>>
>>
>
>
