crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Compress and output formats
Date Mon, 14 Sep 2015 02:40:14 GMT
Yeah, more like laziness on the part of whoever wrote the MemPipeline impl.
;)
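
For anyone who wants to check which case they are in, here is a minimal sketch using only the plain JDK (it is not part of Crunch, and the class name GzipCheck is made up for illustration). It reports whether a file starts with the two gzip magic bytes, 0x1f 0x8b, which is enough to tell a real .gz part file from plain text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not a Crunch API: checks a file's first two bytes.
final class GzipCheck {

    /** Returns true if the file begins with the gzip magic bytes 0x1f 0x8b. */
    static boolean isGzip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();  // -1 if the file has fewer than two bytes
            return first == 0x1f && second == 0x8b;
        }
    }

    /** Prints one line per path given on the command line. */
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + ": " + (isGzip(p) ? "gzip" : "not gzip"));
        }
    }
}
```

Pointing it at the out1/part-m-00000.gz file from the MRPipeline run and the out1/out1.txt file from the MemPipeline run should distinguish the two cases described below.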
On Sun, Sep 13, 2015 at 7:17 PM Everett Anderson <everett@nuna.com> wrote:

> On Sun, Sep 13, 2015 at 6:03 PM, Josh Wills <josh.wills@gmail.com> wrote:
>
>>
>>
>> On Sun, Sep 13, 2015 at 10:36 AM, Everett Anderson <everett@nuna.com>
>> wrote:
>>
>>> Hi!
>>>
>>> On Sat, Sep 12, 2015 at 11:15 PM, Josh Wills <josh.wills@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sat, Sep 12, 2015 at 2:35 PM, Everett Anderson <everett@nuna.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've got two basic questions about org.apache.crunch.io.Compress
>>>>> <https://crunch.apache.org/apidocs/0.12.0/index.html?overview-summary.html>
>>>>> .
>>>>>
>>>>> 1) It seems like it should only be used to wrap Targets that are
>>>>> themselves binary file output formats, but org.apache.crunch.io.To
>>>>> only has text, avro, and sequence, none of which seem appropriate. How do
>>>>> people tend to use this? Is there a Hadoop FileOutputFormat that they give
>>>>> to To.formattedFile?
>>>>>
>>>>
>>>> I don't understand the question-- the Compress methods can be used for
>>>> any sort of output format that extends FileOutputFormat, it doesn't matter
>>>> whether it's text/sequence/avro or a custom thing.
>>>>
>>>
>>> I think I may just not understand how it's to be used.
>>>
>>> For example, if you do something like this:
>>>
>>> PCollection<String> data = ...
>>>
>>> Target baseTarget = To.textFile("out1");
>>> Target compressedTarget = Compress.gzip(baseTarget);
>>>
>>> data.write(compressedTarget);
>>>
>>> What is the output file supposed to be? Is it a UTF-8 encoded text file
>>> of Strings, each of which has been passed through gzip?
>>>
>>> I'm actually looking for a way to compress each of the part-* output
>>> files itself, such that they'd be gzip (or lzo) files that contain text.
>>> Does that make sense? Is there an easy wrapper to do that?
>>>
>>
>> I think that what it does now is what you want-- each part-* file is
>> gzipped (or snappied, or whatever). Is that not what seems to be happening
>> when you run it?
>>
>
> Oh! It looks like it does create .gz part files with the MRPipeline, but
> with the MemPipeline, which was what I was using to play around with, it
> just creates a text file.
>
> Example:
>
>     Pipeline pipeline = MemPipeline.getInstance();
>     List<String> dataElements = new ArrayList<>(100);
>     for (int i = 0; i < 100; i++) {
>       dataElements.add("Test data element");
>     }
>
>     PCollection<String> data = pipeline.create(dataElements,
>         Writables.strings());
>
>     Target baseTarget = To.textFile("out1");
>     Target compressedTarget = Compress.gzip(baseTarget);
>     data.write(compressedTarget, Target.WriteMode.OVERWRITE);
>
>     pipeline.done();
>
> Results in an out1/out1.txt file which is just plain text.
>
> Switching to the MRPipeline results in an out1/part-m-00000.gz file which
> is, indeed, a gzip file.
>
> I'm not sure if this is a bug given the MemPipeline is likely only meant
> to be used for unit tests?
>
>
>
>
>>
>>
>>>
>>>
>>>
>>>>
>>>>> 2) The implementation of Compress.gzip is
>>>>>
>>>>>   public static <T extends Target> T gzip(T target) {
>>>>>     return (T) compress(target, GzipCodec.class)
>>>>>         .outputConf(AvroJob.OUTPUT_CODEC,
>>>>>             DataFileConstants.DEFLATE_CODEC);
>>>>>   }
>>>>>
>>>>> Does this mean it can only work with Avro?
>>>>>
>>>>
>>>> No, it's just that Avro has its own built-in support for gzip/snappy
>>>> serialization and it requires some extra conf to enable it. Any other
>>>> output format will just ignore that configuration parameter.
>>>>
>>>
>>> Cool!
>>>
>>>
>>>>
>>>>
>>>>> Thanks!
>>>>>
>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>> may contain information that is confidential, proprietary in nature,
>>>>> protected health information (PHI), or otherwise protected by law from
>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>> disclosure or copying of this email, including any attachments, is
>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>> error, please notify the sender of this email. Please delete this and all
>>>>> copies of this email from your system. Any opinions either expressed or
>>>>> implied in this email and all attachments, are those of its author only,
>>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
