From: Everett Anderson
To: user@crunch.apache.org
Date: Sun, 13 Sep 2015 19:17:40 -0700
Subject: Re: Compress and output formats

On Sun, Sep 13, 2015 at 6:03 PM, Josh Wills wrote:

> On Sun, Sep 13, 2015 at 10:36 AM, Everett Anderson wrote:
>
>> Hi!
>>
>> On Sat, Sep 12, 2015 at 11:15 PM, Josh Wills wrote:
>>
>>> On Sat, Sep 12, 2015 at 2:35 PM, Everett Anderson wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got two basic questions about org.apache.crunch.io.Compress.
>>>>
>>>> 1) It seems like it should only be used to wrap Targets that are
>>>> themselves binary file output formats, but org.apache.crunch.io.To
>>>> only has text, avro, and sequence, none of which seem appropriate. How do
>>>> people tend to use this? Is there a Hadoop FileOutputFormat that they give
>>>> to To.formattedFile?
>>>
>>> I don't understand the question-- the Compress methods can be used for
>>> any sort of output format that extends FileOutputFormat, it doesn't matter
>>> whether it's text/sequence/avro or a custom thing.
>>
>> I think I may just not understand how it's to be used.
>>
>> For example, if you do something like this:
>>
>>     PCollection<String> data = ...
>>
>>     Target baseTarget = To.textFile("out1");
>>     Target compressedTarget = Compress.gzip(baseTarget);
>>
>>     data.write(compressedTarget);
>>
>> What is the output file supposed to be? Is it a UTF-8 encoded text file
>> of Strings, each of which has been passed through gzip?
>>
>> I'm actually looking for a way to compress each of the part-* output
>> files itself, such that they'd be gzip (or lzo) files that contain text.
>> Does that make sense? Is there an easy wrapper to do that?
>
> I think that what it does now is what you want-- each part-* file is
> gzipped (or snappied, or whatever). Is that not what seems to be happening
> when you run it?

Oh! It looks like it does create .gz part files with the MRPipeline, but
with the MemPipeline, which is what I was using to play around with, it
just creates a text file.

Example:

    Pipeline pipeline = MemPipeline.getInstance();
    List<String> dataElements = new ArrayList<>(100);
    for (int i = 0; i < 100; i++) {
      dataElements.add("Test data element");
    }

    PCollection<String> data = pipeline.create(dataElements, Writables.strings());

    Target baseTarget = To.textFile("out1");
    Target compressedTarget = Compress.gzip(baseTarget);
    data.write(compressedTarget, Target.WriteMode.OVERWRITE);

    pipeline.done();

This results in an out1/out1.txt file which is just plain text.

Switching to the MRPipeline results in an out1/part-m-00000.gz file which
is, indeed, a gzip file.

I'm not sure whether this is a bug, given that the MemPipeline is likely
only meant to be used for unit tests?

>>>> 2) The implementation of Compress.gzip is
>>>>
>>>>   public static <T extends Target> T gzip(T target) {
>>>>     return (T) compress(target, GzipCodec.class)
>>>>         .outputConf(AvroJob.OUTPUT_CODEC,
>>>>             DataFileConstants.DEFLATE_CODEC);
>>>>   }
>>>>
>>>> Does this mean it can only work with Avro?
>>>
>>> No, it's just that Avro has its own built-in support for gzip/snappy
>>> serialization and it requires some extra conf to enable it. Any other
>>> output format will just ignore that configuration parameter.
>>
>> Cool!
>>
>>>> Thanks!
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
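A quick way to tell the two outputs apart without relying on the file extension is to look for the gzip magic number: every gzip stream starts with the bytes 0x1f 0x8b. The sketch below uses only the JDK, not Crunch or Hadoop; the class name GzipMagicCheck and its methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Crunch: checks for the gzip magic number.
public class GzipMagicCheck {

    /** Returns true if the bytes start with the gzip magic number 0x1f 0x8b. */
    public static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    /** Reads the first two bytes of a file and checks them for the gzip magic number. */
    public static boolean looksLikeGzip(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();
            return first == 0x1f && second == 0x8b;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses a byte array into a single gzip stream (used here only to build sample input). */
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Pass one or more file paths, e.g. the part files of an output directory.
        for (String arg : args) {
            String verdict = looksLikeGzip(Paths.get(arg)) ? "gzip" : "not gzip";
            System.out.println(arg + ": " + verdict);
        }
    }
}
```

Running it with out1/out1.txt and out1/part-m-00000.gz as arguments would be one way to confirm the difference between the MemPipeline and MRPipeline outputs described above.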


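On the question of whether a gzipped text target holds individually compressed strings: going by the reply in this thread, each part file is compressed as a whole. The JDK-only sketch below models that shape, with newline-terminated UTF-8 lines written into one gzip stream and read back line by line. It does not use Crunch or Hadoop's output formats, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical model of a gzipped text part file; not Crunch or Hadoop code.
public class GzippedTextPart {

    /** Writes all lines as newline-terminated UTF-8 text into a single gzip stream. */
    public static byte[] writeLines(List<String> lines) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(buf), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Decompresses the whole stream and splits the text back into lines. */
    public static List<String> readLines(byte[] gzipped) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Same sample data as the MemPipeline example: 100 identical strings.
        List<String> data = Collections.nCopies(100, "Test data element");
        byte[] part = writeLines(data);
        System.out.println("gzipped bytes: " + part.length);
        System.out.println("round trip ok: " + readLines(part).equals(data));
    }
}
```

Compressing the file as one stream lets the codec exploit repetition across lines, which per-string compression could not do.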