Date: Sat, 4 Jan 2014 22:43:09 -0500
Subject: Re: crunch: correct way to think about tuple abstractions for aggregations?
From: Jay Vyas <jayunit100@gmail.com>
To: user@crunch.apache.org

BTW, thanks Josh! That worked!

Here is an example of how easy it is to do aggregations in Crunch :)

https://github.com/jayunit100/bigpetstore/commit/03a59fc88680d8926aba4c8d00760436c8cafb69

PS: Are you sure Pig/Hive is really better for this kind of stuff? I like the IDE-friendly, statically validated, strongly typed, functional API a lot more than the Russian roulette I always seem to play with my Pig/Hive code :)

On Sat, Jan 4, 2014 at 7:49 PM, Jay Vyas wrote:

> Thanks, Josh, that was very helpful! I like the Avro mapper
> intermediate solution; I'll try it out.
>
> Also: I'd be interested in contributing a new "section" of the
> bigpetstore workflow, a module that really shows where Crunch's
> differentiating factors are valuable.
>
> The idea is that bigpetstore should show the differences between
> ecosystem components so that people can pick for themselves which tool
> is best for which job. So I think it would be cool to have a phase in
> the bigpetstore workflow that uses some nested, strongly typed data
> and processes it with Crunch versus Pig, to demonstrate (in code) the
> comments you've made.
>
> Right now I only have Pig and Hive, but I want to add Cascading and
> (obviously) Crunch as well.
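[Editor's note: for readers without access to the linked commit, here is a plain-Java sketch of the group-and-count aggregation being discussed. It uses java.util.stream as a stand-in for the Crunch pipeline, and the CSV field layout is assumed from the sample records later in this thread; it is not the actual committed code.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupCountSketch {

    // Count records per store code. Field index 1 of each CSV line is
    // assumed to hold the store code (e.g. "storeCode_AK").
    public static Map<String, Long> countByStoreCode(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",")[1])
                // TreeMap keeps the output deterministically sorted by key.
                .collect(Collectors.groupingBy(
                        code -> code, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "BigPetStore,storeCode_CA,1,brandon,ewing",
                "BigPetStore,storeCode_CA,2,angie,coleman",
                "BigPetStore,storeCode_AK,1,lindsay,franco");
        System.out.println(countByStoreCode(lines));
        // prints {storeCode_AK=1, storeCode_CA=2}
    }
}
```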
> On Jan 4, 2014, at 4:57 PM, Josh Wills wrote:
>
> Hey Jay,
>
> Crunch isn't big into tuples; it's mostly used to process some sort of
> structured, complex record data like Avro, protocol buffers, or Thrift. I
> certainly don't speak for everyone in the community, but I think that
> using one of these rich, evolvable formats is the best way to work with
> data on Hadoop. For the problem you gave, where the data is in CSV text,
> there are a couple of options.
>
> One option would be to use the TupleN type to represent a record and the
> Extractor API in crunch-contrib to parse the lines of strings into typed
> tokens, so you would do something like this to your PCollection<String>:
>
> PCollection<String> rawData = ...;
> TokenizerFactory tokenize = TokenizerFactory.builder().delim(",").build();
> PCollection<TupleN> tuples = Parse.parse(
>     "bigpetshop", // a name to use for the counters used in parsing
>     rawData,
>     xtupleN(tokenize,
>       xstring(),   // big pet store
>       xstring(),   // store code
>       xint(),      // line item
>       xstring(),   // first name
>       xstring(),   // last name
>       xstring(),   // timestamp
>       xdouble(),   // price
>       xstring())); // item description
>
> You could also create a POJO to represent a LineItem (which is what I
> assume this is) and then use Avro reflection-based serialization to
> serialize it with Crunch:
>
> public static class LineItem {
>   String appName;
>   String storeCode;
>   int lineId;
>   String firstName;
>   String lastName;
>   String timestamp;
>   double price;
>   String description;
>
>   public LineItem() {
>     // Avro reflection needs a zero-arg constructor
>   }
>
>   // other constructors, parsers, etc.
> }
>
> and then you would have something like this:
>
> PCollection<LineItem> lineItems = rawData.parallelDo(
>     new MapFn<String, LineItem>() {
>       @Override
>       public LineItem map(String input) {
>         // parse line to LineItem object
>       }
>     }, Avros.reflects(LineItem.class));
>
> I'm not quite sure what you're doing in the grouping clause you have
> here:
>
> groupBy(0).count();
>
> ...I assume you want to count the distinct values of the first field in
> your tuple, which you would do like this for line items:
>
> PTable<String, Long> counts = lineItems.parallelDo(
>     new MapFn<LineItem, String>() {
>       public String map(LineItem lineItem) { return lineItem.appName; }
>     }, Avros.strings()).count();
>
> and similarly for TupleN, although you would call get(0) on TupleN and
> have to cast the returned Object to a String because TupleN methods
> don't have type information.
>
> I hope that helps. In general, I don't really recommend Crunch for this
> sort of data processing; Hive, Pig, and Cascading are fine alternatives.
> But I think Crunch is superior to any of them if you were trying to,
> say, create an Order record that aggregated the results of multiple
> LineItems:
>
> Order {
>   List<LineItem> lineItems;
>   // global order attributes
> }
>
> or a Customer type that aggregated multiple Orders for a single
> customer:
>
> Customer {
>   List<Order> orders;
>   // other customer fields
> }
>
> ...especially if this was the sort of processing task you had to do
> regularly, because lots of other downstream processing tasks required
> these standard aggregations to exist so that they could do their own
> calculations. I would also recommend Crunch if you were building
> BigPetStore on top of HBase using custom schemas that you needed to
> periodically MapReduce over in order to calculate statistics, clean up
> stale data, or fix any consistency issues.
>
> Best,
> Josh
>
>
> On Sat, Jan 4, 2014 at 12:34 PM, Jay Vyas <jayunit100@gmail.com> wrote:
>
>> Hi Crunch!
>>
>> I want to process a list in Crunch.
>>
>> Something like this:
>>
>> PCollection<String> lines = MemPipeline.collectionOf(
>>     "BigPetStore,storeCode_AK,1  lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
>>     "BigPetStore,storeCode_AZ,1  tom,giles,Sun Dec 28 23:08:45 EST 1969,10.5,dog-food",
>>     "BigPetStore,storeCode_CA,1  brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
>>     "BigPetStore,storeCode_CA,2  angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
>>     "BigPetStore,storeCode_CA,3  angie,coleman,Tue Jan 20 06:24:23 EST 1970,7.5,cat-food",
>>     "BigPetStore,storeCode_CO,1  sharon,trevino,Mon Jan 12 07:52:10 EST 1970,30.1,antelope snacks",
>>     "BigPetStore,storeCode_CT,1  kevin,fitzpatrick,Wed Dec 10 05:24:13 EST 1969,10.5,dog-food",
>>     "BigPetStore,storeCode_NY,1  dale,holden,Mon Jan 12 23:02:13 EST 1970,19.75,fish-food",
>>     "BigPetStore,storeCode_NY,2  dale,holden,Tue Dec 30 12:29:52 EST 1969,10.5,dog-food",
>>     "BigPetStore,storeCode_OK,1  donnie,tucker,Sun Jan 18 04:50:26 EST 1970,7.5,cat-food");
>>
>> PCollection coll = lines.parallelDo(
>>     "split lines into words",
>>     new DoFn<String, String>() {
>>       @Override
>>       public void process(String line, Emitter emitter) {
>>         // not sure this regex will work but you get the idea...
>>         // split by tabs and commas
>>         emitter.emit(Arrays.asList(line.split("\t,")));
>>       }
>>     },
>>     Writables.lists()
>> ).groupBy(0).count();
>>
>> What is the correct abstraction in Crunch to convert raw text into
>> tuples and access them by an index, which you then use to group and
>> count on?
>>
>> Thanks!
>>
>> ** FYI ** This is for the bigpetstore project; I'd like to show Crunch
>> examples in it if I can get them working, as the API is a nice example
>> of a lower-level MapReduce paradigm that is more Java-friendly.
>>
>> See https://issues.apache.org/jira/browse/BIGTOP-1089 and
>> https://github.com/jayunit100/bigpetstore for details.
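[Editor's note: as the quoted comment suspects, `line.split("\t,")` does not split "by tabs and commas"; it splits only where a tab is immediately followed by a comma, so most records come back as a single token. A plain-Java sketch of a corrected tokenizer follows; the sample record layout is assumed from the thread, and this stands outside any Crunch pipeline.]

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {

    // Split on a comma OR a run of tabs, so "storeCode_AK,1\tlindsay"
    // yields three tokens. By contrast, split("\t,") only splits on the
    // literal two-character sequence tab-then-comma.
    public static List<String> tokens(String line) {
        return Arrays.asList(line.split(",|\t+"));
    }

    public static void main(String[] args) {
        String line = "BigPetStore,storeCode_AK,1\tlindsay,franco,10.5,dog-food";
        System.out.println(tokens(line));
        // prints [BigPetStore, storeCode_AK, 1, lindsay, franco, 10.5, dog-food]
    }
}
```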
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills

--
Jay Vyas
http://jayunit100.blogspot.com