Mailing-List: contact user-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@crunch.apache.org
Received-SPF: pass (athena.apache.org: domain of benjaminmmears@gmail.com
 designates 74.125.82.46 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAH29n6O_toBezsSyF3JOg8y+XRg-o8F6fkU0JbJV2GF44pboYA@mail.gmail.com>
References: 
 <CAH0jVuUebwdKnKyTXjtJxB_N+S+ZDuv8by-t24GBUghh6vXO8g@mail.gmail.com>
 <CAH29n6MDZO3Eja215bF=A0x=o9jaoQVxUOx8wXvF8zP1hz1tmQ@mail.gmail.com>
 <CAH0jVuVxv+5f6yv9VJBKtO2xpt7QF+c36Xf3LMHPXENMu6w0-A@mail.gmail.com>
 <CAH29n6PgdPrn=OFfiYXZ26RiOtXHZ-fy1irBp=8_3nzaCNbbzg@mail.gmail.com>
 <CAH0jVuVaL4aFkdG3TjfDBQV7Wdts3WP6yiUkE+9fcQXE3JGwfQ@mail.gmail.com>
 <CAH29n6O_toBezsSyF3JOg8y+XRg-o8F6fkU0JbJV2GF44pboYA@mail.gmail.com>
From: Benjamin Mears <benjaminmmears@gmail.com>
Date: Thu, 22 Jan 2015 13:04:50 -0800
Message-ID: 
 <CAH0jVuUCm4wgwcY8GDL-wug7fDxk6hFqM0Jq2LzFiMCY9i7LRA@mail.gmail.com>
Subject: Re: In memory PCollection for use in MRPipeline
To: user@crunch.apache.org
Content-Type: multipart/alternative; boundary=001a113602ac20bb95050d4407ee

--001a113602ac20bb95050d4407ee
Content-Type: text/plain; charset=UTF-8

Great, thanks!

-Ben

On Thu, Jan 22, 2015 at 10:12 AM, Josh Wills <jwills@cloudera.com> wrote:

> The in-memory and Spark versions are pretty easy; the MR one will be a bit
> more work. Will track this at
> https://issues.apache.org/jira/browse/CRUNCH-489
>
> J
>
> On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears <benjaminmmears@gmail.com>
> wrote:
>
>> Hi Josh,
>>
>> 1) Yes, having a version that allowed a specification of parallelism
>> would be very useful!  I had been thinking of using scaleFactor to try to
>> force a higher degree of parallelism, but I'm not sure that would have
>> worked, and being able to explicitly specify the parallelism is much cleaner.
>>
>> 2) Yes, the difference would be a varargs array vs. an Iterable as the
>> argument, so having overloaded methods analogous to
>> MemPipeline.typedCollectionOf would probably be best (sorry, I didn't
>> initially notice that typedCollectionOf and collectionOf each have two
>> overloaded versions).
>>
>> Thanks again!
>>
>> -Ben
>>
>>
>> On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Hey Ben,
>>>
>>> Couple of questions:
>>>
>>> 1) If one potential use case for this was running simulations, wouldn't
>>> you want a version of collectionOf that allowed you to specify parallelism,
>>> like via NLineFileSource?
>>> 2) collectionOf vs. collectionFrom: do you just mean like a varargs
>>> array vs. an Iterable as the argument difference here? I also think that
>>> whatever version of this I did would have to take a PType so we knew how to
>>> serialize the data, so they would look more like typedCollectionOf on
>>> MemPipeline.
>>>
>>> Thanks!
>>> J
>>>
>>> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <
>>> benjaminmmears@gmail.com> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Thanks for the quick reply!
>>>>
>>>> For me, I think a useful API would be to have an analogous MRPipeline.collectionOf
>>>> and also potentially a method like MRPipeline.collectionFrom that takes in
>>>> a Java Iterable and returns a PCollection compatible with MRPipeline.
>>>>
>>>> -Ben
>>>>
>>>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <jwills@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hey Ben,
>>>>>
>>>>> No easy way to do it right now besides writing the data yourself,
>>>>> though that sort of simulation-based use case has been in the back of my
>>>>> mind ever since we added the NLineFileSource. What would your ideal API
>>>>> look like here?
>>>>>
>>>>> Thanks,
>>>>> J
>>>>>
>>>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <
>>>>> benjaminmmears@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to write a Crunch job to generate a large amount of
>>>>>> simulated data.  To kick the job off, I need inputs into a do function.
>>>>>> These inputs are essentially dummy values that will be ignored in the do
>>>>>> fn.  To accomplish this, I'd like to create an in-memory PCollection that
>>>>>> can then be passed into an MR pipeline, but if I do this with MemPipeline.collectionOf
>>>>>> I get an error:
>>>>>>
>>>>>> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
>>>>>> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>>>>> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>>>>
>>>>>> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -Ben
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
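
A minimal sketch of the "writing the data yourself" workaround mentioned in the thread, assuming the dummy inputs can be plain text lines. The class and method names (SeedFileWriter, writeSeeds) are hypothetical and not part of Crunch; the code uses only the Java standard library and writes to a local path.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Crunch class): writes one dummy value per line
// so the file can later serve as the input of a pipeline.
final class SeedFileWriter {

    private SeedFileWriter() {
    }

    /** Writes numSeeds lines ("0", "1", ...) to target and returns target. */
    static Path writeSeeds(Path target, int numSeeds) throws IOException {
        if (numSeeds < 0) {
            throw new IllegalArgumentException("numSeeds must be non-negative");
        }
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (int i = 0; i < numSeeds; i++) {
                // The value itself is ignored by the DoFn; it only triggers one call.
                out.write(Integer.toString(i));
                out.newLine();
            }
        }
        return target;
    }
}
```

Assuming the file is then placed where the cluster can read it (for example on HDFS), it could be read with Pipeline.readTextFile(path), or wrapped in an NLineFileSource to control how many lines each map task receives, as suggested in the thread. Neither step is shown here.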

--001a113602ac20bb95050d4407ee--