Return-Path: X-Original-To: apmail-crunch-user-archive@www.apache.org Delivered-To: apmail-crunch-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8F01110B72 for ; Thu, 22 Jan 2015 21:05:37 +0000 (UTC) Received: (qmail 23210 invoked by uid 500); 22 Jan 2015 21:05:37 -0000 Delivered-To: apmail-crunch-user-archive@crunch.apache.org Received: (qmail 23161 invoked by uid 500); 22 Jan 2015 21:05:37 -0000 Mailing-List: contact user-help@crunch.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@crunch.apache.org Delivered-To: mailing list user@crunch.apache.org Received: (qmail 23151 invoked by uid 99); 22 Jan 2015 21:05:37 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Jan 2015 21:05:37 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of benjaminmmears@gmail.com designates 74.125.82.46 as permitted sender) Received: from [74.125.82.46] (HELO mail-wg0-f46.google.com) (74.125.82.46) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Jan 2015 21:05:33 +0000 Received: by mail-wg0-f46.google.com with SMTP id l2so4061188wgh.5 for ; Thu, 22 Jan 2015 13:05:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=WhFG2Jird/d66T7PB7oVvorVyJILsS5YE2HLCyF1+9k=; b=EuqnLBd3Em1Cg5n+NI4hQAg/kDkmnvPD7YI/klr12liXpqIzdizRfojrkPgcheXC+f Lil7Yyp8Zy5l/t/2w69HyxGKZs642gFopr4avfr2gkuRcX2AOUeoF1M943M0dqOT2kna WJHq+hOEpkA65jFfjoGHCXzN4WM6pKDeM4cD1x6KZpHqcT4ac0trjDjqo9+jRclNmMcw /j6h0Q2Le3+ttC0Fo6k2MgnVCunfAtPD/bo+9BG6Zb/5rRjVKU3ymxSXPGsnorzLcisk mrEVH3ezQvdlb1ilMykFcwoiBvJJpMsfHkJ/Xr9JLxtiQw/sHFOn2C0D5VRIFmGKu+Ml rFWg== X-Received: by 10.181.9.107 with SMTP id dr11mr9205140wid.22.1421960712073; Thu, 22 Jan 2015 13:05:12 -0800 (PST) MIME-Version: 1.0 Received: by 10.180.84.138 with HTTP; Thu, 22 Jan 2015 13:04:50 -0800 (PST) In-Reply-To: References: From: Benjamin Mears Date: Thu, 22 Jan 2015 13:04:50 -0800 Message-ID: Subject: Re: In memory PCollection for use in MRPipeline To: user@crunch.apache.org Content-Type: multipart/alternative; boundary=001a113602ac20bb95050d4407ee X-Virus-Checked: Checked by ClamAV on apache.org --001a113602ac20bb95050d4407ee Content-Type: text/plain; charset=UTF-8 Great, thanks! -Ben On Thu, Jan 22, 2015 at 10:12 AM, Josh Wills wrote: > The in-memory and Spark versions are pretty easy, the MR one will be a bit > more work. Will track this at > https://issues.apache.org/jira/browse/CRUNCH-489 > > J > > On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears > wrote: > >> Hi Josh, >> >> 1) Yes, having a version that allowed a specification of parallelism >> would be very useful! I had been thinking of using scaleFactor to try to >> force a higher degree of parallelism but not sure if that would have worked >> and being able to explicitly specify the parallelism is much cleaner. >> >> 2) Yes, the difference would be a varargs array vs. an iterable as the >> argument so having the analogous overloaded methods to >> MemPipeline.typedCollectionOf would probably be best (sorry, I didn't >> initially notice typedCollectionOf and collectionOf each had two overloaded >> versions). >> >> Thanks again! >> >> -Ben >> >> >> On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills wrote: >> >>> Hey Ben, >>> >>> Couple of questions: >>> >>> 1) If one potential use case for this was running simulations, wouldn't >>> you want a version of collectionOf that allowed you to specify parallelism, >>> like via NLineFileSource? >>> 2) collectionOf vs. collectionFrom: do you just mean like a varargs >>> array vs. an Iterable as the argument difference here? I also think that >>> whatever version of this I did would have to take a PType so we knew how to >>> serialize the data, so they would look more like typedCollectionOf on >>> MemPipeline. >>> >>> Thanks! >>> J >>> >>> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears < >>> benjaminmmears@gmail.com> wrote: >>> >>>> Hi Josh, >>>> >>>> Thanks for the quick reply! >>>> >>>> For me, I think a useful API would be to have an analogous MRPipeline.collectionOf >>>> and also potentially a method like MRPipeline.collectionFrom that takes in >>>> a Java Iterable and returns a PCollection compatible with MRPipeline. >>>> >>>> -Ben >>>> >>>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills >>>> wrote: >>>> >>>>> Hey Ben, >>>>> >>>>> No easy way to do it right now besides writing the data yourself, >>>>> though that sort of simulation-based use case has been in the back of my >>>>> mind ever since we added the NLineFileSource. What would your ideal API >>>>> look like here? >>>>> >>>>> Thanks, >>>>> J >>>>> >>>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears < >>>>> benjaminmmears@gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I'm trying to write a Crunch job to generate a large amount of >>>>>> simulated data. To kick the job off, I need inputs into a do function. >>>>>> These inputs are essentially dummy values that will be ignored in the do >>>>>> fn. To accomplish this, I'd like to create an inmemory PCollection that >>>>>> can then be passed into a MR pipeline, but if I do this with MemPipeline.collectionOf >>>>>> I get an error: >>>>>> >>>>>> Exception in thread "main" java.lang.IllegalStateException: named 'null' cannot be serialized >>>>>> at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110) >>>>>> at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129) >>>>>> >>>>>> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> -Ben >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Director of Data Science >>>>> Cloudera >>>>> Twitter: @josh_wills >>>>> >>>> >>>> >>> >>> >>> -- >>> Director of Data Science >>> Cloudera >>> Twitter: @josh_wills >>> >> >> > > > -- > Director of Data Science > Cloudera > Twitter: @josh_wills > --001a113602ac20bb95050d4407ee Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Great, thanks!

-Ben

On Thu, Jan 22, 2015 at 1= 0:12 AM, Josh Wills <jwills@cloudera.com> wrote:
The in-memory and Spark versions = are pretty easy, the MR one will be a bit more work. Will track this at=C2= =A0https://issues.apache.org/jira/browse/CRUNCH-489

J

<= div class=3D"gmail_quote">On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears <= span dir=3D"ltr"><benjaminmmears@gmail.com> wrote:
Hi Josh,

1) Yes, having a= version that allowed a specification of parallelism would be very useful!= =C2=A0 I had been thinking of using scaleFactor to try to force a higher de= gree of parallelism but not sure if that would have worked and being able t= o explicitly specify the parallelism is much cleaner.

<= div>2) Yes, the difference would be a varargs array vs. an iterable as the = argument so having the analogous overloaded methods to MemPipeline.typedCol= lectionOf would probably be best (sorry, I didn't initially notice type= dCollectionOf and collectionOf each had two overloaded versions).

Thanks again!

-Ben

<= /div>

On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <jwills@cloudera.com= > wrote:
Hey B= en,

Couple of questions:

1) If = one potential use case for this was running simulations, wouldn't you w= ant a version of collectionOf that allowed you to specify parallelism, like= via NLineFileSource?
2) collectionOf vs. collectionFrom: do you = just mean like a varargs array vs. an Iterable as the argument difference h= ere? I also think that whatever version of this I did would have to take a = PType so we knew how to serialize the data, so they would look more like ty= pedCollectionOf on MemPipeline.

Thanks!
J

On Wed, Jan 21, 2015 at 7:19 PM,= Benjamin Mears <benjaminmmears@gmail.com> wrote:
=
Hi Josh,

Thanks for the quick reply!

For me, I think a use= ful API would be to have an analogous=C2=A0MRPipeline.collectionOf and also potentially a method like M= RPipeline.collectionFrom that takes in a Java Iterable and returns a PColle= ction compatible with MRPipeline.

-Ben

On Wed, Jan 21, 2015 at 11:19 AM, Josh Wil= ls <jwills@cloudera.com> wrote:
Hey Ben,

No easy way to do it r= ight now besides writing the data yourself, though that sort of simulation-= based use case has been in the back of my mind ever since we added the NLin= eFileSource. What would your ideal API look like here?

Thanks,
J<= /div>

On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <<= a href=3D"mailto:benjaminmmears@gmail.com" target=3D"_blank">benjaminmmears= @gmail.com> wrote:
Hi,

I'm trying to write a Crunch job to gen= erate a large amount of simulated data.=C2=A0 To kick the job off, I need i= nputs into a do function.=C2=A0 These inputs are essentially dummy values t= hat will be ignored in the do fn.=C2=A0 To accomplish this, I'd like to= create an inmemory PCollection that can then be passed into a MR pipeline,= but if I do this with=C2=A0MemPipeline.collectionOf I get an error:

Exception in thread "main" java.lang.IllegalStateException:=
  named 'null' cannot be serialized
	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(Mem=
Collection.java:110)
	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollecti=
on.java:129)
Is it possible to explicitly declare/instantiate a PCollection to pa=
ss into an MRPipeline?
Thanks!
-Be=
n



<= font color=3D"#888888">--
Director of Data Science
Twitter: @jos= h_wills




--
=
Director of Data Science
Twitter: @josh_wills




--
=
Director of Data Science
Twitter: @josh_wills

--001a113602ac20bb95050d4407ee--