crunch-user mailing list archives

From Benjamin Mears <benjaminmme...@gmail.com>
Subject Re: In memory PCollection for use in MRPipeline
Date Thu, 22 Jan 2015 05:24:10 GMT
Hi Josh,

1) Yes, a version that lets you specify the parallelism would be very
useful!  I had been thinking of using scaleFactor to try to force a higher
degree of parallelism, but I'm not sure that would have worked, and being
able to specify the parallelism explicitly is much cleaner.

2) Yes, the difference would be a varargs array vs. an Iterable as the
argument, so having overloads analogous to those of
MemPipeline.typedCollectionOf would probably be best (sorry, I didn't
initially notice that typedCollectionOf and collectionOf each have two
overloaded versions).

Thanks again!

-Ben
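
For illustration only, the two overload shapes discussed in point 2 (a
varargs array vs. an Iterable), combined with the explicit parallelism
argument from point 1, could be sketched in plain Java as below. SeedInputs
and split are hypothetical names, not Crunch API, and the round-robin
partitioning is just an assumed stand-in for "parallelism"; a real version
would also take a PType, as Josh notes further down the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Crunch): shows the two overload shapes only. */
final class SeedInputs {
    private SeedInputs() {}

    /** Varargs form, analogous in shape to a collectionOf(T... values) overload. */
    @SafeVarargs
    public static <T> List<List<T>> split(int parallelism, T... values) {
        // A List is an Iterable, so this call resolves to the Iterable overload.
        return split(parallelism, Arrays.asList(values));
    }

    /** Iterable form, analogous in shape to a collectionOf(Iterable<T> values) overload. */
    public static <T> List<List<T>> split(int parallelism, Iterable<T> values) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        // Assumption: distribute the values round-robin over the partitions.
        int index = 0;
        for (T value : values) {
            partitions.get(index % parallelism).add(value);
            index++;
        }
        return partitions;
    }
}
```

Both SeedInputs.split(2, "a", "b", "c") and
SeedInputs.split(2, Arrays.asList("a", "b", "c")) compile without ambiguity,
because the varargs overload is only considered when the argument is not
already an Iterable.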


On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Ben,
>
> Couple of questions:
>
> 1) If one potential use case for this was running simulations, wouldn't
> you want a version of collectionOf that allowed you to specify parallelism,
> like via NLineFileSource?
> 2) collectionOf vs. collectionFrom: do you just mean like a varargs array
> vs. an Iterable as the argument difference here? I also think that whatever
> version of this I did would have to take a PType so we knew how to
> serialize the data, so they would look more like typedCollectionOf on
> MemPipeline.
>
> Thanks!
> J
>
> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <benjaminmmears@gmail.com>
> wrote:
>
>> Hi Josh,
>>
>> Thanks for the quick reply!
>>
>> For me, I think a useful API would be to have an analogous MRPipeline.collectionOf
>> and also potentially a method like MRPipeline.collectionFrom that takes in
>> a Java Iterable and returns a PCollection compatible with MRPipeline.
>>
>> -Ben
>>
>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Hey Ben,
>>>
>>> No easy way to do it right now besides writing the data yourself, though
>>> that sort of simulation-based use case has been in the back of my mind ever
>>> since we added the NLineFileSource. What would your ideal API look like
>>> here?
>>>
>>> Thanks,
>>> J
>>>
>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <
>>> benjaminmmears@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to write a Crunch job to generate a large amount of
>>>> simulated data.  To kick the job off, I need inputs to a DoFn.
>>>> These inputs are essentially dummy values that the DoFn will ignore.
>>>> To accomplish this, I'd like to create an in-memory PCollection that
>>>> can then be passed into an MRPipeline, but if I do this with MemPipeline.collectionOf
>>>> I get an error:
>>>>
>>>> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
>>>> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>>> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>>
>>>> Is it possible to explicitly declare/instantiate a PCollection to pass into
>>>> an MRPipeline?
>>>>
>>>> Thanks!
>>>>
>>>> -Ben
>>>>
>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
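
As a rough sketch of the workaround mentioned in the thread ("writing the
data yourself"), the dummy inputs could be written to a text file, one line
per desired DoFn invocation, and then read back by the pipeline. SeedFile and
its methods are hypothetical names, not Crunch API. The sketch assumes the
file ends up somewhere the MapReduce job can read it, and that the
lines-per-split value is what you would hand to something like
NLineFileSource to control parallelism; check that class's documentation for
its exact constructor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Crunch) for generating a dummy seed file. */
final class SeedFile {
    private SeedFile() {}

    /** One dummy line per input record; the content is meant to be ignored downstream. */
    public static List<String> seedLines(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be >= 0");
        }
        List<String> lines = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            lines.add(Integer.toString(i));
        }
        return lines;
    }

    /** Lines per split so that about `parallelism` splits cover `count` lines (ceiling division). */
    public static int linesPerSplit(int count, int parallelism) {
        if (count < 1 || parallelism < 1) {
            throw new IllegalArgumentException("count and parallelism must be >= 1");
        }
        return (count + parallelism - 1) / parallelism;
    }

    /** Writes the seed lines to `target` as UTF-8 text and returns the path. */
    public static Path write(Path target, int count) {
        try {
            return Files.write(target, seedLines(count), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, 10 seed lines with a target parallelism of 4 gives
linesPerSplit(10, 4) == 3, i.e. four splits of 3, 3, 3 and 1 lines.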
