crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: In memory PCollection for use in MRPipeline
Date Thu, 22 Jan 2015 04:58:37 GMT
Hey Ben,

Couple of questions:

1) If one potential use case for this was running simulations, wouldn't you
want a version of collectionOf that allowed you to specify parallelism,
like via NLineFileSource?
2) collectionOf vs. collectionFrom: do you just mean the difference between
a varargs array and an Iterable as the argument? I also think that whatever
version of this I wrote would have to take a PType so that we know how to
serialize the data, so they would look more like typedCollectionOf on
MemPipeline.

Thanks!
J
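
[Editor's sketch] The two API shapes discussed in point 2 above (a varargs form and an Iterable form, both taking something that knows how to serialize the elements) can be illustrated in plain Java. This is only a sketch under stated assumptions: Serializer and SerializedCollection are hypothetical stand-ins for Crunch's PType and PCollection, typedCollectionFrom is a made-up name, and nothing here is an actual or proposed MRPipeline method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; none of these types are part of Crunch. */
class CollectionFactorySketch {

  /** Stand-in for a PType: knows how to turn one element into bytes. */
  interface Serializer<T> {
    byte[] toBytes(T value);
  }

  /** Stand-in for a PCollection whose elements were serialized up front. */
  static final class SerializedCollection<T> {
    private final List<byte[]> records;

    SerializedCollection(List<byte[]> records) {
      this.records = records;
    }

    int size() {
      return records.size();
    }
  }

  /** Varargs form: convenient for a handful of literal values. */
  @SafeVarargs
  static <T> SerializedCollection<T> typedCollectionOf(Serializer<T> serializer, T... values) {
    return typedCollectionFrom(serializer, Arrays.asList(values));
  }

  /** Iterable form: accepts any existing Java collection or generator. */
  static <T> SerializedCollection<T> typedCollectionFrom(Serializer<T> serializer, Iterable<T> values) {
    List<byte[]> records = new ArrayList<>();
    for (T value : values) {
      // Serializing eagerly is what makes the data shippable to remote tasks.
      records.add(serializer.toBytes(value));
    }
    return new SerializedCollection<>(records);
  }
}
```

The only difference between the two entry points is how the caller supplies the values; the serializer argument is what lets the data leave the client JVM at all.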

On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <benjaminmmears@gmail.com>
wrote:

> Hi Josh,
>
> Thanks for the quick reply!
>
> For me, I think a useful API would be to have an analogous MRPipeline.collectionOf
> and also potentially a method like MRPipeline.collectionFrom that takes in
> a Java Iterable and returns a PCollection compatible with MRPipeline.
>
> -Ben
>
> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Ben,
>>
>> No easy way to do it right now besides writing the data yourself, though
>> that sort of simulation-based use case has been in the back of my mind ever
>> since we added the NLineFileSource. What would your ideal API look like
>> here?
>>
>> Thanks,
>> J
>>
>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <benjaminmmears@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to write a Crunch job to generate a large amount of simulated
>>> data. To kick the job off, I need inputs into a do function. These inputs
>>> are essentially dummy values that will be ignored in the do fn. To
>>> accomplish this, I'd like to create an in-memory PCollection that can then
>>> be passed into an MR pipeline, but if I do this with MemPipeline.collectionOf
>>> I get an error:
>>>
>>> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
>>> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>
>>> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline?
>>>
>>> Thanks!
>>>
>>> -Ben
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
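
[Editor's sketch] The workaround mentioned earlier in the thread, writing the data yourself, can be sketched with the Java standard library alone: write one dummy seed per line to a text file, then read that file as the pipeline input. The class and method names below (SeedFileWriter, seedLines, writeSeeds, expectedTasks) are hypothetical, and the task-count arithmetic assumes that the input is split into one task per fixed number of lines, which is the kind of parallelism control NLineFileSource is mentioned for above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that writes dummy seed lines for a simulation job. */
class SeedFileWriter {

  /** Builds the placeholder values "0", "1", ..., one per future input record. */
  static List<String> seedLines(int numSeeds) {
    if (numSeeds < 0) {
      throw new IllegalArgumentException("numSeeds must be non-negative");
    }
    List<String> lines = new ArrayList<>(numSeeds);
    for (int i = 0; i < numSeeds; i++) {
      lines.add(Integer.toString(i));
    }
    return lines;
  }

  /** Writes the seed lines to a local text file, one value per line. */
  static void writeSeeds(Path target, int numSeeds) throws IOException {
    Files.write(target, seedLines(numSeeds), StandardCharsets.UTF_8);
  }

  /** Ceiling division: number of tasks if each task receives linesPerTask lines. */
  static int expectedTasks(int numSeeds, int linesPerTask) {
    if (linesPerTask <= 0) {
      throw new IllegalArgumentException("linesPerTask must be positive");
    }
    return (numSeeds + linesPerTask - 1) / linesPerTask;
  }

  public static void main(String[] args) throws IOException {
    Path seeds = Files.createTempFile("seeds", ".txt");
    writeSeeds(seeds, 1000);
    System.out.println("Wrote seeds to " + seeds + ", expected tasks: " + expectedTasks(1000, 100));
    // Assumption: with Crunch on the classpath and the file copied to the
    // cluster file system, it could be read as the job input, for example
    // through NLineFileSource with a String PType and 100 lines per task.
  }
}
```

Because the DoFn ignores the seed values, only the number of lines and the lines-per-task setting matter; together they determine how many parallel tasks generate simulated data.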
