crunch-user mailing list archives

From Benjamin Mears <benjaminmme...@gmail.com>
Subject Re: In memory PCollection for use in MRPipeline
Date Thu, 22 Jan 2015 03:19:47 GMT
Hi Josh,

Thanks for the quick reply!

I think a useful API would be an MRPipeline.collectionOf analogous to
MemPipeline.collectionOf, and potentially also a method like
MRPipeline.collectionFrom that takes a Java Iterable and returns a
PCollection compatible with MRPipeline.
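
In the meantime, the "write the data yourself" route from the quoted reply
below can be sketched roughly as follows. This is only a sketch under
assumptions, not an existing Crunch feature: the class name SeedFile, the
method writeSeeds, and the seed count are hypothetical names chosen for
illustration. The runnable part uses only the JDK and writes one dummy line
per seed to a text file. The Crunch calls are left in a comment because they
need Crunch and Hadoop on the classpath, and their exact signatures should be
checked against the Crunch version in use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SeedFile {

    // Hypothetical helper: writes numSeeds dummy lines ("0", "1", ...) to a
    // temporary text file and returns its path. The DoFn is expected to
    // ignore the line contents; each line only triggers one process() call.
    static Path writeSeeds(int numSeeds) throws IOException {
        List<String> lines = new ArrayList<>(numSeeds);
        for (int i = 0; i < numSeeds; i++) {
            lines.add(Integer.toString(i));
        }
        Path seedFile = Files.createTempFile("crunch-seeds", ".txt");
        Files.write(seedFile, lines, StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        Path seeds = writeSeeds(1000);
        System.out.println("Wrote seed file: " + seeds);

        // Assumed follow-up with Crunch on the classpath (not compiled here):
        //
        //   Pipeline pipeline = new MRPipeline(MyJob.class, conf);
        //   PCollection<String> input = pipeline.readTextFile(seeds.toString());
        //
        // or, to spread the seed lines across more map tasks:
        //
        //   pipeline.read(new NLineFileSource<String>(
        //       seeds.toString(), Writables.strings(), linesPerTask));
        //
        // On a real cluster the file must be on a filesystem the tasks can
        // read (for example HDFS), not a local temp directory.
    }
}
```

The seed values themselves carry no meaning here; the number of lines simply
controls how many times the DoFn is invoked.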

-Ben

On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Ben,
>
> No easy way to do it right now besides writing the data yourself, though
> that sort of simulation-based use case has been in the back of my mind ever
> since we added the NLineFileSource. What would your ideal API look like
> here?
>
> Thanks,
> J
>
> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <benjaminmmears@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to write a Crunch job to generate a large amount of simulated
>> data. To kick the job off, I need inputs to a DoFn. These inputs are
>> essentially dummy values that the DoFn will ignore. To accomplish this,
>> I'd like to create an in-memory PCollection that can then be passed into
>> an MRPipeline, but if I do this with MemPipeline.collectionOf I get an
>> error:
>>
>> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
>> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>
>> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline?
>>
>> Thanks!
>>
>> -Ben
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
