crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: In memory PCollection for use in MRPipeline
Date Wed, 21 Jan 2015 19:19:26 GMT
Hey Ben,

No easy way to do it right now besides writing the data yourself, though
that sort of simulation-based use case has been in the back of my mind ever
since we added the NLineFileSource. What would your ideal API look like
here?
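
One way to read "writing the data yourself" is to generate a small seed file with one dummy line per desired input and have the pipeline read that file instead of a MemPipeline collection. The sketch below uses only the Java standard library; the class name SeedFileWriter, the method writeSeeds, and the temp-file location are hypothetical names chosen for illustration, not part of Crunch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Crunch class): writes dummy "seed" lines to a file.
public class SeedFileWriter {

    // Writes numSeeds lines ("0", "1", ...) to target, one per line, and returns target.
    public static Path writeSeeds(Path target, int numSeeds) throws IOException {
        if (numSeeds < 0) {
            throw new IllegalArgumentException("numSeeds must be non-negative");
        }
        List<String> lines = new ArrayList<>(numSeeds);
        for (int i = 0; i < numSeeds; i++) {
            lines.add(Integer.toString(i));
        }
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a local temp file stands in for wherever the job reads its input.
        Path seeds = Files.createTempFile("crunch-seeds", ".txt");
        writeSeeds(seeds, 1000);
        System.out.println("Wrote " + Files.readAllLines(seeds).size()
                + " seed lines to " + seeds);
        // Not shown (assumption): copy the file to the cluster filesystem and read it
        // from the MRPipeline, for example as a text file or through NLineFileSource.
    }
}
```

Each line becomes one dummy input to the DoFn, so the number of lines (and, with an N-line source, the number of lines per task) controls how the simulation work is spread out. For a real job the file would need to live on a filesystem the cluster can read, such as HDFS.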

Thanks,
J

On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <benjaminmmears@gmail.com>
wrote:

> Hi,
>
> I'm trying to write a Crunch job to generate a large amount of simulated
> data. To kick the job off, I need inputs to a DoFn. These inputs
> are essentially dummy values that the DoFn will ignore. To
> accomplish this, I'd like to create an in-memory PCollection that can then
> be passed into an MRPipeline, but if I do this with MemPipeline.collectionOf
> I get an error:
>
> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>
> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline?
>
> Thanks!
>
> -Ben
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
