crunch-user mailing list archives

From Josh Wills
Subject Re: In memory PCollection for use in MRPipeline
Date Wed, 21 Jan 2015 19:19:26 GMT
Hey Ben,

No easy way to do it right now besides writing the data yourself, though
that sort of simulation-based use case has been in the back of my mind ever
since we added the NLineFileSource. What would your ideal API look like?
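
For reference, a minimal sketch of the "write the data yourself" workaround,
assuming one dummy seed line per desired DoFn input. The class name
SeedFileWriter and the method writeSeeds are hypothetical, and the sketch
writes to the local filesystem only; on a real cluster the file would need
to live somewhere the job can read it (e.g. HDFS). The resulting file could
then be read into the MRPipeline as text, for instance with
pipeline.readTextFile(path), or through an NLineFileSource to control how
many seed lines each map task receives.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch: writes dummy seed records to a file.
public class SeedFileWriter {

    // Writes `count` dummy lines ("0", "1", ...) to `target`, one per line,
    // and returns the path that was written.
    public static Path writeSeeds(Path target, int count) throws IOException {
        List<String> lines = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            lines.add(Integer.toString(i));
        }
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a local temp file is good enough for a quick trial run.
        Path seeds = Files.createTempFile("crunch-seeds", ".txt");
        writeSeeds(seeds, 100);
        System.out.println("Wrote seed file: " + seeds);
    }
}
```

The DoFn can ignore the line contents entirely; the number of lines only
determines how many times it gets called.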


On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears wrote:

> Hi,
> I'm trying to write a Crunch job to generate a large amount of simulated
> data.  To kick the job off, I need inputs into a do function.  These inputs
> are essentially dummy values that will be ignored in the do fn.  To
> accomplish this, I'd like to create an in-memory PCollection that can then
> be passed into a MR pipeline, but if I do this with MemPipeline.collectionOf
> I get an error:
> Exception in thread "main" java.lang.IllegalStateException:  named 'null' cannot be serialized
> 	at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(
> 	at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(
> Is it possible to explicitly declare/instantiate a PCollection to pass into an MRPipeline?
> Thanks!
> -Ben

Director of Data Science
Cloudera
Twitter: @josh_wills
