crunch-user mailing list archives

From Josh Wills <>
Subject Re: SparkPipeline possible avro reuse on cache()
Date Thu, 08 Oct 2015 18:51:03 GMT
Yeah, I could see how that would happen. I think the move would be to
inject a deep copy inside of the RDD that sits underneath a cached
PCollection. I can probably take a crack at a patch later this weekend; I
have a busy couple of days w/the baby and new job and whatnot. :)
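
For illustration only, here is a minimal plain-Java sketch of why caching
references to a reused object goes wrong and how a deep copy avoids it. It
does not use the Crunch or Spark APIs and is not the eventual patch; the
names ReuseSketch, Rec, and cache are hypothetical, and Rec merely stands in
for a mutable Avro record that a reader reuses between inputs.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {
    // Hypothetical mutable record standing in for a reused Avro object.
    static final class Rec {
        String value;
        Rec(String value) { this.value = value; }
        Rec copy() { return new Rec(value); }  // deep copy (one field here)
    }

    // Simulates a reader that overwrites a single instance for every input
    // while a cache holds on to whatever it was handed.
    static List<String> cache(String[] inputs, boolean deepCopy) {
        Rec reused = new Rec(null);
        List<Rec> cached = new ArrayList<>();
        for (String in : inputs) {
            reused.value = in;  // reader mutates the same object
            cached.add(deepCopy ? reused.copy() : reused);
        }
        // Read the cached values back, as a downstream step would.
        List<String> out = new ArrayList<>();
        for (Rec r : cached) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(cache(inputs, false)); // prints [c, c, c]
        System.out.println(cache(inputs, true));  // prints [a, b, c]
    }
}
```

Without the copy, every cached entry points at the same instance, so
downstream code only ever sees the last value written.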


On Thu, Oct 8, 2015 at 9:32 AM, Nithin Asokan <> wrote:

> First, I would like to thank everyone for the quick responses and fixes on
> most issues. Great job, everyone!
> I noticed that using cache() on a PTable built using SparkPipeline seems to
> reuse objects for downstream DoFns. Here is an example that exhibits this
> behavior
> I would expect the output of this program to be a Pair with the same key
> and value. However, it produces a Pair with a different key and value. I
> have tested this with a text file input source and it works as expected.
> Removing cache() also produces the expected result. So I suspect this issue
> is specific to Avro and cache().
> Any thoughts on this behavior?
> Thank you!
> Nithin

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>
