crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: SparkPipeline possible avro reuse on cache()
Date Thu, 08 Oct 2015 18:51:03 GMT
Yeah, I could see how that would happen. I think the move would be to
inject a deep copy inside of the RDD that is underneath a cached
PCollection. I can probably take a crack at a patch later this weekend; I
have a busy couple of days w/the baby and new job and what not. :)

J
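
To make the idea in the reply concrete, here is a minimal, self-contained
sketch of how caching a reused mutable object leads to aliasing, and how a
deep copy taken before caching avoids it. It is not the actual patch and uses
no Crunch, Spark, or Avro APIs: MutableRecord, CacheReuseSketch, and
readAndCache are hypothetical names invented for illustration, and the
"cache" is just an in-memory list. Whether this is exactly what happens in
the linked gist is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a mutable Avro record; not a Crunch or Avro class.
final class MutableRecord {
    String key;
    String value;

    MutableRecord(String key, String value) {
        this.key = key;
        this.value = value;
    }

    // Field-by-field copy. Strings are immutable, so copying the
    // references is enough to make the new record independent.
    MutableRecord deepCopy() {
        return new MutableRecord(key, value);
    }
}

final class CacheReuseSketch {
    // Simulates a reader that reuses ONE record instance for every input row
    // and stores whatever beforeCache returns in an in-memory "cache".
    static List<MutableRecord> readAndCache(String[][] rows,
                                            UnaryOperator<MutableRecord> beforeCache) {
        MutableRecord reused = new MutableRecord(null, null);
        List<MutableRecord> cache = new ArrayList<>();
        for (String[] row : rows) {
            reused.key = row[0];   // overwrite the shared instance in place
            reused.value = row[1];
            cache.add(beforeCache.apply(reused));
        }
        return cache;
    }

    public static void main(String[] args) {
        String[][] rows = {{"a", "1"}, {"b", "2"}, {"c", "3"}};

        // No copy: every cached entry is the same object, holding the last row.
        List<MutableRecord> aliased = readAndCache(rows, UnaryOperator.identity());

        // Deep copy before caching: each cached entry keeps its own row.
        List<MutableRecord> copied = readAndCache(rows, MutableRecord::deepCopy);

        System.out.println(aliased.get(0).key + "=" + aliased.get(0).value); // prints "c=3"
        System.out.println(copied.get(0).key + "=" + copied.get(0).value);   // prints "a=1"
    }
}
```

The copy costs one allocation per record, which is the usual trade-off for
making cached values safe to hand to downstream functions.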

On Thu, Oct 8, 2015 at 9:32 AM, Nithin Asokan <anithin19@gmail.com> wrote:

> First, I would like to thank everyone for the quick responses and fixes on
> most issues. Great job everyone!
>
> I noticed that calling cache() on a PTable built with SparkPipeline seems to
> reuse objects for downstream DoFns. Here is an example that exhibits this
> behavior:
>
> https://gist.github.com/nasokan/531b4ff9bf827d0835ab
>
> I would expect the output of this program to be a Pair with the same key and
> value. However, it produces a Pair whose key and value differ. I have tested
> this with a text file input source and it works as expected. Removing cache()
> also produces the expected result. So I suspect this issue is specific
> to Avro and cache().
>
> Any thoughts on this behavior?
>
> Thank you!
> Nithin
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
