crunch-user mailing list archives

From: Nithin Asokan <anithi...@gmail.com>
Subject: Re: SparkPipeline possible avro reuse on cache()
Date: Thu, 08 Oct 2015 21:48:44 GMT
Thanks Josh. I logged https://issues.apache.org/jira/browse/CRUNCH-569 and
will try submitting a patch for this.
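
In the meantime, here is a possible workaround sketch (not the actual
CRUNCH-569 fix; it assumes a hypothetical Avro SpecificRecord class MyRecord
and a PTable<String, MyRecord> named table): deep-copy each value before
calling cache(), so the cached RDD does not hold references to the single
record object the Avro reader reuses.

  import org.apache.avro.specific.SpecificData;
  import org.apache.crunch.MapFn;
  import org.apache.crunch.PTable;
  import org.apache.crunch.types.avro.Avros;

  // MyRecord is a placeholder for whatever Avro specific record the table holds
  PTable<String, MyRecord> detached = table.mapValues(
      new MapFn<MyRecord, MyRecord>() {
        @Override
        public MyRecord map(MyRecord rec) {
          // allocate a fresh copy per element instead of caching the reused object
          return SpecificData.get().deepCopy(rec.getSchema(), rec);
        }
      }, Avros.specifics(MyRecord.class));
  detached.cache();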

On Thu, Oct 8, 2015 at 1:51 PM Josh Wills <jwills@cloudera.com> wrote:

> Yeah, I could see how that would happen. I think the move would be to
> inject a deep copy inside the RDD that is underneath a cached
> PCollection. I can probably take a crack at a patch later this weekend; I
> have a busy couple of days w/the baby and new job and what not. :)
>
> J
>
> On Thu, Oct 8, 2015 at 9:32 AM, Nithin Asokan <anithin19@gmail.com> wrote:
>
>> First, I would like to thank everyone for the quick responses and fixes
>> on most of the issues. Great job, everyone!
>>
>> I noticed that calling cache() on a PTable built with SparkPipeline seems
>> to reuse objects in downstream DoFns. Here is an example that exhibits
>> this behavior:
>>
>> https://gist.github.com/nasokan/531b4ff9bf827d0835ab
>>
>> I would expect this program's output to be pairs whose key and value are
>> the same. However, it produces Pairs with differing key and value. I have
>> tested this with a text file input source and it works as expected.
>> Removing cache() also produces the expected result, so I suspect this
>> issue is specific to Avro and cache().
>>
>> Any thoughts on this behavior?
>>
>> Thank you!
>> Nithin
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
