crunch-user mailing list archives

From Nithin Asokan <anithi...@gmail.com>
Subject SparkPipeline possible avro reuse on cache()
Date Thu, 08 Oct 2015 16:32:03 GMT
First, I would like to thank everyone for the quick responses and fixes on
most issues. Great job, everyone!

I noticed that calling cache() on a PTable built with SparkPipeline seems to
reuse objects for downstream DoFns. Here is an example that exhibits this
behavior:

https://gist.github.com/nasokan/531b4ff9bf827d0835ab

I would expect this program to output pairs in which the key and value
match. Instead, it produces pairs whose key and value differ. I tested the
same program with a text file input source and it works as expected.
Removing cache() also produces the expected result, so I suspect this issue
is specific to Avro and cache().
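
A minimal plain-Java sketch of the kind of object reuse suspected here. It does not use Crunch, Spark, or Avro, and it is not the code from the gist: the Record class and the pairs method are hypothetical stand-ins. It assumes a reader that overwrites one mutable record per element, and a cache that stores references to that record rather than copies.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {
    // Hypothetical stand-in for a mutable record that a reader may reuse.
    static final class Record {
        String id;

        Record copy() {
            Record r = new Record();
            r.id = id;
            return r;
        }
    }

    // Reads the input with a single reused Record. Keys are captured as
    // immutable strings at read time; values are held in a "cache" either
    // as references to the reused instance or as deep copies.
    static List<String> pairs(String[] input, boolean copyBeforeCaching) {
        Record reused = new Record();
        List<String> keys = new ArrayList<>();
        List<Record> cachedValues = new ArrayList<>();
        for (String s : input) {
            reused.id = s; // the reader overwrites the same instance
            keys.add(reused.id);
            cachedValues.add(copyBeforeCaching ? reused.copy() : reused);
        }
        // Downstream step: pair each key with its cached value.
        List<String> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            out.add(keys.get(i) + "=" + cachedValues.get(i).id);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] input = {"a", "b", "c"};
        // Without a copy, every cached value aliases the last record read.
        System.out.println(pairs(input, false)); // prints [a=c, b=c, c=c]
        // With a copy, each key keeps its matching value.
        System.out.println(pairs(input, true));  // prints [a=a, b=b, c=c]
    }
}
```

If something like this is happening, copying each record before it is cached would be one way to confirm it. Whether cache() in SparkPipeline actually behaves this way for Avro is exactly the open question.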

Any thoughts on this behavior?

Thank you!
Nithin
