spark-user mailing list archives

From alberskib <>
Subject Issue with the class generated from avro schema
Date Fri, 09 Oct 2015 16:05:52 GMT
Hi all, 

I have a piece of Spark code that loads data from HDFS into Java classes
generated from Avro IDL. On the RDD created this way I run a simple
operation whose result depends on whether I cache the RDD beforehand.
That is, if I run the code below

val loadedData = loadFromHDFS[Data](path,...)
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count())

the program prints 200000. On the other hand, running the following code

val loadedData = loadFromHDFS[Data](path,...).cache()
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count())

results in 1 being printed to stdout.

When I inspect values of the fields after reading cached data it seems

I am pretty sure that the root cause of this problem is an issue with the
serialization of classes generated from Avro IDL, but I do not know how to
resolve it. I tried using Kryo, registering the generated class (Data), and
registering different serializers from chill_avro for that class
(SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of
those ideas helped.
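
For readers following along, a Kryo setup of the kind described above might look roughly like the sketch below. It is an illustration, not the code from the reproduction project: the names AvroDataRegistrator, KryoSetup and the application name are hypothetical, Data stands for the Avro-generated class from this post, and chill-avro is assumed to be on the classpath.

```scala
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator: registers the Avro-generated class `Data`
// (assumed to be on the classpath) with one of the chill-avro serializers.
class AvroDataRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Data], AvroSerializer.SpecificRecordBinarySerializer[Data])
  }
}

object KryoSetup {
  // Builds a SparkConf that switches Spark to Kryo and points it at the registrator.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("avro-cache-repro") // hypothetical application name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[AvroDataRegistrator].getName)
}
```

Note that cache() uses the MEMORY_ONLY storage level, which keeps deserialized objects in memory, so the serializer configured here mainly affects shuffles and serialized storage levels.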

I posted exactly the same question on Stack Overflow but did not receive any
response.

What is more, I created a minimal working example that makes the problem
easy to reproduce.

How can I solve this problem?

