spark-user mailing list archives

From Igor Berman <igor.ber...@gmail.com>
Subject Re: Issue with the class generated from avro schema
Date Fri, 09 Oct 2015 20:36:09 GMT
I think there is a deepCopy method for the generated Avro classes.
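
To illustrate why a copy helps, here is a minimal sketch of the suspected mechanism in plain Scala, without Spark or Avro. The `Data` class, `copyOf` and `readAll` below are hypothetical stand-ins, not the real generated class or reader: `readAll` hands out one mutable instance for every row, which is what I assume the reader does. With real Avro classes the copy step would be something like `SpecificData.get().deepCopy(record.getSchema, record)`; treat that as a suggestion to verify against your Avro version.

```scala
object ReuseDemo {
  // Hypothetical stand-in for an Avro-generated class: a mutable record.
  final class Data(var userId: String, var date: String) {
    // Plays the role of a deep copy of the record.
    def copyOf(): Data = new Data(userId, date)
  }

  // Hypothetical reader that reuses ONE Data instance for every row
  // (an assumption about how the real reader behaves).
  def readAll(rows: Seq[(String, String)]): Iterator[Data] = {
    val reused = new Data("", "")
    rows.iterator.map { case (u, d) =>
      reused.userId = u
      reused.date = d
      reused
    }
  }

  val rows = Seq(("u1", "2015-10-01"), ("u2", "2015-10-02"), ("u3", "2015-10-03"))

  // Streaming use: each row is turned into a String before the next row
  // overwrites the shared instance, so all keys are seen.
  val streamed: Int =
    readAll(rows).map(x => x.userId + x.date).toList.distinct.size

  // "Cached" without copying: the buffer holds three references to the
  // same object, which now contains only the last row.
  val cachedRaw: Int =
    readAll(rows).toVector.map(x => x.userId + x.date).distinct.size

  // Copy each record before buffering it.
  val cachedCopies: Int =
    readAll(rows).map(_.copyOf()).toVector.map(x => x.userId + x.date).distinct.size

  def main(args: Array[String]): Unit = {
    println(streamed)     // 3
    println(cachedRaw)    // 1
    println(cachedCopies) // 3
  }
}
```

In the Spark job the equivalent change would be to map every loaded record to its copy right after loading and before calling cache().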

On 9 October 2015 at 23:32, Bartłomiej Alberski <alberskib@gmail.com> wrote:

> I knew that one possible solution would be to map the loaded object into
> another class just after reading from HDFS.
> I was looking for a solution that allows reuse of the Avro-generated classes.
> That would be useful when a record has more than 22 fields, because you
> do not need to write boilerplate code for mapping to and from the class,
> i.e. loading the data as instances of the class generated from Avro,
> updating some fields, removing duplicates, and saving the results with
> exactly the same schema.
>
> Thank you for the answer; at least I know that there is no way to make it
> work.
>
>
> 2015-10-09 20:19 GMT+02:00 Igor Berman <igor.berman@gmail.com>:
>
>> You should create a copy of your Avro data before working with it, i.e.
>> just after loadFromHDFS, map each object into a new instance that is a
>> deep copy of it. This is connected to the way the Spark/Avro reader reads
>> Avro files (it reuses some buffer or something).
>>
>> On 9 October 2015 at 19:05, alberskib <alberskib@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a piece of code written in Spark that loads data from HDFS into
>>> Java classes generated from Avro IDL. On the RDD created that way I
>>> execute a simple operation whose result depends on whether I cache the
>>> RDD before it or not, i.e. if I run the code below
>>>
>>> val loadedData = loadFromHDFS[Data](path,...)
>>> println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
>>>
>>> the program will print 200000. On the other hand, executing the next code
>>>
>>> val loadedData = loadFromHDFS[Data](path,...).cache()
>>> println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
>>>
>>> results in 1 printed to stdout.
>>>
>>> When I inspect values of the fields after reading cached data it seems
>>>
>>> I am pretty sure that the root cause of the described problem is an issue
>>> with serialization of the classes generated from Avro IDL, but I do not
>>> know how to resolve it. I tried to use Kryo, registering the generated
>>> class (Data), and registering different serializers from chill_avro for
>>> the given class (SpecificRecordSerializer, SpecificRecordBinarySerializer,
>>> etc.), but none of those ideas helped.
>>>
>>> I posted exactly the same question on Stack Overflow but did not receive
>>> any response:
>>> http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema
>>>
>>> What is more, I created a minimal working example that makes it easy to
>>> reproduce the problem:
>>> https://github.com/alberskib/spark-avro-serialization-issue
>>>
>>> How can I solve this problem?
>>>
>>>
>>> Thanks,
>>> Bartek
>>
>
