From: Igor Berman
Date: Fri, 9 Oct 2015 23:36:09 +0300
Subject: Re: Issue with the class generated from avro schema
To: Bartłomiej Alberski
Cc: user@spark.apache.org

I think there is a deepCopy method on the Avro-generated classes.

On 9 October 2015 at 23:32, Bartłomiej Alberski wrote:
> I knew that one possible solution would be to map the loaded object into
> another class just after reading it from HDFS.
> I was looking for a solution that enables reuse of the Avro-generated
> classes. It could be useful when your record has more than 22 fields,
> because you do not need to write boilerplate code for mapping to and from
> the class, i.e. loading data as instances of the class generated from
> Avro, updating some fields, removing duplicates, and saving the results
> with exactly the same schema.
>
> Thank you for the answer; at least I know that there is no way to make it
> work.
>
>
> 2015-10-09 20:19 GMT+02:00 Igor Berman:
>
>> You should create a copy of your Avro data before working with it, i.e.
>> just after loadFromHDFS, map it into a new instance that is a deep copy
>> of the object. It's connected to the way the Spark/Avro reader reads
>> Avro files (it reuses some buffer or something).
>>
>> On 9 October 2015 at 19:05, alberskib wrote:
>>
>>> Hi all,
>>>
>>> I have a piece of code written in Spark that loads data from HDFS into
>>> Java classes generated from Avro IDL. On an RDD created in that way I am
>>> executing a simple operation whose result depends on whether I cache the
>>> RDD first, i.e. if I run the code below
>>>
>>> val loadedData = loadFromHDFS[Data](path,...)
>>> println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
>>>
>>> the program will print 200000; on the other hand, executing the next
>>> code
>>>
>>> val loadedData = loadFromHDFS[Data](path,...).cache()
>>> println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
>>>
>>> results in 1 printed to stdout.
>>>
>>> When I inspect the values of the fields after reading cached data it
>>> seems
>>>
>>> I am pretty sure that the root cause of the described problem is an
>>> issue with serialization of the classes generated from Avro IDL, but I
>>> do not know how to resolve it. I tried to use Kryo, registering the
>>> generated class (Data), and registering different serializers from
>>> chill_avro for that class (SpecificRecordSerializer,
>>> SpecificRecordBinarySerializer, etc.), but none of those ideas helped.
>>>
>>> I posted exactly the same question on Stack Overflow but did not
>>> receive any response:
>>> http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema
>>>
>>> What is more, I created a minimal working example, thanks to which it
>>> will be easy to reproduce the problem:
>>> https://github.com/alberskib/spark-avro-serialization-issue
>>>
>>> How can I solve this problem?
>>>
>>> Thanks,
>>> Bartek
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-the-class-generated-from-avro-schema-tp24997.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>
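The reuse pitfall Igor describes (the Avro input format handing back the same record instance for every row, so caching the raw objects stores N references to one buffer) can be sketched outside Spark. Below is a minimal Python illustration; the names `Record` and `read_records` are hypothetical stand-ins, not Spark or Avro API:

```python
import copy

class Record:
    """Stand-in for an Avro-generated record class (illustrative only)."""
    def __init__(self):
        self.user_id = None
        self.date = None

def read_records(rows):
    """Simulates a reader that reuses one record instance, the way the
    Avro/Hadoop input format reuses its buffer object."""
    buf = Record()              # single reused instance
    for user_id, date in rows:
        buf.user_id = user_id   # fields overwritten in place
        buf.date = date
        yield buf               # every yield returns the SAME object

rows = [(1, "2015-10-09"), (2, "2015-10-08"), (3, "2015-10-07")]

# Holding on to the raw objects (what cache() effectively does) keeps
# three references to one mutated buffer: distinct() would see 1 value.
cached_wrong = list(read_records(rows))
assert len({id(r) for r in cached_wrong}) == 1
assert len({(r.user_id, r.date) for r in cached_wrong}) == 1

# The fix: copy each record before retaining it, so each element owns
# its own state: distinct() now sees all 3 values.
cached_right = [copy.deepcopy(r) for r in read_records(rows)]
assert len({(r.user_id, r.date) for r in cached_right}) == 3
```

In the actual Spark job, the analogous fix is the one Igor suggests: map each element through a deep copy (e.g. the deepCopy he mentions on the generated class) immediately after loading and before calling `cache()`.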