From: Bartłomiej Alberski
Date: Fri, 9 Oct 2015 22:32:30 +0200
Subject: Re: Issue with the class generated from avro schema
To: Igor Berman
Cc: user@spark.apache.org

I knew that one possible solution would be to map the loaded object into another class just after reading it from HDFS.
I was looking for a solution that enables reuse of the Avro-generated classes.
It could be useful when your record has more than 22 fields, because you do not need to write boilerplate code for mapping from and to the class, i.e. loading the data as instances of the class generated from Avro, updating some fields, removing duplicates, and saving the results with exactly the same schema.
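To make it concrete, the kind of round trip I have in mind is sketched below. This is only a sketch, not my actual job: it assumes the plain avro-mapreduce input/output formats, an existing SparkContext sc, illustrative inputPath/outputPath values, and the generated Data class from my example (with a userId setter).

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

// Assumes an existing SparkContext `sc` and the Avro-generated class `Data`.
val job = Job.getInstance(sc.hadoopConfiguration)
AvroJob.setInputKeySchema(job, Data.getClassSchema)
AvroJob.setOutputKeySchema(job, Data.getClassSchema)

// Read the records as instances of the generated class.
val records = sc
  .newAPIHadoopFile(
    inputPath,
    classOf[AvroKeyInputFormat[Data]],
    classOf[AvroKey[Data]],
    classOf[NullWritable],
    job.getConfiguration)
  .map { case (key, _) => key.datum() } // NB: copy each datum first if this RDD is cached (see the copy step further down)

// Update a field and drop duplicates without a hand-written mapping class.
val updated = records
  .map { d => d.setUserId(d.getUserId.toString.toLowerCase); d } // example field update
  .distinct()

// Write the result back with exactly the same schema.
updated
  .map(d => (new AvroKey[Data](d), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[AvroKey[Data]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[Data]],
    job.getConfiguration)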

Thank you for the answer, at least I know that there is no way to make it work.


2015-10-09 20:19 GMT+02:00 Igor Berman <igor.berman@gmail.com>:
You should create a copy of your Avro data before working with it, i.e. just after loadFromHDFS map it into a new instance that is a deep copy of the object.
It's connected to the way the Spark/Avro reader reads Avro files (it reuses some buffer or something).
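
For illustration, one way to take that copy is sketched below (a sketch only: loadFromHDFS is the helper from the snippet in the original post, with its extra arguments elided; SpecificData.deepCopy, or equivalently Data.newBuilder(r).build(), detaches each record from the reused instance):

import org.apache.avro.specific.SpecificData

// Deep-copy every record right after loading, before caching, so the cached
// RDD does not hold many references to the reader's single reused object.
val loadedData = loadFromHDFS[Data](path)
  .map(r => SpecificData.get().deepCopy(r.getSchema, r))
  .cache()

// With the copies in place the cached and uncached runs should agree.
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // expected 200000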

On 9 October 2015 at 19:05, alberskib <alberskib@gmail.com> wrote:
Hi all,

I have a piece of code written in Spark that loads data from HDFS into Java classes generated from Avro IDL. On an RDD created in that way I am executing a simple operation whose result depends on whether I cache the RDD beforehand,
i.e. if I run the code below

val loadedData = loadFromHDFS[Data](path,...)
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
the program will print 200000; on the other hand, executing the next code

val loadedData = loadFromHDFS[Data](path,...).cache()
println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
results in 1 being printed to stdout.

When I inspect the values of the fields after reading the cached data, it seems that every record contains exactly the same values.

I am pretty sure that the root cause of the described problem is an issue with
serialization of the classes generated from Avro IDL, but I do not know how to resolve it. I tried to use Kryo, registering the generated class (Data) and
registering different serializers from chill_avro for the given class
(SpecificRecordSerializer, SpecificRecordBinarySerializer, etc.), but none of
those ideas helped me.
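
For reference, the kind of registration I mean looks roughly like the sketch below (it assumes chill-avro's AvroSerializer factory and Spark's standard KryoRegistrator hook; exact names may vary by version); as said, it did not change the behaviour for me:

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.avro.AvroSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register the Avro-generated Data class with a chill-avro serializer.
class DataKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Data], AvroSerializer.SpecificRecordBinarySerializer[Data])
  }
}

// Point Spark at Kryo and at the registrator above.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[DataKryoRegistrator].getName)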

I posted exactly the same question on Stack Overflow but I did not receive any
response:
<http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema>

What is more, I created a minimal working example that makes it easy to reproduce the problem:
<https://github.com/alberskib/spark-avro-serialization-issue>

How can I solve this problem?


Thanks,
Bartek



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-the-class-generated-from-avro-schema-tp24997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


