From: Igor Berman
Date: Fri, 9 Oct 2015 23:36:09 +0300
Subject: Re: Issue with the class generated from avro schema
To: Bartłomiej Alberski
Cc: user@spark.apache.org

I think there is a deepCopy method on the Avro-generated classes.

On 9 October 2015 at 23:32, Bartłomiej Alberski wrote:
> I knew that one possible solution would be to map the loaded object into
> another class just after reading it from HDFS.
> I was looking for a solution that enables reuse of the Avro-generated
> classes. It could be useful when your record has more than 22 fields,
> because you do not need to write boilerplate code for mapping to and from
> the class, i.e. loading data as instances of the class generated from
> Avro, updating some fields, removing duplicates, and saving the results
> with exactly the same schema.
>
> Thank you for the answer; at least I know that there is no way to make it
> work.
>
>
> 2015-10-09 20:19 GMT+02:00 Igor Berman:
>
>> You should create a copy of your Avro data before working with it, i.e.
>> just after loadFromHDFS, map it into a new instance that is a deep copy
>> of the object. It's connected to the way the Spark/Avro reader reads
>> Avro files (it reuses some buffer or something).
>>
>> On 9 October 2015 at 19:05, alberskib wrote:
>>
>>> Hi all,
>>>
>>> I have a piece of code written in Spark that loads data from HDFS into
>>> Java classes generated from Avro IDL. On an RDD created in that way I am
>>> executing a simple operation whose result depends on whether I cache the
>>> RDD first, i.e. if I run the code below
>>>
>>> val loadedData = loadFromHDFS[Data](path,...)
>>> println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 200000
>>>
>>> the program will print 200000; on the other hand, executing the next
>>> code
>>>
>>> val loadedData = loadFromHDFS[Data](path,...).cache()
>>> println(loadedData.map(x => x.getUserId + x.getDate).distinct().count()) // 1
>>>
>>> results in 1 printed to stdout.
>>>
>>> When I inspect the values of the fields after reading cached data it
>>> seems
>>>
>>> I am pretty sure that the root cause of the described problem is an
>>> issue with serialization of the classes generated from Avro IDL, but I
>>> do not know how to resolve it. I tried to use Kryo, registering the
>>> generated class (Data), and registering different serializers from
>>> chill_avro for that class (SpecificRecordSerializer,
>>> SpecificRecordBinarySerializer, etc.), but none of those ideas helped.
>>>
>>> I posted exactly the same question on Stack Overflow but did not
>>> receive any response:
>>> http://stackoverflow.com/questions/33027851/spark-issue-with-the-class-generated-from-avro-schema
>>>
>>> What is more, I created a minimal working example, thanks to which it
>>> will be easy to reproduce the problem:
>>> https://github.com/alberskib/spark-avro-serialization-issue
>>>
>>> How can I solve this problem?
>>>
>>> Thanks,
>>> Bartek
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-the-class-generated-from-avro-schema-tp24997.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>
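The reuse pitfall Igor describes (the Avro input format handing back the same record instance for every row, so caching the raw objects stores N references to one buffer) can be sketched outside Spark. Below is a minimal Python illustration; the names `Record` and `read_records` are hypothetical stand-ins, not Spark or Avro API:

```python
import copy

class Record:
    """Stand-in for an Avro-generated record class (illustrative only)."""
    def __init__(self):
        self.user_id = None
        self.date = None

def read_records(rows):
    """Simulates a reader that reuses one record instance, the way the
    Avro/Hadoop input format reuses its buffer object."""
    buf = Record()              # single reused instance
    for user_id, date in rows:
        buf.user_id = user_id   # fields overwritten in place
        buf.date = date
        yield buf               # every yield returns the SAME object

rows = [(1, "2015-10-09"), (2, "2015-10-08"), (3, "2015-10-07")]

# Holding on to the raw objects (what cache() effectively does) keeps
# three references to one mutated buffer: distinct() would see 1 value.
cached_wrong = list(read_records(rows))
assert len({id(r) for r in cached_wrong}) == 1
assert len({(r.user_id, r.date) for r in cached_wrong}) == 1

# The fix: copy each record before retaining it, so each element owns
# its own state: distinct() now sees all 3 values.
cached_right = [copy.deepcopy(r) for r in read_records(rows)]
assert len({(r.user_id, r.date) for r in cached_right}) == 3
```

In the actual Spark job, the analogous fix is the one Igor suggests: map each element through a deep copy (e.g. the deepCopy he mentions on the generated class) immediately after loading and before calling `cache()`.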