Subject: Re: Repeating Records w/ Spark + Avro?
From: Chris Miller
To: Peyman Mohajerian
Cc: user@avro.apache.org, user
Date: Sat, 12 Mar 2016 18:02:28 +0800

Well, I kind of got it... this works below:

*****************
val rdd = sc.newAPIHadoopFile(path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  .map(_._1.datum)

rdd
  .map(record => {
    println(record.get("myValue"))
  })
  .take(10)
*****************

Seems strange to me that I have to iterate over the RDD effectively two
times -- once to create the RDD and again to perform my action. It also
seems strange that I can't actually access the data in my RDD until I've
copied the records. I would think this is a *very* common use case of an
RDD -- accessing the data it contains (otherwise, what's the point?).

Is there a way to always enable cloning? There used to be a cloneRecords
parameter on the hadoopFile method, but that seems to have been removed.

Finally, if I add rdd.persist(), then it doesn't work. I guess I would need
to do .map(_._1.datum) again before the map that does the real work.

--
Chris Miller

On Sat, Mar 12, 2016 at 4:15 PM, Chris Miller <cmiller11101@gmail.com> wrote:

> Wow! That sure is buried in the documentation! But yeah, that's what I
> thought more or less.
>
> I tried copying as follows, but that didn't work:
>
> *****************
> val copyRDD = singleFileRDD.map(_.copy())
> *****************
>
> When I iterate over the new copyRDD (foreach or map), I still have the
> same problem of duplicate records. I also tried copying within the block
> where I'm using it, but that didn't work either:
>
> *****************
> rdd
>   .take(10)
>   .map(item => {
>     val copied = item.copy()
>     val record = copied._1.datum()
>
>     println(record.get("myValue"))
>   })
> *****************
>
> What am I doing wrong?
>
> --
> Chris Miller
>
> On Sat, Mar 12, 2016 at 1:48 PM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>
>> Here is the reason for the behavior:
>>
>> '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
>> object for each record, directly caching the returned RDD or directly
>> passing it to an aggregation or shuffle operation will create many
>> references to the same object. If you plan to directly cache, sort, or
>> aggregate Hadoop writable objects, you should first copy them using a map
>> function.
>>
>> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html
>>
>> So it is Hadoop related.
>>
>> On Fri, Mar 11, 2016 at 3:19 PM, Chris Miller <cmiller11101@gmail.com> wrote:
>>
>>> I have a bit of a strange situation:
>>>
>>> *****************
>>> import org.apache.avro.generic.{GenericData, GenericRecord}
>>> import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper, AvroKey}
>>> import org.apache.avro.mapreduce.AvroKeyInputFormat
>>> import org.apache.hadoop.io.{NullWritable, WritableUtils}
>>>
>>> val path = "/path/to/data.avro"
>>>
>>> val rdd = sc.newAPIHadoopFile(path,
>>>     classOf[AvroKeyInputFormat[GenericRecord]],
>>>     classOf[AvroKey[GenericRecord]],
>>>     classOf[NullWritable])
>>> rdd.take(10).foreach( x => println( x._1.datum() ))
>>> *****************
>>>
>>> In this situation, I get the right number of records returned, and if I
>>> look at the contents of rdd I see the individual records as tuple2's...
>>> however, if I println on each one as shown above, I get the same result
>>> every time.
>>>
>>> Apparently this has to do with something in Spark or Avro keeping a
>>> reference to the item it's iterating over, so I need to clone the object
>>> before I use it. However, if I try to clone it (from the spark-shell
>>> console), I get:
>>>
>>> *****************
>>> rdd.take(10).foreach( x => {
>>>   val clonedDatum = x._1.datum().clone()
>>>   println(clonedDatum)
>>> })
>>>
>>> <console>:37: error: method clone in class Object cannot be accessed in
>>> org.apache.avro.generic.GenericRecord
>>>  Access to protected method clone not permitted because
>>>  prefix type org.apache.avro.generic.GenericRecord does not conform to
>>>  class $iwC where the access takes place
>>>         val clonedDatum = x._1.datum().clone()
>>> *****************
>>>
>>> So, how can I clone the datum?
>>>
>>> Seems I'm not the only one who ran into this problem:
>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I
>>> can't figure out how to fix it in my case without hacking away like the
>>> person in the linked PR did.
>>>
>>> Suggestions?
>>>
>>> --
>>> Chris Miller
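
For reference, below is a minimal sketch (not code from the thread itself) of
one common way to apply the copy-before-caching advice quoted above: deep-copy
each Avro datum inside a map using GenericData.deepCopy, which also sidesteps
the protected clone() restriction on GenericRecord. The file path and the
"myValue" field are placeholders carried over from the examples above:

*****************
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Placeholder path, as in the thread above.
val path = "/path/to/data.avro"

// RDD of (AvroKey[GenericRecord], NullWritable) pairs; the underlying
// RecordReader reuses the same key/datum objects for every record.
val raw = sc.newAPIHadoopFile(path,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// Deep-copy each datum before caching so every element of the RDD holds
// its own record instead of a reference to the reused buffer.
val records = raw.map { case (key, _) =>
  val datum = key.datum()
  GenericData.get().deepCopy(datum.getSchema, datum)
}

records.persist()

// Extract a plain field value before take() so only small values are
// shipped back to the driver.
records.map(_.get("myValue").toString).take(10).foreach(println)
*****************

GenericData.deepCopy rebuilds the record field by field from its schema rather
than relying on Object.clone(), which is why it avoids the access error shown
above.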