Subject: Re: [cache eviction] partition recomputation in big lineage RDDs
From: Hemant Bhanawat
To: Nicolae Marasoiu
Cc: "user@spark.apache.org"
Date: Thu, 1 Oct 2015 12:51:14 +0530

As I understand it, you don't need a merge of your historical data RDD with your RDD_inc; what you need is a merge of the computation results of your historical RDD with RDD_inc, and so on.

IMO, you should consider having an external row store to hold your computations. I say this because you need to update the rows of a prior computation based on the new data. Spark cached batches are column-oriented, and any update to a Spark cached batch is a costly operation.
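
A minimal sketch of that idea (assuming pair RDDs of results keyed by id; the row-store upsert is stubbed out, since the client API depends on which store you pick):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MergeResults {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("merge-results").setMaster("local[*]"))

    // Results of the prior computation, keyed by id
    // (in practice, read back from the external row store).
    val prior = sc.parallelize(Seq(("a", 10L), ("b", 5L)))

    // Results computed from the new increment only.
    val inc = sc.parallelize(Seq(("b", 2L), ("c", 7L))).reduceByKey(_ + _)

    // Merge the results, not the raw data: keep rows present on either side.
    val merged = prior.fullOuterJoin(inc).mapValues {
      case (old, delta) => old.getOrElse(0L) + delta.getOrElse(0L)
    }

    // 'merged' would then be upserted into the row store,
    // e.g. via foreachPartition and the store's own client.
    merged.collect().foreach(println)
    sc.stop()
  }
}
```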


On Wed, Sep 30, 2015 at 10:59 PM, Nicolae Marasoiu <nicolae.marasoiu@adswizz.com> wrote:

Hi,


An equivalent question would be: can the memory cache be selectively evicted from within a component running in the driver? I know it breaks some abstraction/encapsulation, but I clearly need to evict part of the cache so that it is reloaded with newer values from the DB.


What I basically need is to invalidate the portions of the data which have newer values. The "compute" method should stay the same (read with TableInputFormat).
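
For reference, Spark's caching API works at whole-RDD granularity (unpersist drops all of an RDD's cached blocks, not individual partitions), so a common workaround is to cache the stable and the volatile parts as separate RDDs and drop only the latter. A rough sketch, with readLatestFromDb as a hypothetical stand-in for the TableInputFormat read:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical loader standing in for sc.newAPIHadoopRDD + TableInputFormat.
def readLatestFromDb(sc: SparkContext): RDD[(String, Long)] =
  sc.parallelize(Seq(("c", 7L)))

// Drop only the volatile slice of the cache and reload it with fresh values;
// the stable slice stays cached and untouched.
def refresh(sc: SparkContext,
            stable: RDD[(String, Long)],
            volatile: RDD[(String, Long)]): RDD[(String, Long)] = {
  volatile.unpersist(blocking = true)
  val fresh = readLatestFromDb(sc).cache()
  stable.union(fresh)
}
```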


Thanks
Nicu

From: Nicolae Marasoiu <nicolae.marasoiu@adswizz.com>
Sent: Wednesday, September 30, 2015 4:07 PM
To: user@spark.apache.org
Subject: Re: partition recomputation in big lineage RDDs

Hi,


In fact, my RDD will get a new version (a new RDD assigned to the same var) quite frequently, by merging in batches of ~1000 events from the last 10s.

But recomputation would be more efficient not by re-reading the initial RDD partition(s) and reapplying deltas, but by reading the latest data from HBase and just computing on top of that, if anything.

Basically I guess I need to write my own RDD and implement the compute method by sliding on HBase.
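
That would look roughly like the skeleton below (a sketch only; the HBase scan itself is stubbed out, since connection and serialization details vary):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per HBase key range (illustrative names).
class HBaseScanPartition(val index: Int, val startRow: String, val stopRow: String)
  extends Partition

class LatestFromHBaseRDD(sc: SparkContext, ranges: Seq[(String, String)])
  extends RDD[(String, Array[Byte])](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    ranges.zipWithIndex.map { case ((start, stop), i) =>
      new HBaseScanPartition(i, start, stop): Partition
    }.toArray

  // Re-reads the latest rows for this key range instead of replaying
  // the whole merge lineage.
  override def compute(split: Partition, context: TaskContext): Iterator[(String, Array[Byte])] = {
    val p = split.asInstanceOf[HBaseScanPartition]
    // Here one would open an HBase Scan(p.startRow, p.stopRow) and map
    // each Result to a (rowKey, value) pair; stubbed out in this sketch.
    Iterator.empty
  }
}
```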

Thanks,
Nicu

From: Nicolae Marasoiu <nicolae.marasoiu@adswizz.com>
Sent: Wednesday, September 30, 2015 3:05 PM
To: user@spark.apache.org
Subject: partition recomputation in big lineage RDDs

Hi,


If I implement a way to keep an up-to-date version of my RDD by ingesting some new events, called RDD_inc (from "increment"), and I provide a "merge" function m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time, doing it in a manner that hopes to reuse as much data from the past RDD as possible and make the rest garbage-collectable. An example merge function would be a join on some ids, creating a merged state for each element. The type of the result of m(RDD, RDD_inc) is the same as that of RDD.
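
For concreteness, a hedged sketch of one such m: a join on ids that folds the increment into the prior state. State here is a made-up element type; the real merged state depends on the job:

```scala
import org.apache.spark.rdd.RDD

// Illustrative element type for the merged state.
case class State(count: Long, lastSeen: Long)

// m(RDD, RDD_inc): join on id and build a merged state per element.
// The result type equals the input type, so merges can be chained.
def merge(prev: RDD[(String, State)], inc: RDD[(String, State)]): RDD[(String, State)] =
  prev.fullOuterJoin(inc).mapValues {
    case (old, add) =>
      val o = old.getOrElse(State(0L, 0L))
      val a = add.getOrElse(State(0L, 0L))
      State(o.count + a.count, math.max(o.lastSeen, a.lastSeen))
  }
```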


My question is: how does recomputation work for such an RDD, which is not the direct result of an HDFS load, but the result of a long lineage of such functions/transformations?


Let's say my RDD, after 2 merge iterations, now looks like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)


When recomputing a part of RDD_new, here are my assumptions:

- only full partitions are recomputed, nothing more granular?

- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed

- the functions are applied


And this seems overly simplistic, since in the general case the partitions do not fully align between all these RDDs. The other aspect is the potentially redundant load of data which is in fact no longer required (the data ruled out in the merge).
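
One way to probe these assumptions empirically is toDebugString, which prints the lineage Spark replays when partitions of the final RDD are lost. A small self-contained sketch, with two join-based merge steps standing in for merge(merge(RDD, RDD_inc1), RDD_inc2):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageInspect {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-inspect").setMaster("local[*]"))

    val base = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val inc1 = sc.parallelize(Seq(("a", 3)))
    val inc2 = sc.parallelize(Seq(("b", 4)))

    // Two join-based merge steps, analogous to merge(merge(RDD, inc1), inc2).
    val step1 = base.fullOuterJoin(inc1)
      .mapValues { case (a, b) => a.getOrElse(0) + b.getOrElse(0) }
    val rddNew = step1.fullOuterJoin(inc2)
      .mapValues { case (a, b) => a.getOrElse(0) + b.getOrElse(0) }

    // Prints the DAG replayed on partition loss: recovery is per lost
    // partition, but each join stage pulls whichever parent partitions
    // feed it, so the inputs need not align one-to-one.
    println(rddNew.toDebugString)
    sc.stop()
  }
}
```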


A more detailed version of this question is at https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/


Thanks,

Nicu

