spark-user mailing list archives

From Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Subject partition recomputation in big lineage RDDs
Date Wed, 30 Sep 2015 12:05:16 GMT
Hi,


Suppose I keep an up-to-date version of my RDD by ingesting new events as an RDD called RDD_inc
(from "increment"), and I provide a "merge" function m(RDD, RDD_inc) which returns RDD_new.
It looks like I can evolve the state of my RDD by constructing new RDDs all the time, hopefully
reusing as much data as possible from the previous RDD and letting the rest become garbage
collectable. An example merge function would be a join on some ids, producing a merged state for
each element. The result of m(RDD, RDD_inc) has the same type as RDD.
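

For concreteness, a minimal sketch of what I mean by the merge function; the State/Event types
and the fullOuterJoin-based semantics are just an assumption to illustrate, not my actual code:

import org.apache.spark.rdd.RDD

// Hypothetical state and increment element types, just for illustration
case class State(id: Long, total: Long)
case class Event(id: Long, delta: Long)

// One possible merge: a full outer join on id, producing a merged state per element;
// the result has the same type as the state RDD, so it can be merged again later
def merge(state: RDD[(Long, State)], inc: RDD[(Long, Event)]): RDD[(Long, State)] =
  state.fullOuterJoin(inc).mapValues {
    case (Some(s), Some(e)) => State(s.id, s.total + e.delta) // existing id, new event
    case (Some(s), None)    => s                              // id untouched by the increment
    case (None, Some(e))    => State(e.id, e.delta)           // id seen for the first time
    case (None, None)       => sys.error("unreachable")       // fullOuterJoin never emits this
  }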


My question is how recomputation works for such an RDD, which is not the direct result of an
HDFS load but the result of a long lineage of such merge functions/transformations:


Let's say my RDD, after 2 merge iterations, now looks like this:

RDD_new = merge(merge(RDD, RDD_inc1), RDD_inc2)
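
In code form (reusing the sketch above, with hypothetical inputs rdd0, inc1, inc2), with
toDebugString as a way to see the lineage growing by one join per iteration:

val rddNew = merge(merge(rdd0, inc1), inc2)

// Each iteration stacks another join on top of the previous result, so the lineage,
// and the cost of recomputing a lost partition from scratch, keeps growing
println(rddNew.toDebugString)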


When recomputing a part of RDD_new, here are my assumptions:

- only full partitions are recomputed, nothing more granular?

- the corresponding partitions of RDD, RDD_inc1 and RDD_inc2 are recomputed

- the merge functions are applied


This seems overly simplistic, since in the general case the partitions of all these RDDs do not
fully align. The other aspect is the potentially redundant loading of data that is in fact no
longer required (the data ruled out during the merge).
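
To frame that concern, a minimal sketch of what I would hope for (again using the hypothetical
names above): persisting the merged RDD should keep its partitions materialized, so a later
recomputation would only reach back through the full lineage if the cached blocks are lost:

import org.apache.spark.storage.StorageLevel

// Mark the merged RDD for caching; as long as its blocks survive, downstream
// recomputation should not need to replay the whole merge lineage from the base load
val rddCached = rddNew.persist(StorageLevel.MEMORY_AND_DISK)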


A more detailed version of this question is at https://www.quora.com/How-does-Spark-RDD-recomputation-avoids-duplicate-loading-or-computation/


Thanks,

Nicu
