spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: problem about broadcast variable in iteration
Date Sun, 25 May 2014 21:47:54 GMT
Hi Randy,

Spark 1.0 included a lot of work to allow unpersisting data that's no longer
needed.  See the pull request linked below.

Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the
next variable to see if you can cut the dependency there.

https://github.com/apache/spark/pull/126
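
To make the suggestion concrete, here is a rough sketch of the loop with the
unpersist call added just before the re-broadcast. The bodies of doSomething
and updateKV, the sample data, and the (String, Int) element type are all
made up for illustration, since the original code elides them; only the shape
of the loop follows your code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastLoopSketch {
  // Hypothetical stand-ins: the real doSomething/updateKV are not shown in the thread.
  def doSomething(t: (String, Int), kv: Map[String, Int]): (String, Int) =
    (t._1, t._2 + kv.getOrElse(t._1, 0))

  def updateKV(tmp: Array[(String, Int)]): Map[String, Int] = tmp.toMap

  def run(sc: SparkContext, n: Int): Array[(String, Int)] = {
    // Made-up sample data standing in for the elided rdd2 and kv.
    var rdd2: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    var kv = Map("a" -> 1, "b" -> 1)
    var kvGlobal: Broadcast[Map[String, Int]] = sc.broadcast(kv)

    for (_ <- 0 until n) {
      val bc = kvGlobal // capture a val so the closure does not hold the var
      val rdd1 = rdd2.map(t => doSomething(t, bc.value)).cache()
      val tmp = rdd1.reduceByKey(_ + _).collect() // materializes and caches rdd1
      kv = updateKV(tmp)
      kvGlobal.unpersist()        // drop executor copies of the old broadcast
      kvGlobal = sc.broadcast(kv) // broadcast the updated kv
      rdd2 = rdd1
    }
    rdd2.collect()
  }
}
```

Note that unpersist only removes the copies held on the executors. If a cached
partition of rdd1 is later evicted and has to be recomputed, the old broadcast
is sent to the executors again, so the results should stay the same.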

Alternatively, it sounds like your algorithm needs some additional state to
join against to produce each successive iteration of the RDD.  Have you
considered storing that state in an RDD rather than a broadcast variable?
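
As a rough sketch of what I mean, assuming the state can be keyed the same way
as the data: keep the state as a pair RDD and join against it on each
iteration. The sample data, the element types, and the way the state is
combined and updated here are invented for illustration and are not taken from
your code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x
import org.apache.spark.rdd.RDD

object RddStateSketch {
  def run(sc: SparkContext, n: Int): Array[(String, Int)] = {
    // Made-up sample data; `state` plays the role of kv from the quoted code.
    var data: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    var state: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 1)))

    for (_ <- 0 until n) {
      // Inner join: records whose key is missing from `state` are dropped.
      val next = data.join(state).mapValues { case (v, s) => v + s }.cache()
      state = next.reduceByKey(_ + _) // hypothetical state update, stays distributed
      data = next
    }
    data.collect()
  }
}
```

With this shape nothing is collected to the driver or re-broadcast inside the
loop, at the cost of a shuffle for the join on each iteration.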

Andrew


On Wed, May 7, 2014 at 10:02 PM, randylu <randylu26@gmail.com> wrote:

> But when I also put the broadcast variable outside the for loop, it works
> well (if not concerned about the memory issue you pointed out):
>  1  var rdd1 = ...
>  2  var rdd2 = ...
>  3  var kv = ...
>  4  var kvGlobal = sc.broadcast(kv)               // broadcast kv
>  5  for (i <- 0 until n) {
>  6    rdd1 = rdd2.map {
>  7      case t => doSomething(t, kvGlobal.value)
>  8    }.cache()
>  9    var tmp = rdd1.reduceByKey().collect()
> 10    kv = updateKV(tmp)                               // update kv for each iteration
> 11    kvGlobal = sc.broadcast(kv)               // broadcast kv
> 12    rdd2 = rdd1
> 13 }
> 14 rdd2.saveAsTextFile()
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5497.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
