spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: Streaming checkpoints and logic change
Date Wed, 08 Jul 2015 17:13:05 GMT
You can use DStream.transform for this kind of thing. transform takes an RDD => RDD
function that allows arbitrary RDD operations to be done on the RDDs of a
DStream. This function gets evaluated on the driver at every batch
interval. If you are smart about writing the function, it can do different
things at different intervals. For example, you can always use a
continuously updated set of filters:

dstream.transform { rdd =>
  // this block runs on the driver at every batch interval
  val broadcastedFilters = Filters.getLatest()
  rdd.filter { x => broadcastedFilters.value.filter(x) }
}


The function Filters.getLatest() returns the latest set of filters, wrapped in a
broadcast variable, and since the transform function is evaluated at every
batch interval, each batch will always use the latest filters.

HTH.

TD

On Wed, Jul 8, 2015 at 10:02 AM, Jong Wook Kim <jongwook@nyu.edu> wrote:

> I just asked this question at the streaming webinar that just ended, but
> the speakers didn't answer it, so I'm throwing it out here:
>
> AFAIK checkpoints are the only recommended method for running Spark
> Streaming without data loss. But checkpointing involves serializing the entire
> DStream graph, which prohibits any logic changes. How should I update or fix the
> logic of a running streaming app without any data loss?
>
> Jong Wook
>
