spark-user mailing list archives

From algermissen1971 <algermissen1...@icloud.com>
Subject Re: Sessionization using updateStateByKey
Date Wed, 15 Jul 2015 13:54:46 GMT
Hi Cody,

oh ... I thought that was one of *the* use cases for it. Do you have a suggestion / best practice
how to achieve the same thing with better scaling characteristics?

Jan

On 15 Jul 2015, at 15:33, Cody Koeninger <cody@koeninger.org> wrote:

> I personally would try to avoid updateStateByKey for sessionization when you have long
> sessions / a lot of keys, because it's linear in the number of keys.
> 
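To illustrate the scaling concern above: on each batch, updateStateByKey applies the update function to every key currently held in state, not only to keys that received new data. The following is a minimal pure-Python simulation of that behaviour, not Spark code. The names run_batch and count_events and the user-N keys are hypothetical and exist only for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_batch(
    state: Dict[str, int],
    new_data: Dict[str, List[str]],
    update: Callable[[List[str], Optional[int]], Optional[int]],
) -> Tuple[Dict[str, int], int]:
    """Simulate one micro-batch of a stateful update (illustration only).

    The update function is called once per key in the union of existing
    state and new data, so the work grows with the total number of keys.
    """
    calls = 0
    next_state: Dict[str, int] = {}
    for key in set(state) | set(new_data):
        calls += 1
        result = update(new_data.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from state
            next_state[key] = result
    return next_state, calls


def count_events(new_values: List[str], running: Optional[int]) -> Optional[int]:
    """Running event count per key (hypothetical update function)."""
    return (running or 0) + len(new_values)


# 10,000 keys already in state, but only two of them receive new events.
state = {f"user-{i}": 1 for i in range(10_000)}
new_data = {"user-0": ["click"], "user-1": ["click", "view"]}

state, calls = run_batch(state, new_data, count_events)
print(calls)  # one call per key in state, not one per active key
```

In this simulation the number of update calls tracks the size of the state rather than the size of the incoming batch, which is why long-lived sessions over many keys become expensive.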
> On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das <tdas@databricks.com> wrote:
> [Apologies for the repost, for those who have already seen this response on the dev mailing
> list]
> 
> 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves
> the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing.
> In fact, a streaming app that uses updateStateByKey will not start until you set a checkpoint
> directory.
> 
> 2. The performance of updateStateByKey is largely independent of the source that is being
> used, whether receiver-based or direct Kafka. The absolute performance obviously depends
> on a LOT of variables: size of the cluster, parallelization, etc. The key thing is that you
> must ensure sufficient parallelization at every stage: receiving, shuffles (updateStateByKey
> included), and output.
> 
> Some more discussion in my talk - https://www.youtube.com/watch?v=d5UJonrruHk
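The two points above can be sketched as follows. This is a minimal example of a sessionization update function in the shape PySpark's updateStateByKey expects, taking (new_values, previous_state) and returning the new state or None. It is not code from this thread. The Session layout, MAX_IDLE_BATCHES, and update_session are assumptions made for illustration, and the Spark wiring appears only in comments because it needs a running Spark installation.

```python
from typing import List, Optional, Tuple

# Hypothetical session state: (first_event_ts, last_event_ts, event_count, idle_batches)
Session = Tuple[float, float, int, int]

# Assumed policy: close a session after this many consecutive batches without events.
MAX_IDLE_BATCHES = 3


def update_session(new_events: List[float], state: Optional[Session]) -> Optional[Session]:
    """Merge this batch's event timestamps into the session for one key."""
    if not new_events:
        if state is None:
            return None
        first, last, count, idle = state
        idle += 1
        # Returning None removes the key, which keeps the state from growing forever.
        return None if idle >= MAX_IDLE_BATCHES else (first, last, count, idle)
    if state is None:
        return (min(new_events), max(new_events), len(new_events), 0)
    first, last, count, _ = state
    return (
        min(first, min(new_events)),
        max(last, max(new_events)),
        count + len(new_events),
        0,  # activity resets the idle counter
    )


# Wiring into PySpark Streaming (not executed here; names other than the
# Spark methods are placeholders):
#   ssc.checkpoint(checkpoint_dir)   # must be set before updateStateByKey is used
#   sessions = events_by_user.updateStateByKey(update_session,
#                                              numPartitions=num_partitions)

print(update_session([10.0, 12.0], None))
```

Dropping idle sessions by returning None bounds the number of keys the state has to carry, and the numPartitions argument is one place to control the parallelism of the state shuffle mentioned in point 2.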
> 
> 
> 
> On Tue, Jul 14, 2015 at 4:13 PM, swetha <swethakasireddy@gmail.com> wrote:
> 
> Hi,
> 
> I have a question regarding sessionization using updateStateByKey. If near
> real time state needs to be maintained in a Streaming application, what
> happens when the number of RDDs to maintain the state becomes very large?
> Does it automatically get saved to HDFS and reload when needed or do I have
> to use any code like ssc.checkpoint(checkpointDir)?  Also, how is the
> performance if I use both DStream Checkpointing for maintaining the state
> and use Kafka Direct approach for exactly once semantics?
> 
> 
> Thanks,
> Swetha
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sessionization-using-updateStateByKey-tp23838.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 
> 
> 

