spark-issues mailing list archives

From "Jack Hu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
Date Wed, 15 Apr 2015 09:23:59 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495914#comment-14495914
] 

Jack Hu commented on SPARK-6847:
--------------------------------

I did a little more investigation into this issue. It appears to be a problem with operations that must be check-pointed ({{updateStateByKey}}, {{reduceByKeyAndWindow}} with an inverse-reduce function) when they are followed by an operation with checkpointing set (either added manually, as in the code in this JIRA description, or another operation that must be check-pointed) and the checkpoint intervals of the two operations are the same (or the downstream operation has a checkpoint interval equal to the batch interval).
The following code will have this issue (assume the batch interval is 2 seconds and the default checkpoint interval is 10 seconds):
# {{source.updateStateByKey(func).map(f).checkpoint(10 seconds)}} 
# {{source.updateStateByKey(func).map(f).updateStateByKey(func2)}}
# {{source.updateStateByKey(func).map(f).checkpoint(2 seconds)}} 

These DO NOT have this issue:
# {{source.updateStateByKey(func).map(f).checkpoint(4 seconds)}} 
# {{source.updateStateByKey(func).map(f).updateStateByKey(func2).checkpoint(4 seconds)}}

Each of these samples generates an RDD graph that contains two RDDs which need to be check-pointed.

If the child RDD(s) need to do their checkpoint at the same time as the parent does, then the parent will not be check-pointed, according to the logic in {{rdd.doCheckpoint}}. In that case the RDD produced by {{updateStateByKey}} is never check-pointed in the issue's sample code, which leads to the stack overflow ({{updateStateByKey}} needs the checkpoint to break the dependency chain of this operation).

If the child RDD(s) are not always check-pointed at the same time as the parent needs to be, there is a chance for the parent RDD (the one from {{updateStateByKey}}) to complete a checkpoint and break the dependency chain, although the checkpoint may be delayed. So no stack overflow happens.
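To make the skip rule concrete, here is a minimal self-contained Scala sketch (not actual Spark source; the class and field names are illustrative) of the behaviour described above: the checkpoint walk stops at the first RDD that is itself due for checkpointing and never visits its ancestors, so a marked parent is starved whenever its child is due in the same batch.

```scala
// Hypothetical model of an RDD lineage. `marked` means "checkpoint due in
// this batch". Mirrors the rule described above: doCheckpoint stops at the
// first marked node and does not recurse past it to its ancestors.
class Node(val name: String, val parent: Option[Node], val marked: Boolean) {
  var checkpointed = false
  def doCheckpoint(): Unit =
    if (marked) checkpointed = true         // checkpoint self, stop here
    else parent.foreach(_.doCheckpoint())   // otherwise walk up the lineage
}

val state  = new Node("updateStateByKey", None, marked = true)
val mapped = new Node("map.checkpoint(10s)", Some(state), marked = true)

mapped.doCheckpoint()
println(s"child checkpointed:  ${mapped.checkpointed}")  // true
println(s"parent checkpointed: ${state.checkpointed}")   // false: starved, lineage keeps growing
```

With equal intervals both nodes are marked in every checkpointing batch, so the walk always terminates at the child and the parent's lineage grows without bound.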

So, currently, we work around this issue in our streaming project by setting the checkpoint intervals to different values whenever we use operations that must be check-pointed. This may not be an easy fix; at the very least it would be good to add some validation.
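The workaround can be sanity-checked with a tiny timing simulation (plain Scala, illustrative only, under the assumptions above): batches fire every 2 seconds, a node is "due" when the batch time is a multiple of its checkpoint interval, and per the skip rule the parent is only check-pointed in batches where the child is not itself due.

```scala
// Illustrative timing model of the skip rule. Returns the batch times at
// which the parent actually gets check-pointed: times that are a multiple
// of the parent's interval AND not a multiple of the child's interval
// (when the child is also due, it wins and the parent is skipped).
def parentCheckpointTimes(parentIv: Int, childIv: Int, batches: Int): Seq[Int] =
  (1 to batches).map(_ * 2).filter(t => t % parentIv == 0 && t % childIv != 0)

// Both at 10s: the parent is never check-pointed -> stack overflow.
println(parentCheckpointTimes(10, 10, 50))  // Vector()
// Child at 4s: the parent still gets through, just with some delay.
println(parentCheckpointTimes(10, 4, 50))   // Vector(10, 30, 50, 70, 90)
```

This matches the observation above: equal intervals starve the parent forever, while different intervals only delay some of its checkpoints.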

> Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-6847
>                 URL: https://issues.apache.org/jira/browse/SPARK-6847
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: Jack Hu
>              Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code, which uses {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 seconds:
> {code}
>     val sparkConf = new SparkConf().setAppName("test")
>     val streamingContext = new StreamingContext(sparkConf, Seconds(10))
>     streamingContext.checkpoint("""checkpoint""")
>     val source = streamingContext.socketTextStream("localhost", 9999)
>     val updatedResult = source.map((1, _)).updateStateByKey(
>         (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
>     updatedResult.map(_._2)
>     .checkpoint(Seconds(10))
>     .foreachRDD((rdd, t) => {
>       println("Deep: " + rdd.toDebugString.split("\n").length)
>       println(t.toString() + ": " + rdd.collect.length)
>     })
>     streamingContext.start()
>     streamingContext.awaitTermination()
> {code}
> From the output, we can see that the lineage depth increases over time, the {{updateStateByKey}} result never gets check-pointed, and finally the stack overflow happens.

> Note:
> * The RDD in {{updatedResult.map(_._2)}} gets check-pointed in this case, but the {{updateStateByKey}} RDD does not
> * If the {{checkpoint(Seconds(10))}} is removed from the map result ({{updatedResult.map(_._2)}}), the stack overflow does not happen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

