spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?
Date Sun, 26 Jan 2014 07:58:44 GMT
I looked into this after I opened that JIRA, and it’s actually a bit harder to fix. While changing these visit() calls to use a stack manually instead of being recursive helps avoid a StackOverflowError there, you still get a StackOverflowError when you send the task to a worker node, because Java serialization uses recursion. The only real fix with the current codebase is therefore to increase your JVM stack size. Longer-term, I’d like us to automatically call checkpoint() to break lineage graphs when they exceed a certain size, which would avoid the problems in both DAGScheduler and Java serialization. We could also add this to ALS manually now, even without a general solution for other programs. That would be a great change to make to fix this JIRA.

Matei
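
As a rough illustration of what such periodic checkpointing inside an iterative job might look like (the helper name and the interval of 10 are assumptions for this sketch, not actual MLlib code):

import org.apache.spark.rdd.RDD

// Hypothetical helper: run `step` repeatedly and checkpoint every `interval`
// iterations so that neither the DAGScheduler nor Java serialization has to
// walk an arbitrarily long lineage chain.
def runWithPeriodicCheckpoint[T](initial: RDD[T], iterations: Int, interval: Int = 10)
                                (step: RDD[T] => RDD[T]): RDD[T] = {
  // Assumes sc.setCheckpointDir(...) has already been called.
  var rdd = initial
  for (i <- 1 to iterations) {
    rdd = step(rdd)
    if (i % interval == 0) {
      rdd.cache()       // keep the data so the checkpoint write doesn't recompute the chain
      rdd.checkpoint()  // lineage is truncated once this RDD is materialized
      rdd.count()       // force materialization so the checkpoint happens now
    }
  }
  rdd
}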

On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <me@ewencp.org> wrote:

> The three obvious ones in DAGScheduler.scala are in:
> 
> getParentStages
> getMissingParentStages
> stageDependsOn
> 
> They all follow the same pattern though (a def visit() followed by visit(root)), so they should be easy to rewrite with an explicit Scala stack in place of the call stack.
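> 
> A rough sketch of that rewrite (Node, children, and process are placeholder names for this sketch, not the actual DAGScheduler types):
> 
> import scala.collection.mutable
> 
> // Iterative depth-first visit: an explicit stack replaces the recursive
> // visit() calls, so graph depth no longer consumes JVM stack frames.
> def visitIteratively[Node](root: Node)(children: Node => Seq[Node])
>                           (process: Node => Unit): Unit = {
>   val visited = new mutable.HashSet[Node]
>   val toVisit = new mutable.Stack[Node]
>   toVisit.push(root)
>   while (toVisit.nonEmpty) {
>     val node = toVisit.pop()
>     if (!visited(node)) {
>       visited += node
>       process(node)
>       children(node).foreach(toVisit.push)
>     }
>   }
> }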
> 
>> 	Shao, Saisai	January 25, 2014 at 10:52 PM
>> In my test I found this phenomenon might be caused by the RDD's long dependency chain: the chain is serialized into the task and sent to each executor, and deserializing the task causes a stack overflow.
>> 
>> Especially in an iterative job, like:
>> 
>> var rdd = ..
>> 
>> for (i <- 0 to 100)
>>   rdd = rdd.map(x => x)
>> 
>> rdd = rdd.cache
>> 
>> Here the rdd's dependencies are chained together, and at some point a stack overflow will occur.
>> 
>> You can check (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ) and (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ) for details. The current workaround is to cut the dependency chain by checkpointing the RDD; a better way might be to clean up the dependency chain after a stage is materialized.
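>> 
>> For illustration (sc, the checkpoint directory, and the interval of 10 are assumed here, not taken from the original report), applying that workaround to the loop above would look like:
>> 
>> sc.setCheckpointDir("/tmp/spark-checkpoints")  // on a cluster this would normally be an HDFS path
>> var rdd = sc.parallelize(1 to 1000)
>> for (i <- 1 to 100) {
>>   rdd = rdd.map(x => x)
>>   if (i % 10 == 0) {
>>     rdd.cache()       // keep the data so checkpointing doesn't recompute the whole chain
>>     rdd.checkpoint()  // the lineage is truncated once the RDD is materialized
>>     rdd.count()       // force materialization
>>   }
>> }
>> println(rdd.toDebugString)  // the printed lineage stays short instead of growing with every map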
>> 
>> Thanks
>> Jerry
>> 
>> 	Reynold Xin	January 25, 2014 at 10:03 PM
>> I'm not entirely sure, but two candidates are
>> 
>> the visit function in stageDependsOn
>> 
>> submitStage
>> 
>> 
>> 	Aaron Davidson	January 25, 2014 at 10:01 PM
>> I'm an idiot, but which part of the DAGScheduler is recursive here? Seems
>> like processEvent shouldn't have inherently recursive properties.
>> 
>> 
>> 
>> 	Reynold Xin	January 25, 2014 at 9:57 PM
>> It seems to me fixing DAGScheduler to make it not recursive is the better
>> solution here, given the cost of checkpointing.
>> 
>> 
>> 	Xia, Junluan	January 25, 2014 at 9:49 PM
>> Hi all
>> 
>> The description of this bug, as submitted by Matei, is as follows:
>> 
>> 
>> The tipping point seems to be around 50. We should fix this by checkpointing the RDDs every 10-20 iterations to break the lineage chain, but checkpointing currently requires HDFS installed, which not all users will have.
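>> 
>> A minimal sketch of the setup that requirement refers to (the HDFS URI here is an assumption):
>> 
>> // Checkpointing only takes effect after a checkpoint directory is set, which
>> // is why the description notes the dependency on an installed HDFS.
>> sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints")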
>> 
>> We might also be able to fix DAGScheduler to not be recursive.
>> 
>> 
>> regards,
>> Andrew
>> 
>> 

