spark-dev mailing list archives

From Qiuzhuang Lian <qiuzhuang.l...@gmail.com>
Subject Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?
Date Tue, 28 Jan 2014 09:15:42 GMT
I see this error thrown from Executor.scala:

task = ser.deserialize[Task[Any]](taskBytes,
Thread.currentThread.getContextClassLoader)

Any suggestions on how to break the task down into smaller chunks to avoid this?

Thanks,
Qiuzhuang




On Sun, Jan 26, 2014 at 2:52 PM, Shao, Saisai <saisai.shao@intel.com> wrote:

> In my tests I found that this might be caused by the RDD's long
> dependency chain. The chain is serialized into each task and sent to
> the executors, and deserializing the task there causes the stack overflow.
>
> This shows up especially in iterative jobs, like:
> var rdd = ..
>
> for (i <- 0 to 100)
>  rdd = rdd.map(x=>x)
>
> rdd = rdd.cache
>
> Here rdd's dependencies form an ever longer chain, and at some point a
> stack overflow will occur.
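
To illustrate the mechanism described above, here is a minimal sketch in plain Scala, without Spark. It assumes that the failure comes from Java serialization walking a chain of parent references recursively. The Link class and the LineageDepthDemo object are hypothetical stand-ins for an RDD and its lineage, not Spark code. Note that in this sketch the overflow may already happen while serializing, whereas in the report above it happened while deserializing on the executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an RDD: each link only remembers its parent,
// the way each map() result remembers the RDD it was derived from.
class Link(val parent: Link) extends Serializable

object LineageDepthDemo {
  // Build a chain of `depth` links with a loop, so construction never recurses.
  def buildChain(depth: Int): Link = {
    var link: Link = null
    var i = 0
    while (i < depth) { link = new Link(link); i += 1 }
    link
  }

  // Count the links with a loop.
  def length(link: Link): Int = {
    var n = 0
    var cur = link
    while (cur != null) { n += 1; cur = cur.parent }
    n
  }

  // Serialize and deserialize with plain Java serialization.
  // Returns Some(copy) on success, None if the JVM stack overflowed on the way.
  def roundTrip(link: Link): Option[Link] =
    try {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(link)
      out.close()
      val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      try Some(in.readObject().asInstanceOf[Link]) finally in.close()
    } catch {
      // Java serialization follows the parent field recursively,
      // so a long enough chain exhausts the thread's stack.
      case _: StackOverflowError => None
    }

  def main(args: Array[String]): Unit = {
    println(roundTrip(buildChain(10)).map(length))   // short chain
    println(roundTrip(buildChain(1000000)).isEmpty)  // very long chain
  }
}
```

The depth at which a chain stops fitting depends on the thread stack size (for example the JVM's -Xss setting), so the depths used here are only examples.
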
>
> You can check (
> https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
> and (
> https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
> for details. The current workaround is to cut the dependency chain by
> checkpointing the RDD. A better way might be to clean up the dependency
> chain after the stage that materializes the RDD has executed.
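
The checkpointing workaround mentioned above could look roughly like the sketch below. It needs Spark on the classpath. The application name, the local master, the checkpoint directory /tmp/spark-checkpoints and the interval of 10 are all assumptions made for illustration. On a cluster the checkpoint directory has to be on storage that every executor can reach, such as HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointEveryN {
  def main(args: Array[String]): Unit = {
    // Assumed settings for a local test run.
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

    val checkpointInterval = 10 // assumption: cut the lineage every 10 iterations

    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(x => x)
      if (i % checkpointInterval == 0) {
        rdd.cache()      // keep the data so the checkpoint job does not recompute it
        rdd.checkpoint() // mark for checkpointing; must happen before the first action
        rdd.count()      // an action runs the job and writes the checkpoint
      }
    }

    println(rdd.count())
    sc.stop()
  }
}
```

Once the checkpoint has been written, the RDD no longer refers to its parents, so the chain shipped with each task stays at most checkpointInterval steps long. The price is the extra write to the checkpoint directory in each round.
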
>
> Thanks
> Jerry
>
> -----Original Message-----
> From: Reynold Xin [mailto:rxin@databricks.com]
> Sent: Sunday, January 26, 2014 2:04 PM
> To: dev@spark.incubator.apache.org
> Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow
> with too many iterations"?
>
> I'm not entirely sure, but two candidates are
>
> the visit function in stageDependsOn
>
> submitStage
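
As an illustration of what a non-recursive visit could look like, here is a minimal sketch in plain Scala. It is not the actual DAGScheduler code. The Stage class and the IterativeVisit object are hypothetical and heavily simplified. The idea is to keep the stages still to be visited in an explicit buffer on the heap instead of on the call stack.

```scala
import scala.collection.mutable

// Hypothetical, simplified stage: an id plus the stages it depends on.
final class Stage(val id: Int, val parents: List[Stage])

object IterativeVisit {
  // Returns true if `target` is reachable from `stage` through parent links.
  // The pending buffer plays the role of the call stack of a recursive visit,
  // so the depth of the dependency chain is limited by heap size only.
  def dependsOn(stage: Stage, target: Stage): Boolean = {
    val visited = mutable.HashSet[Stage]()
    val pending = mutable.ArrayBuffer[Stage](stage)
    while (pending.nonEmpty) {
      val s = pending.remove(pending.length - 1) // pop the most recently added stage
      if (visited.add(s)) {                      // add returns false if already seen
        pending ++= s.parents
      }
    }
    visited.contains(target)
  }

  // Build a linear chain of `depth` stages and return (root, top).
  def chain(depth: Int): (Stage, Stage) = {
    val root = new Stage(0, Nil)
    var top = root
    var i = 1
    while (i < depth) { top = new Stage(i, List(top)); i += 1 }
    (root, top)
  }

  def main(args: Array[String]): Unit = {
    val (root, top) = chain(200000)
    println(dependsOn(top, root))
    println(dependsOn(root, top))
  }
}
```
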
>
>
>
>
>
>
> On Sat, Jan 25, 2014 at 10:01 PM, Aaron Davidson <ilikerps@gmail.com>
> wrote:
>
> > I'm an idiot, but which part of the DAGScheduler is recursive here?
> > Seems like processEvent shouldn't have inherently recursive properties.
> >
> >
> > On Sat, Jan 25, 2014 at 9:57 PM, Reynold Xin <rxin@databricks.com>
> > wrote:
> >
> > > It seems to me fixing DAGScheduler to make it not recursive is the
> > > better solution here, given the cost of checkpointing.
> > >
> > > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan
> > > <junluan.xia@intel.com>
> > > wrote:
> > >
> > > > Hi all
> > > >
> > > > The description of this bug, submitted by Matei, is as follows:
> > > >
> > > >
> > > > The tipping point seems to be around 50. We should fix this by
> > > > checkpointing the RDDs every 10-20 iterations to break the lineage
> > > > chain, but checkpointing currently requires HDFS installed, which
> > > > not all users will have.
> > > >
> > > > We might also be able to fix DAGScheduler to not be recursive.
> > > >
> > > >
> > > > regards,
> > > > Andrew
> > > >
> > > >
> > >
> >
>
