spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-6075) After SPARK-3885, some tasks' accumulator updates may be lost
Date Sun, 01 Mar 2015 06:54:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-6075.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.0

Issue resolved by pull request 4835
[https://github.com/apache/spark/pull/4835]

> After SPARK-3885, some tasks' accumulator updates may be lost
> -------------------------------------------------------------
>
>                 Key: SPARK-6075
>                 URL: https://issues.apache.org/jira/browse/SPARK-6075
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Tests
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Blocker
>             Fix For: 1.4.0
>
>
> It looks like some of the AccumulatorSuite tests have started failing nondeterministically
> on Jenkins.  The errors seem to be due to lost / missing accumulator updates, e.g.
> {code}
> Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
> {code}
> This could somehow be related to SPARK-3885 / https://github.com/apache/spark/pull/4021,
> a patch to garbage-collect accumulators, which was only merged into master.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/
> I think I've figured it out: consider the lifecycle of an accumulator in a task, say
> ShuffleMapTask. On the executor, each task deserializes its own copy of the RDD inside
> its runTask method, so the strong reference to the RDD disappears at the end of runTask. In
> Executor.run(), we call Accumulators.values after runTask has exited, so there's a small window
> in which the task's RDD can be GC'd, causing its accumulators to be GC'd as well because there
> are no longer any strong references to them.
> The fix is to keep strong references in localAccums, since we clear this at the end of
> each task anyway. I'm glad that I was able to figure out precisely why this was necessary
> and sorry that I missed this during review; I'll submit a fix shortly. In terms of preventative
> measures, it might be a good idea to write up the lifetime / lifecycle of objects' strong
> references whenever we're using WeakReferences, since the process of explicitly writing that
> out would help prevent these sorts of mistakes in the future.
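
The lifecycle described above can be sketched in plain Java. This is an illustration only, not Spark's actual code: the class AccumulatorRegistrySketch, the Accum type, and the methods register, values and clear are hypothetical names chosen here. The only names taken from the message are localAccums and runTask. The sketch assumes a long-lived registry that holds weak references plus a per-task map that holds strong references until the task's updates have been read.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

class AccumulatorRegistrySketch {
    /** Hypothetical stand-in for an accumulator: an id and a running value. */
    static final class Accum {
        final long id;
        long value;
        Accum(long id) { this.id = id; }
        void add(long v) { value += v; }
    }

    // Long-lived registry: weak references, so unused accumulators can be collected.
    private final Map<Long, WeakReference<Accum>> originals = new HashMap<>();

    // Per-task registry: strong references, cleared explicitly when the task ends.
    // Holding strong references here keeps the task's accumulators reachable
    // in the window between runTask returning and the updates being read.
    private final Map<Long, Accum> localAccums = new HashMap<>();

    void register(Accum a) {
        originals.put(a.id, new WeakReference<>(a));
        localAccums.put(a.id, a);
    }

    /** Snapshot of the current task's accumulator updates, keyed by id. */
    Map<Long, Long> values() {
        Map<Long, Long> out = new HashMap<>();
        for (Accum a : localAccums.values()) {
            out.put(a.id, a.value);
        }
        return out;
    }

    /** Drop the per-task strong references once the updates have been read. */
    void clear() {
        localAccums.clear();
    }

    // Stand-in for a task body: the local variable 'acc' is the task's own
    // strong reference, and it disappears when this method returns.
    static void runTask(AccumulatorRegistrySketch registry) {
        Accum acc = new Accum(901L);
        registry.register(acc);
        acc.add(1L);
    }

    static Map<Long, Long> runTaskThenCollect() {
        AccumulatorRegistrySketch registry = new AccumulatorRegistrySketch();
        runTask(registry);
        // Only a hint to the JVM; it stands in for a GC that happens to run
        // after runTask has exited but before the updates are collected.
        System.gc();
        Map<Long, Long> updates = registry.values();
        registry.clear();
        return updates;
    }

    public static void main(String[] args) {
        System.out.println(runTaskThenCollect()); // prints {901=1}
    }
}
```

Because localAccums holds a strong reference, the update for id 901 is still readable whether or not a collection ran in between. If values() instead read through the weak references in originals, the same call could intermittently miss an entry, which matches the nondeterministic test failure reported above.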



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

