spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matei Zaharia (JIRA)" <>
Subject [jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
Date Thu, 27 Nov 2014 01:33:13 GMT


Matei Zaharia commented on SPARK-732:

As discussed on this is pretty hard to provide good
semantics for in the general case (accumulator updates inside non-result stages), for the
following reasons:

- An RDD may be computed as part of multiple stages. For example, if you update an accumulator
inside a MappedRDD and then shuffle it, that might be one stage. But if you then call map()
again on the MappedRDD, and shuffle the result of that, you get a second stage where that
map is pipeline. Do you want to count this accumulator update twice or not?
- Entire stages may be resubmitted if shuffle files are deleted by the periodic cleaner or
are lost due to a node failure, so anything that tracks RDDs would need to do so for long
periods of time (as long as the RDD is referenceable in the user program), which would be
pretty complicated to implement.

So I'm going to mark this as "won't fix" for now, except for the part for result stages done
in SPARK-3628.

> Recomputation of RDDs may result in duplicated accumulator updates
> ------------------------------------------------------------------
>                 Key: SPARK-732
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 0.9.0, 1.0.1,
>            Reporter: Josh Rosen
>            Assignee: Nan Zhu
>            Priority: Blocker
> Currently, Spark doesn't guard against duplicated updates to the same accumulator due
to recomputations of an RDD.  For example:
> {code}
>     val acc = sc.accumulator(0)
> => acc += 1; f(x))
>     data.count()
>     // acc should equal data.count() here
>     data.foreach{...}
>     // Now, acc = 2 * data.count() because the map() was recomputed.
> {code}
> I think that this behavior is incorrect, especially because this behavior allows the
additon or removal of a cache() call to affect the outcome of a computation.
> There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|].
> I haven't tested whether recomputation due to blocks being dropped from the cache can
trigger duplicate accumulator updates.
> Hypothetically someone could be relying on the current behavior to implement performance
counters that track the actual number of computations performed (including recomputations).
 To be safe, we could add an explicit warning in the release notes that documents the change
in behavior when we fix this.
> Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties.  Currently,
we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate
updates at the per-transformation level.  I haven't dug too deeply into the scheduler internals,
but we might also run into problems where pipelining causes what is logically one set of accumulator
updates to show up in two different tasks (e.g. += x; ...) and
+= x; ...).count() may cause what's logically the same accumulator update to be applied from
two different contexts, complicating the detection of duplicate updates).

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message