spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-7308) Should there be multiple concurrent attempts for one stage?
Date Thu, 07 May 2015 04:28:59 GMT


Apache Spark commented on SPARK-7308:

User 'squito' has created a pull request for this issue:

> Should there be multiple concurrent attempts for one stage?
> -----------------------------------------------------------
>                 Key: SPARK-7308
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
> Currently, when there is a fetch failure, you can end up with multiple concurrent attempts
for the same stage.  Is this intended?  At best, it leads to some very confusing behavior,
and it makes it hard for the user to make sense of what is going on.  At worst, I think this
is the cause of some very strange errors we've seen from users, where stages
start executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in attempt 0, so
the stage is retried.  Attempt 1 starts.  But, tasks from attempt 0 are still running -- some
of them can also hit fetch failures after attempt 1 starts.  That will cause additional stage
attempts to get fired up.
> There is an attempt to handle this already
> but that only checks whether the **stage** is running.  It really should check whether
that **attempt** is still running, but there isn't enough info to do that.
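
To make the difference between the two checks concrete, here is a minimal toy model in Python. It is not Spark code: `Stage`, `ToyScheduler`, `on_fetch_failure`, and `simulate` are hypothetical names invented for this sketch, and it assumes each failed task can report which stage attempt it belonged to (the information the description says is currently missing).

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Toy stand-in for a scheduler stage; not Spark's actual Stage class."""
    stage_id: int
    latest_attempt: int = 0
    running: bool = True


@dataclass
class ToyScheduler:
    """Hypothetical scheduler that resubmits a stage on fetch failure."""
    stage: Stage
    check_attempt: bool
    resubmissions: list = field(default_factory=list)

    def on_fetch_failure(self, task_attempt: int) -> None:
        # Stage-level check: only asks whether the stage is running at all.
        if not self.stage.running:
            return
        # Attempt-level check (assumed extra info): ignore failures reported
        # by tasks that belong to an attempt that was already superseded.
        if self.check_attempt and task_attempt != self.stage.latest_attempt:
            return
        self.stage.latest_attempt += 1
        self.resubmissions.append(self.stage.latest_attempt)


def simulate(check_attempt: bool) -> list:
    """Replay the scenario: two tasks from attempt 0 hit fetch failures."""
    sched = ToyScheduler(Stage(stage_id=0), check_attempt=check_attempt)
    sched.on_fetch_failure(task_attempt=0)  # first failure starts attempt 1
    sched.on_fetch_failure(task_attempt=0)  # straggler from attempt 0 fails
    return sched.resubmissions
```

With only the stage-level check, the straggler from attempt 0 triggers a second resubmission even though attempt 1 is already running; with the attempt-level check, the stale failure is dropped.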
> Given the release timeline, I'm going to submit a PR to just fail fast as soon as we
detect there are multiple concurrent attempts.  Would like some feedback from others on whether
or not this is a good thing to do.  (The crazy thing is, when I reproduce this, Spark seems
to actually do the right thing despite the multiple attempts at the same stage, but I feel
like that is probably dumb luck from what I've been testing.)
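
As an illustration of what failing fast could look like, the following Python sketch tracks active attempts per stage and raises as soon as a second one starts. It is only a sketch under stated assumptions: the PR's actual mechanism is not described in this message, and `AttemptTracker`, `ConcurrentStageAttemptsError`, and their methods are hypothetical names, not Spark APIs.

```python
class ConcurrentStageAttemptsError(RuntimeError):
    """Hypothetical error raised when one stage has two live attempts."""


class AttemptTracker:
    """Toy bookkeeping of which attempts of each stage are still active."""

    def __init__(self) -> None:
        # stage_id -> set of attempt ids that still have running tasks
        self._active = {}

    def attempt_started(self, stage_id: int, attempt_id: int) -> None:
        active = self._active.setdefault(stage_id, set())
        if active:
            # Fail fast instead of letting two attempts run side by side.
            raise ConcurrentStageAttemptsError(
                f"stage {stage_id}: attempt {attempt_id} started while "
                f"attempts {sorted(active)} are still active"
            )
        active.add(attempt_id)

    def attempt_finished(self, stage_id: int, attempt_id: int) -> None:
        # Removing an unknown attempt is a no-op.
        self._active.get(stage_id, set()).discard(attempt_id)
```

Starting attempt 1 of a stage while attempt 0 is still registered raises the error; once attempt 0 is marked finished, attempt 1 can start normally.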
> I'll also post some info on how to reproduce this.  Finally, if there really shouldn't
be multiple concurrent attempts, then we can open another ticket for the proper fix (as opposed
to just failing fast) after the 1.4 release.
