spark-reviews mailing list archives

From JoshRosen <...@git.apache.org>
Subject [GitHub] spark pull request #14871: [SPARK-17304] Fix perf. issue caused by TaskSetMa...
Date Tue, 30 Aug 2016 00:13:31 GMT
GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/14871

    [SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted

    This patch addresses a minor scheduler performance issue that was introduced in #13603.
If you run
    
    ```
    sc.parallelize(1 to 100000, 100000).map(identity).count()
    ```
    
    then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:
    
    ![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)
    
    When processing resource offers, the scheduler uses a nested loop which considers every
task set at multiple locality levels:
    
    ```scala
       for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
          do {
            launchedTask = resourceOfferSingleTaskSet(
                taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
          } while (launchedTask)
        }
    ```
    
    In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a
`taskSet.abortIfCompletelyBlacklisted` call inside `resourceOfferSingleTaskSet`: if a call
to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted`
checks whether the tasks are completely blacklisted in order to figure out whether they will
ever be schedulable. The problem with this placement is that the final iteration of the
`while` loop always returns `false` (that is how the loop terminates), so almost every call
to `resourceOffers` triggers the `abortIfCompletelyBlacklisted` check for every task set.
    
    Instead, I think that this call should be moved out of the innermost loop so that it runs
_at most_ once per task set, and only when none of the task set's tasks could be scheduled
at any locality level.
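    
    To illustrate, the restructured loop could look roughly like this (a simplified sketch
of the idea, not the actual patch; `launchedAnyTask`, the argument to
`abortIfCompletelyBlacklisted`, and the exact placement in `TaskSchedulerImpl` are
illustrative assumptions):
    
    ```scala
       // Sketch: track whether anything was launched for this task set across
       // all locality levels, and only run the (potentially expensive) blacklist
       // check once, after the locality loop, if nothing could be scheduled.
       for (taskSet <- sortedTaskSets) {
         var launchedAnyTask = false  // hypothetical flag, not in the original loop
         for (maxLocality <- taskSet.myLocalityLevels) {
           var launchedTask = false
           do {
             launchedTask = resourceOfferSingleTaskSet(
                 taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
             launchedAnyTask ||= launchedTask
           } while (launchedTask)
         }
         if (!launchedAnyTask) {
           taskSet.abortIfCompletelyBlacklisted(hostToExecutors)  // assumed signature
         }
       }
    ```
    
    This keeps the hang-prevention guarantee from #13603 (the check still runs whenever
a task set makes no progress in a round of offers) while dropping the per-locality-level,
per-inner-iteration invocations.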
    
    Before this patch, the microbenchmark that I posted above took 35 seconds to run; with
this change it takes 15 seconds.
    
    /cc @squito and @kayousterhout for review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark bail-early-if-no-cpus

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14871
    
----
commit f51111c203857de2ea2120b482b6f4b9ae62e108
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2016-08-29T23:18:41Z

    Bail out of loop early if no CPUs are available.

commit 5d20b445200ab23283dd9456f7bd3c765dd11d2a
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2016-08-30T00:02:04Z

    Move abort logic out of inner loop.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

