spark-reviews mailing list archives

From kayousterhout <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-7826][CORE] Suppress extra calling getC...
Date Wed, 27 May 2015 18:38:36 GMT
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6352#discussion_r31166040
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -342,6 +342,29 @@ class DAGSchedulerSuite
         assert(locs === Seq(Seq("hostA", "hostB"), Seq("hostB", "hostC"), Seq("hostC", "hostD")))
       }
     
    +  /**
    +   * +---+ shuffle +---+    +---+    +---+
    +   * | A |<--------| B |<---| C |<---| D |
    +   * +---+         +---+    +---+    +---+
    +   * Here, D has a one-to-one dependency on C. C is derived from A by performing a shuffle
    +   * and then a map. If we're trying to determine which ancestor stages need to be computed
    +   * in order to compute D, we need to figure out whether the shuffle A -> B should be
    +   * performed. If the RDD C, which has only one ancestor via a narrow dependency, is cached,
    +   * then we won't need to compute A, even if it has some unavailable output partitions. The
    +   * same goes for B: if B is 100% cached, then we can avoid the shuffle on A.
    +   */
    +  test("SPARK-7826: getMissingParentStages should consider all ancestor RDDs' cache statuses") {
    +    val rddA = new MyRDD(sc, 1, Nil)
    +    val rddB = new MyRDD(sc, 1, List(new ShuffleDependency(rddA, null)))
    +    val rddC = new MyRDD(sc, 1, List(new OneToOneDependency(rddB))).cache()
    +    val rddD = new MyRDD(sc, 1, List(new OneToOneDependency(rddC)))
    +    cacheLocations(rddC.id -> 0) =
    +      Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB"))
    +    submit(rddD, Array(0))
    +    assert(scheduler.runningStages.size === 1)
    +    assert(scheduler.runningStages.head.id === 1)
    --- End diff ---
    
    (I think this is more intuitive; otherwise, it's hard for someone looking at this to understand
    why the ID should be 1. This also makes the test more agnostic to unrelated scheduler internals,
    such as changes to the way we assign IDs to stages.)
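    
    As a rough sketch of what a more self-explanatory check could look like (this assumes the
    Stage objects in scheduler.runningStages expose their rdd field, which the suite can see
    since it lives in the scheduler package), the assertion could be tied to the submitted RDD
    instead of a hard-coded stage ID:
    
        // Hypothetical alternative (sketch only): assert on which RDD the running stage
        // computes, rather than on the numeric ID the scheduler happened to assign it.
        assert(scheduler.runningStages.size === 1)
        val runningStage = scheduler.runningStages.head
        // Only the final result stage (the one computing rddD) should be running; the
        // shuffle map stage for rddA is skipped because rddC is fully cached.
        assert(runningStage.rdd === rddD)
    
    Anchoring the assertion to rddD keeps the test meaningful even if the scheduler later changes
    how stage IDs are assigned.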



