spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
Date Sun, 08 Feb 2015 13:13:34 GMT


Sean Owen updated SPARK-5140:
    Component/s: Spark Core

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
> ---------------------------------------------------------------------------------------
>                 Key: SPARK-5140
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Corey J. Nolet
>              Labels: features
> Not sure if this would change too much of the internals to be included in 1.2.1, but
it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the result of
some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to finish caching"
logic is on a per-task basis, and it only works on tasks that happen to be executing on the
same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that partition on that
executor. The problem here, though, is that the cache is in progress, and so the tasks are
still scheduled randomly (or with whatever locality the data source has), so tasks which end
up on different machines will not see that the cache is already in progress.
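> Here was my test, by the way:
> {code}
> import scala.concurrent.ExecutionContext.Implicits.global // Future needs an implicit ExecutionContext
> import scala.concurrent._
> import scala.concurrent.duration._
> // Each of the 8 elements takes 10 seconds to compute; the RDD is marked for caching.
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
> // Launch 4 concurrent jobs against the same, not yet cached, RDD.
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.seconds)
> {code}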
> bq. Note that I run the future 4 times in parallel. I found that the first run has all
tasks take 10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest
just wait for the first stage to finish. The last two runs have no tasks that take 10 seconds;
all wait for the first two stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out that two RDDs
depend on the same parent so that when the children are scheduled concurrently, the first
one to start will activate the parent and both will wait on the parent. When the parent is
done, they will both be able to finish their work concurrently. We are trying to use this
pattern by having the parent cache results.
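
The requested behavior can be illustrated outside Spark. The following is only a minimal sketch in plain Scala, not Spark's scheduler and not a proposed patch: the class SharedParent and its members parent, child, parentRuns and run are hypothetical names, and a lazy Future stands in for the cached parent RDD. Whichever child touches the parent first starts it, and every child waits on that same computation.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for "a parent that several children depend on".
final class SharedParent {
  // Counts how many times the expensive parent computation actually runs.
  val parentRuns = new AtomicInteger(0)

  // A lazy val is initialized at most once, so the first child that touches
  // it starts the computation and every later child gets the same Future.
  lazy val parent: Future[Vector[Int]] = Future {
    parentRuns.incrementAndGet()
    Thread.sleep(200) // simulate slow work, like the sleep in the test above
    (0 until 8).toVector
  }

  // A child waits on the shared parent, then does its own work on the result.
  def child(f: Int => Int): Future[Int] = parent.map(_.map(f).sum)

  // Schedules four children and returns their results plus the number of
  // times the parent was computed.
  def run(): (Seq[Int], Int) = {
    val children = Seq(child(_ * 2), child(_ + 1), child(i => i), child(i => i * i))
    val results = Await.result(Future.sequence(children), 30.seconds)
    (results, parentRuns.get())
  }
}

object SharedParentDemo {
  def main(args: Array[String]): Unit = {
    val (results, runs) = new SharedParent().run()
    println(s"child results: $results, parent runs: $runs")
  }
}
```

In this sketch the parent is computed once no matter how many children are scheduled, which is the behavior the ticket asks the scheduler to provide for a cached parent RDD.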
