spark-reviews mailing list archives

From JoshRosen
Subject [GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...
Date Tue, 30 Dec 2014 18:28:42 GMT
GitHub user JoshRosen commented on the pull request:
    > How would this interact with the idea of @erikerlandson to defer partition computation?
    Maybe I'm overlooking something, but #3079 seems largely orthogonal.  That issue is
concerned with making the `sortByKey` transformation lazy so that it does not eagerly
trigger a Spark job to compute the range partition boundaries, whereas this pull request is
about eager vs. lazy evaluation of what's effectively a Hadoop filesystem metadata call.
    Eager vs. lazy may be the wrong way to think about this PR's issue, though, since I
guess we're more concerned with _where_ the call is performed (blocking the DAGScheduler's
event loop vs. a driver user-code thread) than with _when_ it's performed.  I suppose you
could contrive an example where this patch changes the behavior of a user job: someone
defines some transformations up-front, runs jobs to generate output, then reads it back in
another RDD.  In that case the data to be read might not exist at the time that the RDD is
defined, but will exist when the first action on it is invoked.  So maybe we should consider
moving the first `partitions` call closer to the DAGScheduler's job submission methods, but
not inside of the actor: don't change any code in `RDD`, but just add a call that traverses
the lineage chain and calls `partitions` on each RDD, making sure that this call occurs
before the job submitter sends a message into the DAGScheduler actor.
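
    To make the suggestion concrete, here is a rough sketch of the idea using a toy model
rather than Spark itself.  Everything in it (`ToyRDD`, `materialize_partitions`, `submit_job`,
the `inbox` list standing in for the DAGScheduler actor's mailbox) is hypothetical and only
illustrates the ordering: walk the lineage and force `partitions` on the caller's thread,
then hand the job to the scheduler.

```python
import threading


class ToyRDD:
    """Hypothetical stand-in for an RDD whose partitions are computed lazily, once."""

    def __init__(self, name, parents=(), list_partitions=None):
        self.name = name
        self.parents = list(parents)
        # Stand-in for an expensive metadata call (e.g. listing input splits).
        self._list_partitions = list_partitions or (lambda: [0])
        self._partitions = None
        self.computed_on = None  # name of the thread that ran the metadata call

    @property
    def partitions(self):
        if self._partitions is None:
            self._partitions = self._list_partitions()
            self.computed_on = threading.current_thread().name
        return self._partitions


def materialize_partitions(final_rdd):
    """Visit every RDD in the lineage once and force its partitions."""
    seen, stack, visited_names = set(), [final_rdd], []
    while stack:
        rdd = stack.pop()
        if id(rdd) in seen:
            continue
        seen.add(id(rdd))
        rdd.partitions  # forces the cached computation on the current thread
        visited_names.append(rdd.name)
        stack.extend(rdd.parents)
    return visited_names


def submit_job(final_rdd, scheduler_inbox):
    """Hypothetical job submitter: force partitions first, then enqueue the event."""
    materialize_partitions(final_rdd)
    scheduler_inbox.append(("JobSubmitted", final_rdd.name))


# The input is defined before its data exists, as in the contrived example above.
files = []
source = ToyRDD("source", list_partitions=lambda: list(range(len(files))))
mapped = ToyRDD("mapped", parents=[source])

files.extend(["part-00000", "part-00001"])  # an earlier job writes the output

inbox = []
submit_job(mapped, inbox)
print(source.partitions, source.computed_on, inbox)
```

    Because the partitions are forced at submission time rather than at definition time,
the toy `source` still sees the files written after it was defined, and the metadata call
runs on the submitting thread instead of whatever thread drains the inbox.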

