spark-dev mailing list archives

From: Erik Erlandson <...@redhat.com>
Subject: Re: RFC: Supporting the Scala drop Method for Spark RDDs
Date: Tue, 29 Jul 2014 23:51:02 GMT


----- Original Message -----
> Sure, drop() would be useful, but breaking the "transformations are lazy;
> only actions launch jobs" model is abhorrent -- which is not to say that we
> haven't already broken that model for useful operations (cf.
> RangePartitioner, which is used for sorted RDDs), but rather that each such
> exception to the model is a significant source of pain that can be hard to
> work with or work around.
> 
> I really wouldn't like to see another such model-breaking transformation
> added to the API.  On the other hand, being able to write transformations
> with dependencies on these kinds of "internal" jobs is sometimes very
> useful, so a significant reworking of Spark's Dependency model that would
> allow for lazily running such internal jobs and making the results
> available to subsequent stages may be something worth pursuing.


It turns out that drop can be implemented as a proper lazy transform.  I discuss how that
works here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/

I updated the PR with this lazy implementation.
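
For illustration, here is a minimal sketch in the *eager* style (not the
code from the PR) showing where the "partial action" lives: the
per-partition counting job runs as soon as drop() is invoked. The
PromiseRDD approach described in the blog post defers exactly that job
until the result RDD is first computed. The helper name and structure
below are assumptions for this sketch:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Hypothetical eager formulation of drop(n); NOT the PR's code.
    // The collect() below launches a job immediately, which is the
    // model-breaking behavior discussed in this thread.
    def eagerDrop[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] = {
      // one (partitionIndex, elementCount) pair per partition
      val counts = rdd.mapPartitionsWithIndex {
        (i, it) => Iterator((i, it.size))
      }.collect().toMap

      // offsets(i) = number of elements in partitions before partition i
      val offsets = (0 until rdd.partitions.length)
        .scanLeft(0)((acc, i) => acc + counts.getOrElse(i, 0))

      // in each partition, skip its share of the first n elements
      rdd.mapPartitionsWithIndex { (i, it) =>
        it.drop(math.max(0, math.min(n - offsets(i), counts.getOrElse(i, 0))))
      }
    }

The lazy formulation keeps the same per-partition bookkeeping, but wraps
the counting job so that it fires only when the result RDD is actually
computed.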




> 
> 
> On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <andrew@andrewash.com> wrote:
> 
> > Personally I'd find the method useful -- I've often had a .csv file with a
> > header row that I want to drop, so I filter it out, which touches all
> > partitions anyway.  I don't have any comments on the implementation quite
> > yet though.
> >
> >
> > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <eje@redhat.com> wrote:
> >
> > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > SPARK-2315:
> > > https://issues.apache.org/jira/browse/SPARK-2315
> > >
> > > Supporting the drop method would make some operations convenient; however,
> > > it forces computation of >= 1 partition of the parent RDD, so it would
> > > behave like a "partial action" that returns an RDD as the result.
> > >
> > > I wrote up a discussion of these trade-offs here:
> > >
> > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
> > >
> >
> 
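
For reference, a minimal sketch of the header-row workaround described in
the quoted thread above (assuming the header is the first line of the file
and so lands in partition 0; the input path and app name are hypothetical):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "drop-header-sketch")
    val lines = sc.textFile("data.csv")  // hypothetical input path

    // Variant 1: the filter approach; first() is a small action, and the
    // filter then examines every row in every partition. Note it also
    // removes any later row that happens to be identical to the header.
    val header = lines.first()
    val noHeader1 = lines.filter(_ != header)

    // Variant 2: skip only the first line of partition 0; the remaining
    // partitions pass through untouched.
    val noHeader2 = lines.mapPartitionsWithIndex { (i, it) =>
      if (i == 0) it.drop(1) else it
    }

Variant 2 avoids comparing every row against the header, but both still
schedule work over all partitions, which is the trade-off Andrew notes.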
