spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: RFC: Supporting the Scala drop Method for Spark RDDs
Date Tue, 22 Jul 2014 05:55:05 GMT
Yes, that could work. But it is not as simple as just a binary flag.

We might want to skip the first row for every file, or the header only for
the first file. The former is not really supported out of the box by the
input format I think?
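
To make the two cases concrete, here is a minimal sketch in plain Python. It is not Spark or Hadoop input format code: files are modeled as an in-memory dict of line lists, and the names `read_lines` and `skip_header` are hypothetical, chosen only for illustration. It shows why a boolean flag is not quite enough, since "every file" and "first file only" give different results as soon as there is more than one input file.

```python
from typing import Dict, Iterator, List


def read_lines(files: Dict[str, List[str]], skip_header: str = "none") -> Iterator[str]:
    """Yield the lines of all files, in dict insertion order.

    skip_header is a hypothetical three-way option:
      "none"       - keep every line
      "first_file" - drop line 0 of the first file only
      "every_file" - drop line 0 of each file
    """
    if skip_header not in ("none", "first_file", "every_file"):
        raise ValueError("unknown skip_header mode: " + skip_header)

    for file_index, lines in enumerate(files.values()):
        # Decide per file whether its first line is treated as a header.
        drop_first = skip_header == "every_file" or (
            skip_header == "first_file" and file_index == 0
        )
        for line_index, line in enumerate(lines):
            if drop_first and line_index == 0:
                continue
            yield line
```

With two files that each start with a header row, "every_file" returns only data rows, while "first_file" leaves the second file's header in the output.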


On Mon, Jul 21, 2014 at 10:50 PM, Sandy Ryza <sandy.ryza@cloudera.com>
wrote:

> It could make sense to add a skipHeader argument to SparkContext.textFile?
>
>
> On Mon, Jul 21, 2014 at 10:37 PM, Reynold Xin <rxin@databricks.com> wrote:
>
> > If the purpose is for dropping csv headers, perhaps we don't really need a
> > common drop and only one that drops the first line in a file? I'd really
> > try hard to avoid a common drop/dropWhile because they can be expensive to
> > do.
> >
> > Note that I think we will be adding this functionality (ignoring headers)
> > to the CsvRDD functionality in Spark SQL.
> >  https://github.com/apache/spark/pull/1351
> >
> >
> > On Mon, Jul 21, 2014 at 1:45 PM, Mark Hamstra <mark@clearstorydata.com>
> > wrote:
> >
> > > You can find some of the prior, related discussion here:
> > > https://issues.apache.org/jira/browse/SPARK-1021
> > >
> > >
> > > On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson <eje@redhat.com> wrote:
> > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > Rather than embrace non-lazy transformations and add more of them, I'd
> > > > > rather we 1) try to fully characterize the needs that are driving their
> > > > > creation/usage; and 2) design and implement new Spark abstractions that
> > > > > will allow us to meet those needs and eliminate existing non-lazy
> > > > > transformations.
> > > >
> > > >
> > > > In the case of drop, obtaining the index of the boundary partition can be
> > > > viewed as the action forcing compute -- one that happens to be invoked
> > > > inside of a transform.  The concept of a "lazy action", that is only
> > > > triggered if the result rdd has compute invoked on it, might be sufficient
> > > > to restore laziness to the drop transform.  For that matter, I might find
> > > > some way to make use of Scala lazy values directly and achieve the same
> > > > goal for drop.
> > > >
> > > >
> > > >
> > > > > They really mess up things like creation of asynchronous
> > > > > FutureActions, job cancellation and accounting of job resource usage, etc.,
> > > > > so I'd rather we seek a way out of the existing hole rather than make it
> > > > > deeper.
> > > > >
> > > > >
> > > > > On Mon, Jul 21, 2014 at 10:24 AM, Erik Erlandson <eje@redhat.com> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > Sure, drop() would be useful, but breaking the "transformations are lazy;
> > > > > > > only actions launch jobs" model is abhorrent -- which is not to say that we
> > > > > > > haven't already broken that model for useful operations (cf.
> > > > > > > RangePartitioner, which is used for sorted RDDs), but rather that each such
> > > > > > > exception to the model is a significant source of pain that can be hard to
> > > > > > > work with or work around.
> > > > > >
> > > > > > A thought that comes to my mind here is that there are in fact already two
> > > > > > categories of transform: ones that are truly lazy, and ones that are not.
> > > > > > A possible option is to embrace that, and commit to documenting the two
> > > > > > categories as such, with an obvious bias towards favoring lazy transforms
> > > > > > (to paraphrase Churchill, we're down to haggling over the price).
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > I really wouldn't like to see another such model-breaking transformation
> > > > > > > added to the API.  On the other hand, being able to write transformations
> > > > > > > with dependencies on these kinds of "internal" jobs is sometimes very
> > > > > > > useful, so a significant reworking of Spark's Dependency model that would
> > > > > > > allow for lazily running such internal jobs and making the results
> > > > > > > available to subsequent stages may be something worth pursuing.
> > > > > >
> > > > > >
> > > > > > This seems like a very interesting angle.  I don't have much feel for
> > > > > > what a solution would look like, but it sounds as if it would involve
> > > > > > caching all operations embodied by RDD transform method code for
> > > > > > provisional execution.  I believe that these levels of invocation are
> > > > > > currently executed in the master, not executor nodes.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 21, 2014 at 8:27 AM, Andrew Ash <andrew@andrewash.com> wrote:
> > > > > > >
> > > > > > > > Personally I'd find the method useful -- I've often had a .csv file with a
> > > > > > > > header row that I want to drop so filter it out, which touches all
> > > > > > > > partitions anyway.  I don't have any comments on the implementation quite
> > > > > > > > yet though.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 21, 2014 at 8:24 AM, Erik Erlandson <eje@redhat.com> wrote:
> > > > > > > >
> > > > > > > > > A few weeks ago I submitted a PR for supporting rdd.drop(n), under
> > > > > > > > > SPARK-2315:
> > > > > > > > > https://issues.apache.org/jira/browse/SPARK-2315
> > > > > > > > >
> > > > > > > > > Supporting the drop method would make some operations convenient, however
> > > > > > > > > it forces computation of >= 1 partition of the parent RDD, and so it would
> > > > > > > > > behave like a "partial action" that returns an RDD as the result.
> > > > > > > > >
> > > > > > > > > I wrote up a discussion of these trade-offs here:
> > > > > > > > > http://erikerlandson.github.io/blog/2014/07/20/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
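
For readers following the quoted thread, the "lazy action" idea for drop(n) can be sketched roughly as follows. This is a plain-Python model, not Spark code and not the SPARK-2315 implementation: partitions are modeled as zero-argument functions that return their rows, and the class name `LazyDrop` and its methods are hypothetical. The point it illustrates is that locating the boundary partition forces at least one parent partition to be computed, and that this work can be deferred until compute is invoked on the result.

```python
from typing import Callable, List, Optional, Tuple

# A partition is modeled as a thunk: calling it "computes" the partition.
Partition = Callable[[], List[str]]


class LazyDrop:
    """Model of drop(n) whose boundary search is deferred until compute()."""

    def __init__(self, partitions: List[Partition], n: int) -> None:
        self.partitions = partitions
        self.n = n
        # (boundary partition index, rows to skip inside it); not known yet.
        self._boundary: Optional[Tuple[int, int]] = None

    def _locate_boundary(self) -> Tuple[int, int]:
        # Walk partitions from the front until n rows are accounted for.
        # This is the step that forces computation of parent partitions.
        if self._boundary is None:
            remaining = self.n
            index = 0
            while index < len(self.partitions):
                size = len(self.partitions[index]())
                if remaining < size:
                    break
                remaining -= size
                index += 1
            self._boundary = (index, remaining)
        return self._boundary

    def compute(self) -> List[str]:
        # Only here is the boundary located, so constructing a LazyDrop
        # does not touch any partition.
        index, skip = self._locate_boundary()
        rows: List[str] = []
        for i, part in enumerate(self.partitions[index:], start=index):
            data = part()
            rows.extend(data[skip:] if i == index else data)
        return rows
```

Constructing `LazyDrop(parts, 1)` computes nothing; the first call to `compute()` walks partitions from the front until it finds the boundary, then returns the remaining rows. An eager variant would simply call `_locate_boundary()` in the constructor, which is the behavior the thread objects to.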
