spark-dev mailing list archives

From Punyashloka Biswal <>
Subject Re: [discuss] new Java friendly InputSource API
Date Tue, 21 Apr 2015 23:06:26 GMT
Reynold, thanks for this! At Palantir we're heavy users of the Java APIs
and appreciate being able to stop hacking around with fake ClassTags :)

Regarding this specific proposal, is the contract of RecordReader#get
intended to be that it returns a fresh object each time? Or is it allowed
to mutate a fixed object and return a reference to it each time?

Put another way, is a caller supposed to clone the output of get() if they
want to use it later?
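
To make the question concrete, here is a minimal sketch of what goes wrong if get() is allowed to reuse one object and the caller does not copy. The RecordReader shape below (fetchNext() plus get()), the Row type, and ReusingReader are all hypothetical stand-ins for illustration, not the signatures from the pull request.

```java
import java.util.ArrayList;
import java.util.List;

class GetContractDemo {

    /** Hypothetical stand-in for the proposed reader; NOT the API from the pull request. */
    interface RecordReader<T> {
        boolean fetchNext(); // advance to the next record; false when exhausted
        T get();             // return the current record
    }

    /** Mutable record type, used only for this illustration. */
    static final class Row {
        int value;
        Row(int value) { this.value = value; }
        Row copy() { return new Row(value); }
    }

    /** A reader that reuses one Row instance and overwrites it on every advance. */
    static final class ReusingReader implements RecordReader<Row> {
        private final int[] data;
        private int next = 0;
        private final Row shared = new Row(0);

        ReusingReader(int... data) { this.data = data; }

        @Override public boolean fetchNext() {
            if (next >= data.length) return false;
            shared.value = data[next++]; // mutate the shared object in place
            return true;
        }

        @Override public Row get() { return shared; }
    }

    /** Keeps the references returned by get() without copying them. */
    static List<Integer> collectWithoutCopy(RecordReader<Row> reader) {
        List<Row> kept = new ArrayList<>();
        while (reader.fetchNext()) kept.add(reader.get());
        return values(kept);
    }

    /** Copies each record before keeping it. */
    static List<Integer> collectWithCopy(RecordReader<Row> reader) {
        List<Row> kept = new ArrayList<>();
        while (reader.fetchNext()) kept.add(reader.get().copy());
        return values(kept);
    }

    private static List<Integer> values(List<Row> rows) {
        List<Integer> out = new ArrayList<>();
        for (Row r : rows) out.add(r.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutCopy(new ReusingReader(1, 2, 3))); // [3, 3, 3]
        System.out.println(collectWithCopy(new ReusingReader(1, 2, 3)));    // [1, 2, 3]
    }
}
```

Under the reuse contract every retained reference aliases the same object, so only the copying caller sees the original records. Under a fresh-object contract both callers would see [1, 2, 3], at the cost of one allocation per record.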

On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin <> wrote:

> I created a pull request last night for a new InputSource API that is
> essentially a stripped down version of the RDD API for providing data into
> Spark. Would be great to hear the community's feedback.
> Spark currently has two de facto input source APIs:
> 1. RDD
> 2. Hadoop MapReduce InputFormat
> Neither of the above is ideal:
> 1. RDD: It is hard for Java developers to implement RDD, given the implicit
> class tags. In addition, the RDD API depends on Scala's runtime library,
> which does not preserve binary compatibility across Scala versions. If a
> developer chooses Java to implement an input source, it would be great if
> that input source can be binary compatible in years to come.
> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
> For example, it forces key-value semantics, and does not support running
> arbitrary code on the driver side (an example of why this is useful is
> broadcast). In addition, it is somewhat awkward to tell developers that in
> order to implement an input source for Spark, they should learn the Hadoop
> MapReduce API first.
> My patch creates a new InputSource interface, described by:
> - an array of InputPartition that specifies the data partitioning
> - a RecordReader that specifies how data on each partition can be read
> This interface is similar to Hadoop's InputFormat, except that there is no
> explicit key/value separation.
> JIRA ticket:
> Pull request:
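
For anyone following along without the patch open, the two-part shape described in the quoted message (an array of partitions plus a per-partition reader, with no key/value split) could be sketched roughly like this. Every name and signature here (getPartitions, createRecordReader, fetchNext, RangeSource, and so on) is an assumption made for illustration; the actual interfaces in the pull request may differ.

```java
import java.util.ArrayList;
import java.util.List;

class InputSourceSketch {

    /** Hypothetical: identifies one slice of the input. */
    interface InputPartition { int index(); }

    /** Hypothetical: iterates over the records of a single partition. */
    interface RecordReader<T> {
        boolean fetchNext(); // advance; false when the partition is exhausted
        T get();             // current record (a plain T, no key/value pair)
    }

    /** Hypothetical: partitions plus a way to read each one. */
    interface InputSource<T> {
        InputPartition[] getPartitions();
        RecordReader<T> createRecordReader(InputPartition partition);
    }

    /** Toy partition covering the half-open integer range [start, end). */
    static final class RangePartition implements InputPartition {
        final int index, start, end;
        RangePartition(int index, int start, int end) {
            this.index = index; this.start = start; this.end = end;
        }
        @Override public int index() { return index; }
    }

    /** Toy source producing 0..total-1, split into contiguous slices. */
    static final class RangeSource implements InputSource<Integer> {
        private final int total, numPartitions;
        RangeSource(int total, int numPartitions) {
            this.total = total; this.numPartitions = numPartitions;
        }

        @Override public InputPartition[] getPartitions() {
            InputPartition[] parts = new InputPartition[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                int start = (int) ((long) i * total / numPartitions);
                int end = (int) ((long) (i + 1) * total / numPartitions);
                parts[i] = new RangePartition(i, start, end);
            }
            return parts;
        }

        @Override public RecordReader<Integer> createRecordReader(InputPartition partition) {
            final RangePartition p = (RangePartition) partition;
            return new RecordReader<Integer>() {
                private int current = p.start - 1;
                @Override public boolean fetchNext() {
                    if (current + 1 >= p.end) return false;
                    current++;
                    return true;
                }
                @Override public Integer get() { return current; }
            };
        }
    }

    /** Reads every partition in order, as a single-threaded stand-in for a scheduler. */
    static <T> List<T> readAll(InputSource<T> source) {
        List<T> out = new ArrayList<>();
        for (InputPartition p : source.getPartitions()) {
            RecordReader<T> reader = source.createRecordReader(p);
            while (reader.fetchNext()) out.add(reader.get());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new RangeSource(10, 3))); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The toy reader returns immutable Integer values, so it sidesteps the reuse question raised above; a reader over mutable rows would have to pick one of the two contracts.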
