spark-dev mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: renaming SchemaRDD -> DataFrame
Date Tue, 10 Feb 2015 19:47:55 GMT
so i understand the success of spark.sql. besides the fact that anything
with the word SQL in its name will have thousands of developers running
towards it because of the familiarity, there is also a genuine need for a
generic RDD that holds record-like objects, with field names and runtime
types. after all that is a successful generic abstraction used in many
structured data tools.

but to me that abstraction is as simple as:

trait SchemaRDD extends RDD[Row] {
  def schema: StructType
}

and perhaps another abstraction to indicate it intends to be column
oriented (with a few methods to efficiently extract a subset of columns).
so that could be DataFrame.
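
As a rough illustration of what such a column-oriented contract might look like, here is a minimal sketch in plain Scala with no Spark dependency. Every name in it (`Field`, `ColumnarFrame`, `InMemoryFrame`, `select`, `column`) is hypothetical and chosen for this example only; none of it is Spark API, and a real version would be partitioned and typed rather than a single in-memory map.

```scala
// Hypothetical names throughout; this is not Spark API.
final case class Field(name: String, dataType: String)

// A column-oriented contract: a schema plus cheap column-level access.
trait ColumnarFrame {
  def schema: Seq[Field]
  // Return the values of one column.
  def column(name: String): Vector[Any]
  // Extract a subset of columns without touching the data of the others.
  // The result keeps the original schema order, not the argument order.
  def select(names: String*): ColumnarFrame
}

// Toy in-memory implementation: one vector per column.
final case class InMemoryFrame(
    schema: Seq[Field],
    columns: Map[String, Vector[Any]]
) extends ColumnarFrame {
  def column(name: String): Vector[Any] = columns(name)

  def select(names: String*): ColumnarFrame = {
    val keep = names.toSet
    InMemoryFrame(
      schema.filter(f => keep(f.name)),
      columns.filter { case (k, _) => keep(k) }
    )
  }
}

object ColumnarDemo {
  val frame: ColumnarFrame = InMemoryFrame(
    Seq(Field("id", "int"), Field("name", "string"), Field("score", "double")),
    Map(
      "id" -> Vector(1, 2),
      "name" -> Vector("a", "b"),
      "score" -> Vector(0.5, 1.5)
    )
  )
  // Only the "id" and "score" vectors are carried over.
  val narrow: ColumnarFrame = frame.select("id", "score")
}
```

Because each column is stored separately, selecting a subset only filters the map of column vectors and never visits individual records.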

such simple contracts would allow many people to write loaders for this
(say from csv) and whatnot.
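
To make the point about loaders concrete, here is a minimal sketch of a CSV loader written against a contract that small. It is plain Scala with no Spark dependency: `SimpleField`, `SimpleSchema`, `SimpleRow`, `SchemaData` and `CsvLoader` are hypothetical stand-ins for `StructType`, `Row` and `RDD[Row]`, and the parsing is deliberately naive (header on the first line, every field a string, no quoting or escaping).

```scala
// Hypothetical stand-ins for Spark's StructType / Row / RDD[Row]; not Spark API.
final case class SimpleField(name: String, dataType: String)
final case class SimpleSchema(fields: Seq[SimpleField])
final case class SimpleRow(values: Seq[Any])

// The whole contract: some rows plus a schema describing them.
trait SchemaData {
  def schema: SimpleSchema
  def rows: Iterator[SimpleRow]
}

object CsvLoader {
  // Naive CSV parsing: the first line is the header, every field is treated
  // as a string, and quoting/escaping is not handled.
  def load(lines: Seq[String]): SchemaData = {
    val header = lines.head.split(",", -1).map(_.trim).toSeq
    val body = lines.tail
    new SchemaData {
      val schema: SimpleSchema =
        SimpleSchema(header.map(SimpleField(_, "string")))
      def rows: Iterator[SimpleRow] =
        body.iterator.map(line => SimpleRow(line.split(",", -1).map(_.trim).toSeq))
    }
  }
}
```

A loader only has to supply the schema and the rows; anything SQL-specific stays outside the contract.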

what i do not understand is why it has to be much more complex than this. but
if i look at DataFrame it has so much additional stuff that has (in my
eyes) nothing to do with generic structured data analysis.

for example to implement DataFrame i need to implement about 40 additional
methods!? and for some the SQLness is obviously leaking into the
abstraction. for example why would i care about:
  def registerTempTable(tableName: String): Unit


best, koert

On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.github@gmail.com> wrote:

> It is true that you can persist SchemaRdds / DataFrames to disk via
> Parquet, but a lot of time is lost to inefficiencies.   The in-memory
> columnar cached representation is completely different from the
> Parquet file format, and I believe there has to be a translation into
> a Row (because ultimately Spark SQL traverses Rows -- even the
> InMemoryColumnarTableScan has to then convert the columns into Rows
> for row-based processing).   On the other hand, traditional data
> frames process in a columnar fashion.   Columnar storage is good, but
> nowhere near as good as columnar processing.
>
> Another issue, which I don't know if it is solved yet, is that it is
> difficult for Tachyon to efficiently cache Parquet files without
> understanding the file format itself.
>
> I gave a talk at last year's Spark Summit on this topic.
>
> I'm working on efforts to change this, however.  Shoot me an email at
> velvia at gmail if you're interested in joining forces.
>
> On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
> > Yes, when a DataFrame is cached in memory, it's stored in an efficient
> > columnar format. And you can also easily persist it on disk using Parquet,
> > which is also columnar.
> >
> > Cheng
> >
> >
> > On 1/29/15 1:24 PM, Koert Kuipers wrote:
> >>
> >> to me the word DataFrame does come with certain expectations. one of them
> >> is that the data is stored columnar. in R data.frame internally uses a list
> >> of sequences i think, but since lists can have labels its more like a
> >> SortedMap[String, Array[_]]. this makes certain operations very cheap (such
> >> as adding a column).
> >>
> >> in Spark the closest thing would be a data structure where per Partition
> >> the data is also stored columnar. does spark SQL already use something like
> >> that? Evan mentioned "Spark SQL columnar compression", which sounds like
> >> it. where can i find that?
> >>
> >> thanks
> >>
> >> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.github@gmail.com>
> >> wrote:
> >>
> >>> +1.... having proper NA support is much cleaner than using null, at
> >>> least the Java null.
> >>>
> >>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.sparks@gmail.com>
> >>> wrote:
> >>>>
> >>>> You've got to be a little bit careful here. "NA" in systems like R or
> >>>> pandas may have special meaning that is distinct from "null".
> >>>>
> >>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rxin@databricks.com>
> >>>> wrote:
> >>>>>
> >>>>> Isn't that just "null" in SQL?
> >>>>>
> >>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.github@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I believe that most DataFrame implementations out there, like Pandas,
> >>>>>> supports the idea of missing values / NA, and some support the idea of
> >>>>>> Not Meaningful as well.
> >>>>>>
> >>>>>> Does Row support anything like that?  That is important for certain
> >>>>>> applications.  I thought that Row worked by being a mutable object,
> >>>>>> but haven't looked into the details in a while.
> >>>>>>
> >>>>>> -Evan
> >>>>>>
> >>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rxin@databricks.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> It shouldn't change the data source api at all because data sources
> >>>>>>> create RDD[Row], and that gets converted into a DataFrame automatically
> >>>>>>> (previously to SchemaRDD).
> >>>>>>>
> >>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >>>>>>>
> >>>>>>> One thing that will break the data source API in 1.3 is the location
> >>>>>>> of types. Types were previously defined in sql.catalyst.types, and now
> >>>>>>> moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all
> >>>>>>> public APIs have first class classes/objects defined in sql directly.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.github@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hey guys,
> >>>>>>>>
> >>>>>>>> How does this impact the data sources API?  I was planning on using
> >>>>>>>> this for a project.
> >>>>>>>>
> >>>>>>>> +1 that many things from spark-sql / DataFrame is universally
> >>>>>>>> desirable and useful.
> >>>>>>>>
> >>>>>>>> By the way, one thing that prevents the columnar compression stuff in
> >>>>>>>> Spark SQL from being more useful is, at least from previous talks with
> >>>>>>>> Reynold and Michael et al., that the format was not designed for
> >>>>>>>> persistence.
> >>>>>>>>
> >>>>>>>> I have a new project that aims to change that.  It is a
> >>>>>>>> zero-serialisation, high performance binary vector library, designed
> >>>>>>>> from the outset to be a persistent storage friendly.  May be one day
> >>>>>>>> it can replace the Spark SQL columnar compression.
> >>>>>>>>
> >>>>>>>> Michael told me this would be a lot of work, and recreates parts of
> >>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >>>>>>>>
> >>>>>>>> -Evan
> >>>>>>>>
> >>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rxin@databricks.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Alright I have merged the patch (
> >>>>>>>>> https://github.com/apache/spark/pull/4173
> >>>>>>>>> ) since I don't see any strong opinions against it (as a matter of
> >>>>>>>>> fact most were for it). We can still change it if somebody lays out a
> >>>>>>>>> strong argument.
> >>>>>>>>>
> >>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >>>>>>>>> <matei.zaharia@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The type alias means your methods can specify either type and they
> >>>>>>>>>> will work. It's just another name for the same type. But Scaladocs
> >>>>>>>>>> and such will show DataFrame as the type.
> >>>>>>>>>>
> >>>>>>>>>> Matei
> >>>>>>>>>>
> >>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Reynold,
> >>>>>>>>>>> But with type alias we will have the same problem, right?
> >>>>>>>>>>> If the methods don't receive SchemaRDD anymore, we will have to
> >>>>>>>>>>> change our code to migrate from SchemaRDD to DataFrame. Unless we
> >>>>>>>>>>> have an implicit conversion between DataFrame and SchemaRDD
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rxin@databricks.com>:
> >>>>>>>>>>>
> >>>>>>>>>>>> Dirceu,
> >>>>>>>>>>>>
> >>>>>>>>>>>> That is not possible because one cannot overload return types.
> >>>>>>>>>>>>
> >>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to return
> >>>>>>>>>>>> some type, and that type cannot be both SchemaRDD and DataFrame.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called SchemaRDD
> >>>>>>>>>>>> to not break source compatibility for Scala.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >>>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and be
> >>>>>>>>>>>>> removed in the release 1.5 (+/- 1) for example, and the new code
> >>>>>>>>>>>>> be added to DataFrame?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With this, we don't impact existing code for the next few
> >>>>>>>>>>>>> releases.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
> >>>>>>>>>>>>> <kushal.datta@gmail.com>:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I want to address the issue that Matei raised about the heavy
> >>>>>>>>>>>>>> lifting required for a full SQL support. It is amazing that even
> >>>>>>>>>>>>>> after 30 years of research there is not a single good open
> >>>>>>>>>>>>>> source columnar database like Vertica. There is a column store
> >>>>>>>>>>>>>> option in MySQL, but it is not nearly as sophisticated as
> >>>>>>>>>>>>>> Vertica or MonetDB. But there's a true need for such a system.
> >>>>>>>>>>>>>> I wonder why so and it's high time to change that.
> >>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.ryza@cloudera.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
> >>>>>>>>>>>>>>> the former slightly better because it's more descriptive.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
> >>>>>>>>>>>>>>> covers, it would be more clear from a user-facing perspective
> >>>>>>>>>>>>>>> to at least choose a package name for it that omits "sql".
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark Schema
> >>>>>>>>>>>>>>> module for Spark SQL to rely on, but I imagine that might be
> >>>>>>>>>>>>>>> too large a change at this point?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Sandy
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >>>>>>>>>>>>>>> matei.zaharia@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> (Actually when we designed Spark SQL we thought of giving it
> >>>>>>>>>>>>>>>> another name, like Spark Schema, but we decided to stick with
> >>>>>>>>>>>>>>>> SQL since that was the most obvious use case to many users.)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Matei
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >>>>>>>>>>>>>>>>> matei.zaharia@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> While it might be possible to move this concept to Spark Core
> >>>>>>>>>>>>>>>>> long-term, supporting structured data efficiently does
> >>>>>>>>>>>>>>>>> require quite a bit of the infrastructure in Spark SQL, such
> >>>>>>>>>>>>>>>>> as query planning and columnar storage. The intent of Spark
> >>>>>>>>>>>>>>>>> SQL though is to be more than a SQL server -- it's meant to
> >>>>>>>>>>>>>>>>> be a library for manipulating structured data. Since this is
> >>>>>>>>>>>>>>>>> possible to build over the core API, it's pretty natural to
> >>>>>>>>>>>>>>>>> organize it that way, same as Spark Streaming is a library.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Matei
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <koert@tresata.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common data
> >>>>>>>>>>>>>>>>>> format used for bringing data into Spark from external
> >>>>>>>>>>>>>>>>>> systems, and used for various components of Spark, e.g.
> >>>>>>>>>>>>>>>>>> MLlib's new pipeline API."
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> i agree. this to me also implies it belongs in spark core,
> >>>>>>>>>>>>>>>>>> not sql
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >>>>>>>>>>>>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> And in the off chance that anyone hasn't seen it yet, the
> >>>>>>>>>>>>>>>>>>> Jan. 13 Bay Area Spark Meetup YouTube contained a wealth of
> >>>>>>>>>>>>>>>>>>> background information on this idea (mostly from Patrick
> >>>>>>>>>>>>>>>>>>> and Reynold :-).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> ________________________________
> >>>>>>>>>>>>>>>>>>> From: Patrick Wendell <pwendell@gmail.com>
> >>>>>>>>>>>>>>>>>>> To: Reynold Xin <rxin@databricks.com>
> >>>>>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail, there
> >>>>>>>>>>>>>>>>>>> will be a 1:1 correspondence where you can get an RDD
> >>>>>>>>>>>>>>>>>>> to/from a DataFrame.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3,
> >>>>>>>>>>>>>>>>>>>> and wanted to get the community's opinion.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common data
> >>>>>>>>>>>>>>>>>>>> format used for bringing data into Spark from external
> >>>>>>>>>>>>>>>>>>>> systems, and used for various components of Spark, e.g.
> >>>>>>>>>>>>>>>>>>>> MLlib's new pipeline API. We also expect more and more
> >>>>>>>>>>>>>>>>>>>> users to be programming directly against SchemaRDD API
> >>>>>>>>>>>>>>>>>>>> rather than the core RDD API. SchemaRDD, through its less
> >>>>>>>>>>>>>>>>>>>> commonly used DSL originally designed for writing test
> >>>>>>>>>>>>>>>>>>>> cases, always has the data-frame like API. In 1.3, we are
> >>>>>>>>>>>>>>>>>>>> redesigning the API to make the API usable for end users.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> >>>>>>>>>>>>>>>>>>>> SchemaRDD.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
> >>>>>>>>>>>>>>>>>>>> anymore (even though it would contain some RDD functions
> >>>>>>>>>>>>>>>>>>>> like map, flatMap, etc), and calling it Schema*RDD* while
> >>>>>>>>>>>>>>>>>>>> it is not an RDD is highly confusing. Instead,
> >>>>>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
> >>>>>>>>>>>>>>>>>>>> methods.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> My understanding is that very few users program directly
> >>>>>>>>>>>>>>>>>>>> against the SchemaRDD API at the moment, because they are
> >>>>>>>>>>>>>>>>>>>> not well documented. However, to maintain backward
> >>>>>>>>>>>>>>>>>>>> compatibility, we can create a type alias DataFrame that
> >>>>>>>>>>>>>>>>>>>> is still named SchemaRDD. This will maintain source
> >>>>>>>>>>>>>>>>>>>> compatibility for Scala. That said, we will have to update
> >>>>>>>>>>>>>>>>>>>> all existing materials to use DataFrame rather than
> >>>>>>>>>>>>>>>>>>>> SchemaRDD.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>>>>>>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
