spark-dev mailing list archives

From Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
Subject Re: renaming SchemaRDD -> DataFrame
Date Tue, 27 Jan 2015 14:28:59 GMT
Can't SchemaRDD remain the same, but deprecated, and be removed in
release 1.5 (+/- 1), for example, with the new code being added to DataFrame?
That way, we don't impact existing code for the next few releases.
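
To make the suggestion concrete, here is a minimal sketch of that kind of deprecation path. It is plain Python, not Spark code: the two class bodies, the `rows` attribute and the `count` method are hypothetical stand-ins, and the thread itself discusses a Scala type alias rather than a subclass.

```python
import warnings


class DataFrame:
    """Hypothetical stand-in for the new API (not Spark's actual class)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def count(self):
        # New functionality is only ever added here.
        return len(self.rows)


class SchemaRDD(DataFrame):
    """Old name kept for a few releases; it only warns and delegates."""

    def __init__(self, rows):
        # stacklevel=2 points the warning at the caller's line.
        warnings.warn(
            "SchemaRDD is deprecated; use DataFrame instead",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(rows)
```

Existing code that constructs `SchemaRDD` would keep working and receive a `DeprecationWarning`, while new code targets `DataFrame` directly.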



2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.datta@gmail.com>:

> I want to address the issue that Matei raised about the heavy lifting
> required for full SQL support. It is amazing that even after 30 years of
> research there is not a single good open source columnar database like
> Vertica. There is a column store option in MySQL, but it is not nearly as
> sophisticated as Vertica or MonetDB. But there's a true need for such a
> system. I wonder why that is, and it's high time to change it.
> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.ryza@cloudera.com> wrote:
>
> > Both SchemaRDD and DataFrame sound fine to me, though I like the former
> > slightly better because it's more descriptive.
> >
> > Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
> > be more clear from a user-facing perspective to at least choose a package
> > name for it that omits "sql".
> >
> > I would also be in favor of adding a separate Spark Schema module for
> > Spark SQL to rely on, but I imagine that might be too large a change at
> > this point?
> >
> > -Sandy
> >
> > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaharia@gmail.com>
> > wrote:
> >
> > > (Actually when we designed Spark SQL we thought of giving it another
> > > name, like Spark Schema, but we decided to stick with SQL since that was
> > > the most obvious use case to many users.)
> > >
> > > Matei
> > >
> > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> > > >
> > > > While it might be possible to move this concept to Spark Core long-term,
> > > > supporting structured data efficiently does require quite a bit of the
> > > > infrastructure in Spark SQL, such as query planning and columnar storage.
> > > > The intent of Spark SQL though is to be more than a SQL server -- it's
> > > > meant to be a library for manipulating structured data. Since this is
> > > > possible to build over the core API, it's pretty natural to organize it
> > > > that way, same as Spark Streaming is a library.
> > > >
> > > > Matei
> > > >
> > > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <koert@tresata.com> wrote:
> > > > >>
> > > > >> "The context is that SchemaRDD is becoming a common data format used for
> > > > >> bringing data into Spark from external systems, and used for various
> > > > >> components of Spark, e.g. MLlib's new pipeline API."
> > > >>
> > > >> i agree. this to me also implies it belongs in spark core, not sql
> > > >>
> > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > >> michaelmalak@yahoo.com.invalid> wrote:
> > > >>
> > > > >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
> > > > >>> Spark Meetup YouTube contained a wealth of background information on this
> > > > >>> idea (mostly from Patrick and Reynold :-).
> > > >>>
> > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > > >>>
> > > >>> ________________________________
> > > >>> From: Patrick Wendell <pwendell@gmail.com>
> > > >>> To: Reynold Xin <rxin@databricks.com>
> > > >>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> > > >>> Sent: Monday, January 26, 2015 4:01 PM
> > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > > >>>
> > > >>>
> > > > >>> One thing potentially not clear from this e-mail, there will be a 1:1
> > > > >>> correspondence where you can get an RDD to/from a DataFrame.
> > > >>>
> > > >>>
> > > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rxin@databricks.com> wrote:
> > > >>>> Hi,
> > > >>>>
> > > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> > > > >>>> get the community's opinion.
> > > >>>>
> > > > >>>> The context is that SchemaRDD is becoming a common data format used for
> > > > >>>> bringing data into Spark from external systems, and used for various
> > > > >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
> > > > >>>> and more users to be programming directly against the SchemaRDD API
> > > > >>>> rather than the core RDD API. SchemaRDD, through its less commonly used
> > > > >>>> DSL originally designed for writing test cases, has always had a
> > > > >>>> data-frame-like API. In 1.3, we are redesigning the API to make it
> > > > >>>> usable for end users.
> > > >>>>
> > > >>>>
> > > >>>> There are two motivations for the renaming:
> > > >>>>
> > > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > > >>>>
> > > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> > > > >>>> though it would contain some RDD functions like map, flatMap, etc), and
> > > > >>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> > > > >>>> Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
> > > >>>>
> > > >>>>
> > > > >>>> My understanding is that very few users program directly against the
> > > > >>>> SchemaRDD API at the moment, because it is not well documented. However,
> > > > >>>> to maintain backward compatibility, we can create a type alias DataFrame
> > > > >>>> that is still named SchemaRDD. This will maintain source compatibility
> > > > >>>> for Scala. That said, we will have to update all existing materials to
> > > > >>>> use DataFrame rather than SchemaRDD.
> > > >>>
> > > >>>
> > > > >>> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > > >>>
> > > >>>
> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > > >>>
> > > >>>
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > For additional commands, e-mail: dev-help@spark.apache.org
> > >
> > >
> >
>
