spark-dev mailing list archives

From: Reynold Xin <r...@databricks.com>
Subject: Re: renaming SchemaRDD -> DataFrame
Date: Tue, 10 Feb 2015 20:14:41 GMT
It's a good point. I will update the documentation to say that this is not
meant to be subclassed externally.
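
One common way to signal that in Scala is a package-private constructor --
just an illustration of the technique with toy names, not necessarily what
the actual patch does:

  package com.example.api

  // outside com.example.api, code can call methods on Frame but can
  // neither write `new Frame(...)` nor `class Mine extends Frame(...)`,
  // because both would need access to the constructor
  class Frame private[api] (cols: Seq[String]) {
    def columns: Seq[String] = cols
  }

  object Frame {
    // the blessed way to obtain an instance
    def fromColumns(cols: Seq[String]): Frame = new Frame(cols)
  }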


On Tue, Feb 10, 2015 at 12:10 PM, Koert Kuipers <koert@tresata.com> wrote:

> thanks matei, it's good to know i can create them like that
>
> reynold, yeah somehow the word sql gets me going :) sorry...
> yeah, agreed that you need new transformations to preserve the schema info.
> i misunderstood and thought i had to implement the whole bunch, but that is
> clearly not necessary, as matei indicated.
>
> alright, i am clearly being slow/dense here, but now it makes sense to me...
>
> On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin <rxin@databricks.com> wrote:
>
> > Koert,
> >
> > Don't get too hang up on the name SQL. This is exactly what you want: a
> > collection with record-like objects with field names and runtime types.
> >
> > Almost all of the 40 methods are transformations for structured data,
> such
> > as aggregation on a field, or filtering on a field. If all you have is
> the
> > old RDD style map/flatMap, then any transformation would lose the schema
> > information, making the extra schema information useless.
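>>
>> A rough sketch of the difference, assuming a SQLContext named sqlContext
>> and the 1.3-style API:
>>
>>   val df = sqlContext.jsonFile("people.json")   // schema inferred
>>
>>   // schema-aware transformation: the result still carries the schema
>>   val adults = df.filter(df("age") > 21)
>>
>>   // RDD-style map: the result is a plain RDD[String], the schema is gone
>>   val names = df.rdd.map(_.getString(0))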
>>
>> On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> so i understand the success of spark.sql. besides the fact that anything
>>> with the words SQL in its name will have thousands of developers running
>>> towards it because of the familiarity, there is also a genuine need for a
>>> generic RDD that holds record-like objects, with field names and runtime
>>> types. after all, that is a successful generic abstraction used in many
>>> structured data tools.
>>>
>>> but to me that abstraction is as simple as:
>>>
>>> trait SchemaRDD extends RDD[Row] {
>>>   def schema: StructType
>>> }
>>>
>>> and perhaps another abstraction to indicate it intends to be column
>>> oriented (with a few methods to efficiently extract a subset of columns).
>>> so that could be DataFrame.
>>>
>>> such simple contracts would allow many people to write loaders for this
>>> (say from csv) and whatnot.
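>>>
>>> for example, a minimal csv loader against such a contract could look like
>>> this (just a sketch: every column typed as string, no quoting or escaping,
>>> 1.3 type locations assumed):
>>>
>>>   import org.apache.spark.{Partition, SparkContext, TaskContext}
>>>   import org.apache.spark.rdd.RDD
>>>   import org.apache.spark.sql.Row
>>>   import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>
>>>   class CsvRDD(prev: RDD[Row], names: Seq[String])
>>>       extends RDD[Row](prev) with SchemaRDD {
>>>     def schema = StructType(names.map(StructField(_, StringType, true)))
>>>     def compute(split: Partition, ctx: TaskContext) = prev.iterator(split, ctx)
>>>     protected def getPartitions = prev.partitions
>>>   }
>>>
>>>   def csvFile(sc: SparkContext, path: String, names: Seq[String]): CsvRDD =
>>>     new CsvRDD(sc.textFile(path).map(l => Row(l.split(","): _*)), names)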
>>>
>>> what i do not understand is why it has to be much more complex than this.
>>> but if i look at DataFrame it has so much additional stuff that has (in
>>> my eyes) nothing to do with generic structured data analysis.
>>>
>>> for example, to implement DataFrame i need to implement about 40
>>> additional methods!? and for some of them the SQL-ness is obviously
>>> leaking into the abstraction. for example, why would i care about:
>>>   def registerTempTable(tableName: String): Unit
>>>
>>> best, koert
>>>
>>> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.github@gmail.com> wrote:
>>>
>>>> It is true that you can persist SchemaRDDs / DataFrames to disk via
>>>> Parquet, but a lot of time is lost to inefficiencies. The in-memory
>>>> columnar cached representation is completely different from the Parquet
>>>> file format, and I believe there has to be a translation into a Row
>>>> (because ultimately Spark SQL traverses Rows -- even the
>>>> InMemoryColumnarTableScan has to convert the columns back into Rows for
>>>> row-based processing). On the other hand, traditional data frames
>>>> process in a columnar fashion. Columnar storage is good, but nowhere
>>>> near as good as columnar processing.
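>>>>
>>>> A toy illustration of the gap (plain Scala, nothing Spark-specific):
>>>>
>>>>   case class RowLike(id: Int, amount: Double)
>>>>   val rows = Seq(RowLike(1, 2.0), RowLike(2, 3.5))
>>>>
>>>>   // row-based: touch every row object just to read one field
>>>>   val rowSum = rows.map(_.amount).sum
>>>>
>>>>   // columnar: the column is one contiguous primitive array, so the
>>>>   // scan is cache-friendly and allocation-free
>>>>   val amounts = Array(2.0, 3.5)
>>>>   var colSum = 0.0
>>>>   var i = 0
>>>>   while (i < amounts.length) { colSum += amounts(i); i += 1 }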
>>>>
>>>> Another issue, which I don't know whether it has been solved yet: it is
>>>> difficult for Tachyon to efficiently cache Parquet files without
>>>> understanding the file format itself.
>>>>
>>>> I gave a talk at last year's Spark Summit on this topic.
>>>>
>>>> I'm working on efforts to change this, however. Shoot me an email at
>>>> velvia at gmail if you're interested in joining forces.
>>>>
>>>> On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>>>>> Yes, when a DataFrame is cached in memory, it's stored in an efficient
>>>>> columnar format. And you can also easily persist it on disk using
>>>>> Parquet, which is also columnar.
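>>>>>
>>>>> For example, with the 1.3-style API and a DataFrame df:
>>>>>
>>>>>   df.cache()                              // in-memory columnar format
>>>>>   df.saveAsParquetFile("people.parquet")  // on-disk columnar format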
>>>>>
>>>>> Cheng
>>>>>
>>>>> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>>>>>>
>>>>>> to me the word DataFrame does come with certain expectations. one of
>>>>>> them is that the data is stored columnar. in R, data.frame internally
>>>>>> uses a list of sequences, i think, but since lists can have labels it's
>>>>>> more like a SortedMap[String, Array[_]]. this makes certain operations
>>>>>> very cheap (such as adding a column).
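>>>>>>
>>>>>> (a toy version of that idea, where adding a column is one map update
>>>>>> and no per-row work at all:)
>>>>>>
>>>>>>   import scala.collection.immutable.SortedMap
>>>>>>
>>>>>>   type Frame = SortedMap[String, Array[_]]
>>>>>>   val df: Frame = SortedMap("age" -> Array(30, 40),
>>>>>>                             "name" -> Array("a", "b"))
>>>>>>   val df2: Frame = df + ("city" -> Array("nyc", "sf"))  // cheap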
>>>>>>
>>>>>> in Spark the closest thing would be a data structure where, per
>>>>>> Partition, the data is also stored columnar. does spark SQL already use
>>>>>> something like that? Evan mentioned "Spark SQL columnar compression",
>>>>>> which sounds like it. where can i find that?
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.github@gmail.com> wrote:
> >> > >>
> >> > >>> +1.... having proper NA support is much cleaner than using null,
> at
> >> > >>> least the Java null.
> >> > >>>
> >> > >>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <
> >> evan.sparks@gmail.com
> >> > >
> >> > >>> wrote:
>>>>>>>>
>>>>>>>> You've got to be a little bit careful here. "NA" in systems like R or
>>>>>>>> pandas may have special meaning that is distinct from "null".
>>>>>>>>
>>>>>>>> See, e.g., http://www.r-bloggers.com/r-na-vs-null/
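>>>>>>>>
>>>>>>>> One way to keep the two apart in Scala -- just a sketch, not anything
>>>>>>>> Spark does today:
>>>>>>>>
>>>>>>>>   sealed trait Cell[+A]
>>>>>>>>   case class Value[A](a: A) extends Cell[A]
>>>>>>>>   case object NA extends Cell[Nothing]             // value unknown
>>>>>>>>   case object NotMeaningful extends Cell[Nothing]  // no value applies
>>>>>>>>
>>>>>>>>   val ages: Seq[Cell[Int]] = Seq(Value(30), NA, NotMeaningful)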
>>>>>>>>
>>>>>>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Isn't that just "null" in SQL?
>>>>>>>>>
>>>>>>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.github@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I believe that most DataFrame implementations out there, like
>>>>>>>>>> Pandas, support the idea of missing values / NA, and some support
>>>>>>>>>> the idea of Not Meaningful as well.
>>>>>>>>>>
>>>>>>>>>> Does Row support anything like that? That is important for certain
>>>>>>>>>> applications. I thought that Row worked by being a mutable object,
>>>>>>>>>> but haven't looked into the details in a while.
>>>>>>>>>>
>>>>>>>>>> -Evan
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> It shouldn't change the data source API at all, because data
>>>>>>>>>>> sources create RDD[Row], and that gets converted into a DataFrame
>>>>>>>>>>> automatically (previously to SchemaRDD):
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
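>>>>>>>>>>>
>>>>>>>>>>> Reduced to its shape, a scan-style source in those interfaces
>>>>>>>>>>> boils down to handing Spark SQL a schema plus an RDD[Row]
>>>>>>>>>>> (simplified sketch, not the full trait hierarchy):
>>>>>>>>>>>
>>>>>>>>>>>   abstract class MyRelation {
>>>>>>>>>>>     def schema: StructType     // what the rows look like
>>>>>>>>>>>     def buildScan(): RDD[Row]  // the rows themselves
>>>>>>>>>>>   }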
>>>>>>>>>>>
>>>>>>>>>>> One thing that will break the data source API in 1.3 is the
>>>>>>>>>>> location of types. Types were previously defined in
>>>>>>>>>>> sql.catalyst.types, and have now moved to sql.types. After 1.3,
>>>>>>>>>>> sql.catalyst is hidden from users, and all public APIs have
>>>>>>>>>>> first-class classes/objects defined in sql directly.
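>>>>>>>>>>>
>>>>>>>>>>> Concretely, the migration is an import change:
>>>>>>>>>>>
>>>>>>>>>>>   // before 1.3:
>>>>>>>>>>>   import org.apache.spark.sql.catalyst.types.{StringType, StructType}
>>>>>>>>>>>   // 1.3 and later:
>>>>>>>>>>>   import org.apache.spark.sql.types.{StringType, StructType}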
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.github@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hey guys,
>>>>>>>>>>>>
>>>>>>>>>>>> How does this impact the data sources API? I was planning on
>>>>>>>>>>>> using this for a project.
>>>>>>>>>>>>
>>>>>>>>>>>> +1 that many things from spark-sql / DataFrame are universally
>>>>>>>>>>>> desirable and useful.
>>>>>>>>>>>>
>>>>>>>>>>>> By the way, one thing that prevents the columnar compression
>>>>>>>>>>>> stuff in Spark SQL from being more useful is, at least from
>>>>>>>>>>>> previous talks with Reynold and Michael et al., that the format
>>>>>>>>>>>> was not designed for persistence.
>>>>>>>>>>>>
>>>>>>>>>>>> I have a new project that aims to change that. It is a
>>>>>>>>>>>> zero-serialisation, high-performance binary vector library,
>>>>>>>>>>>> designed from the outset to be persistent-storage friendly.
>>>>>>>>>>>> Maybe one day it can replace the Spark SQL columnar compression.
>>>>>>>>>>>>
>>>>>>>>>>>> Michael told me this would be a lot of work, and recreates parts
>>>>>>>>>>>> of Parquet, but I think it's worth it. LMK if you'd like more
>>>>>>>>>>>> details.
>>>>>>>>>>>>
>>>>>>>>>>>> -Evan
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alright, I have merged the patch
>>>>>>>>>>>>> (https://github.com/apache/spark/pull/4173) since I don't see
>>>>>>>>>>>>> any strong opinions against it (as a matter of fact, most were
>>>>>>>>>>>>> for it). We can still change it if somebody lays out a strong
>>>>>>>>>>>>> argument.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> The type alias means your methods can specify either type
> and
> >> > >>>
> >> > >>> they
> >> > >>>>>>
> >> > >>>>>> will
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> work. It's just another name for the same type. But
> Scaladocs
> >> > >>>
> >> > >>> and
> >> > >>>>>>
> >> > >>>>>> such
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> will
> >> > >>>>>>>>>> show DataFrame as the type.
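>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Roughly, the alias amounts to this, so old signatures keep
>>>>>>>>>>>>>> compiling:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   type SchemaRDD = DataFrame
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   // a DataFrame can be passed wherever a SchemaRDD is expected
>>>>>>>>>>>>>>   def describe(data: SchemaRDD): String = data.schema.treeString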
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semighini@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reynold,
>>>>>>>>>>>>>>> But with the type alias we will have the same problem, right?
>>>>>>>>>>>>>>> If the methods don't receive SchemaRDD anymore, we will have
>>>>>>>>>>>>>>> to change our code to migrate from SchemaRDD to DataFrame,
>>>>>>>>>>>>>>> unless we have an implicit conversion between DataFrame and
>>>>>>>>>>>>>>> SchemaRDD.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rxin@databricks.com>:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> Dirceu,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> That is not possible because one cannot overload return
> >> > >>>
> >> > >>> types.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to
> >> > >>>
> >> > >>> return
> >> > >>>>>>
> >> > >>>>>> some
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> type,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> and that type cannot be both SchemaRDD and DataFrame.
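>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> i.e., this cannot compile, since the two overloads differ
>>>>>>>>>>>>>>>> only in return type:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   def parquetFile(path: String): SchemaRDD = ???
>>>>>>>>>>>>>>>>   def parquetFile(path: String): DataFrame = ???  // error: defined twice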
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called
>>>>>>>>>>>>>>>> SchemaRDD, to not break source compatibility for Scala.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semighini@gmail.com> wrote:
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and
> >> be
> >> > >>>>>>
> >> > >>>>>> removed
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> in
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> the
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> release 1.5(+/- 1)  for example, and the new code been
> >> added
> >> > >>>>>>>>>>>>> to
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> DataFrame?
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> With this, we don't impact in existing code for the next
> >> few
> >> > >>>>>>>>>>>>> releases.
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
> >> > >>>>>>>>>>>>> <kushal.datta@gmail.com>:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I want to address the issue that Matei raised about the
>>>>>>>>>>>>>>>>>> heavy lifting required for full SQL support. It is amazing
>>>>>>>>>>>>>>>>>> that even after 30 years of research there is not a single
>>>>>>>>>>>>>>>>>> good open source columnar database like Vertica. There is
>>>>>>>>>>>>>>>>>> a column store option in MySQL, but it is not nearly as
>>>>>>>>>>>>>>>>>> sophisticated as Vertica or MonetDB. Yet there is a true
>>>>>>>>>>>>>>>>>> need for such a system. I wonder why that is, and it's
>>>>>>>>>>>>>>>>>> high time to change it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.ryza@cloudera.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I
>>>>>>>>>>>>>>>>>>> like the former slightly better because it's more
>>>>>>>>>>>>>>>>>>> descriptive.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Even if SchemaRDD needs to rely on Spark SQL under the
>>>>>>>>>>>>>>>>>>> covers, it would be clearer from a user-facing
>>>>>>>>>>>>>>>>>>> perspective to at least choose a package name for it that
>>>>>>>>>>>>>>>>>>> omits "sql".
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark
>>>>>>>>>>>>>>>>>>> Schema module for Spark SQL to rely on, but I imagine
>>>>>>>>>>>>>>>>>>> that might be too large a change at this point?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -Sandy
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (Actually, when we designed Spark SQL we thought of
>>>>>>>>>>>>>>>>>>>> giving it another name, like Spark Schema, but we
>>>>>>>>>>>>>>>>>>>> decided to stick with SQL since that was the most
>>>>>>>>>>>>>>>>>>>> obvious use case to many users.)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> While it might be possible to move this concept to
>>>>>>>>>>>>>>>>>>>>> Spark Core long-term, supporting structured data
>>>>>>>>>>>>>>>>>>>>> efficiently does require quite a bit of the
>>>>>>>>>>>>>>>>>>>>> infrastructure in Spark SQL, such as query planning and
>>>>>>>>>>>>>>>>>>>>> columnar storage. The intent of Spark SQL, though, is
>>>>>>>>>>>>>>>>>>>>> to be more than a SQL server -- it's meant to be a
>>>>>>>>>>>>>>>>>>>>> library for manipulating structured data. Since this is
>>>>>>>>>>>>>>>>>>>>> possible to build over the core API, it's pretty
>>>>>>>>>>>>>>>>>>>>> natural to organize it that way, same as Spark
>>>>>>>>>>>>>>>>>>>>> Streaming is a library.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common
>>>>>>>>>>>>>>>>>>>>>> data format used for bringing data into Spark from
>>>>>>>>>>>>>>>>>>>>>> external systems, and used for various components of
>>>>>>>>>>>>>>>>>>>>>> Spark, e.g. MLlib's new pipeline API."
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> i agree. this to me also implies it belongs in spark
>>>>>>>>>>>>>>>>>>>>>> core, not sql
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelmalak@yahoo.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> And on the off chance that anyone hasn't seen it
>>>>>>>>>>>>>>>>>>>>>>> yet, the Jan. 13 Bay Area Spark Meetup YouTube
>>>>>>>>>>>>>>>>>>>>>>> contained a wealth of background information on this
>>>>>>>>>>>>>>>>>>>>>>> idea (mostly from Patrick and Reynold :-).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>>>>>>> From: Patrick Wendell <pwendell@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> To: Reynold Xin <rxin@databricks.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>>>>>>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail:
>>>>>>>>>>>>>>>>>>>>>>> there will be a 1:1 correspondence where you can get
>>>>>>>>>>>>>>>>>>>>>>> an RDD to/from a DataFrame.
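>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Sketch of the round trip (sqlContext, rowRdd and
>>>>>>>>>>>>>>>>>>>>>>> schema assumed in scope; 1.3-style API):
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>   val df = sqlContext.createDataFrame(rowRdd, schema) // RDD -> DataFrame
>>>>>>>>>>>>>>>>>>>>>>>   val rows: RDD[Row] = df.rdd                         // DataFrame -> RDD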
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>>>>>>> in 1.3, and wanted to get the community's opinion.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common
>>>>>>>>>>>>>>>>>>>>>>>> data format used for bringing data into Spark from
>>>>>>>>>>>>>>>>>>>>>>>> external systems, and used for various components
>>>>>>>>>>>>>>>>>>>>>>>> of Spark, e.g. MLlib's new pipeline API. We also
>>>>>>>>>>>>>>>>>>>>>>>> expect more and more users to be programming
>>>>>>>>>>>>>>>>>>>>>>>> directly against the SchemaRDD API rather than the
>>>>>>>>>>>>>>>>>>>>>>>> core RDD API. SchemaRDD, through its less commonly
>>>>>>>>>>>>>>>>>>>>>>>> used DSL originally designed for writing test
>>>>>>>>>>>>>>>>>>>>>>>> cases, has always had the data-frame-like API. In
>>>>>>>>>>>>>>>>>>>>>>>> 1.3, we are redesigning the API to make it usable
>>>>>>>>>>>>>>>>>>>>>>>> for end users.
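>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (Roughly, the old test-oriented DSL vs. the
>>>>>>>>>>>>>>>>>>>>>>>> redesigned API -- illustrative only:)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>   // 1.2-style SchemaRDD DSL, with Scala symbols:
>>>>>>>>>>>>>>>>>>>>>>>>   people.where('age > 21).select('name)
>>>>>>>>>>>>>>>>>>>>>>>>   // 1.3-style DataFrame API:
>>>>>>>>>>>>>>>>>>>>>>>>   people.filter(people("age") > 21).select("name")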
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name
>>>>>>>>>>>>>>>>>>>>>>>> than SchemaRDD.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be
>>>>>>>>>>>>>>>>>>>>>>>> an RDD anymore (even though it would contain some
>>>>>>>>>>>>>>>>>>>>>>>> RDD functions like map, flatMap, etc), and calling
>>>>>>>>>>>>>>>>>>>>>>>> it Schema*RDD* while it is not an RDD is highly
>>>>>>>>>>>>>>>>>>>>>>>> confusing. Instead, DataFrame.rdd will return the
>>>>>>>>>>>>>>>>>>>>>>>> underlying RDD for all RDD methods.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> My understanding is that very few users program
>>>>>>>>>>>>>>>>>>>>>>>> directly against the SchemaRDD API at the moment,
>>>>>>>>>>>>>>>>>>>>>>>> because it is not well documented. However, to
>>>>>>>>>>>>>>>>>>>>>>>> maintain backward compatibility, we can create a
>>>>>>>>>>>>>>>>>>>>>>>> type alias for DataFrame that is still named
>>>>>>>>>>>>>>>>>>>>>>>> SchemaRDD. This will maintain source compatibility
>>>>>>>>>>>>>>>>>>>>>>>> for Scala. That said, we will have to update all
>>>>>>>>>>>>>>>>>>>>>>>> existing materials to use DataFrame rather than
>>>>>>>>>>>>>>>>>>>>>>>> SchemaRDD.
