spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: renaming SchemaRDD -> DataFrame
Date Tue, 10 Feb 2015 19:58:01 GMT
Koert,

Don't get too hang up on the name SQL. This is exactly what you want: a
collection with record-like objects with field names and runtime types.

Almost all of the 40 methods are transformations for structured data, such
as aggregation on a field, or filtering on a field. If all you have is the
old RDD style map/flatMap, then any transformation would lose the schema
information, making the extra schema information useless.




On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <koert@tresata.com> wrote:

> so i understand the success or spark.sql. besides the fact that anything
> with the words SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for a
> generic RDD that holds record-like objects, with field names and runtime
> types. after all that is a successfull generic abstraction used in many
> structured data tools.
>
> but to me that abstraction is as simple as:
>
> trait SchemaRDD extends RDD[Row] {
>   def schema: StructType
> }
>
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of columns).
> so that could be DataFrame.
>
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
>
> what i do not understand why it has to be much more complex than this. but
> if i look at DataFrame it has so much additional stuff, that has (in my
> eyes) nothing to do with generic structured data analysis.
>
> for example to implement DataFrame i need to implement about 40 additional
> methods!? and for some the SQLness is obviously leaking into the
> abstraction. for example why would i care about:
>   def registerTempTable(tableName: String): Unit
>
>
> best, koert
>
> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.github@gmail.com> wrote:
>
> > It is true that you can persist SchemaRdds / DataFrames to disk via
> > Parquet, but a lot of time and inefficiencies is lost.   The in-memory
> > columnar cached representation is completely different from the
> > Parquet file format, and I believe there has to be a translation into
> > a Row (because ultimately Spark SQL traverses Row's -- even the
> > InMemoryColumnarTableScan has to then convert the columns into Rows
> > for row-based processing).   On the other hand, traditional data
> > frames process in a columnar fashion.   Columnar storage is good, but
> > nowhere near as good as columnar processing.
> >
> > Another issue, which I don't know if it is solved yet, but it is
> > difficult for Tachyon to efficiently cache Parquet files without
> > understanding the file format itself.
> >
> > I gave a talk at last year's Spark Summit on this topic.
> >
> > I'm working on efforts to change this, however.  Shoot me an email at
> > velvia at gmail if you're interested in joining forces.
> >
> > On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs.zju@gmail.com>
> wrote:
> > > Yes, when a DataFrame is cached in memory, it's stored in an efficient
> > > columnar format. And you can also easily persist it on disk using
> > Parquet,
> > > which is also columnar.
> > >
> > > Cheng
> > >
> > >
> > > On 1/29/15 1:24 PM, Koert Kuipers wrote:
> > >>
> > >> to me the word DataFrame does come with certain expectations. one of
> > them
> > >> is that the data is stored columnar. in R data.frame internally uses a
> > >> list
> > >> of sequences i think, but since lists can have labels its more like a
> > >> SortedMap[String, Array[_]]. this makes certain operations very cheap
> > >> (such
> > >> as adding a column).
> > >>
> > >> in Spark the closest thing would be a data structure where per
> Partition
> > >> the data is also stored columnar. does spark SQL already use something
> > >> like
> > >> that? Evan mentioned "Spark SQL columnar compression", which sounds
> like
> > >> it. where can i find that?
> > >>
> > >> thanks
> > >>
> > >> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.github@gmail.com>
> > >> wrote:
> > >>
> > >>> +1.... having proper NA support is much cleaner than using null, at
> > >>> least the Java null.
> > >>>
> > >>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <
> evan.sparks@gmail.com
> > >
> > >>> wrote:
> > >>>>
> > >>>> You've got to be a little bit careful here. "NA" in systems like R
> or
> > >>>
> > >>> pandas
> > >>>>
> > >>>> may have special meaning that is distinct from "null".
> > >>>>
> > >>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rxin@databricks.com>
> > >>>
> > >>> wrote:
> > >>>>>
> > >>>>> Isn't that just "null" in SQL?
> > >>>>>
> > >>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <
> velvia.github@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I believe that most DataFrame implementations out there, like
> > Pandas,
> > >>>>>> supports the idea of missing values / NA, and some support the
> idea
> > of
> > >>>>>> Not Meaningful as well.
> > >>>>>>
> > >>>>>> Does Row support anything like that?  That is important for
> certain
> > >>>>>> applications.  I thought that Row worked by being a mutable
> object,
> > >>>>>> but haven't looked into the details in a while.
> > >>>>>>
> > >>>>>> -Evan
> > >>>>>>
> > >>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rxin@databricks.com
> >
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> It shouldn't change the data source api at all because data
> sources
> > >>>>>>
> > >>>>>> create
> > >>>>>>>
> > >>>>>>> RDD[Row], and that gets converted into a DataFrame automatically
> > >>>>>>
> > >>>>>> (previously
> > >>>>>>>
> > >>>>>>> to SchemaRDD).
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>
> > >>>
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> > >>>>>>>
> > >>>>>>> One thing that will break the data source API in 1.3 is the
> > location
> > >>>>>>> of
> > >>>>>>> types. Types were previously defined in sql.catalyst.types, and
> now
> > >>>>>>
> > >>>>>> moved to
> > >>>>>>>
> > >>>>>>> sql.types. After 1.3, sql.catalyst is hidden from users, and all
> > >>>>>>> public
> > >>>>>>
> > >>>>>> APIs
> > >>>>>>>
> > >>>>>>> have first class classes/objects defined in sql directly.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <
> > velvia.github@gmail.com
> > >>>>>>
> > >>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> Hey guys,
> > >>>>>>>>
> > >>>>>>>> How does this impact the data sources API?  I was planning on
> > using
> > >>>>>>>> this for a project.
> > >>>>>>>>
> > >>>>>>>> +1 that many things from spark-sql / DataFrame is universally
> > >>>>>>>> desirable and useful.
> > >>>>>>>>
> > >>>>>>>> By the way, one thing that prevents the columnar compression
> stuff
> > >>>
> > >>> in
> > >>>>>>>>
> > >>>>>>>> Spark SQL from being more useful is, at least from previous
> talks
> > >>>>>>>> with
> > >>>>>>>> Reynold and Michael et al., that the format was not designed for
> > >>>>>>>> persistence.
> > >>>>>>>>
> > >>>>>>>> I have a new project that aims to change that.  It is a
> > >>>>>>>> zero-serialisation, high performance binary vector library,
> > >>>
> > >>> designed
> > >>>>>>>>
> > >>>>>>>> from the outset to be a persistent storage friendly.  May be one
> > >>>
> > >>> day
> > >>>>>>>>
> > >>>>>>>> it can replace the Spark SQL columnar compression.
> > >>>>>>>>
> > >>>>>>>> Michael told me this would be a lot of work, and recreates parts
> > of
> > >>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more
> > >>>
> > >>> details.
> > >>>>>>>>
> > >>>>>>>> -Evan
> > >>>>>>>>
> > >>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <
> rxin@databricks.com
> > >
> > >>>>>>
> > >>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Alright I have merged the patch (
> > >>>>>>>>> https://github.com/apache/spark/pull/4173
> > >>>>>>>>> ) since I don't see any strong opinions against it (as a matter
> > >>>
> > >>> of
> > >>>>>>
> > >>>>>> fact
> > >>>>>>>>>
> > >>>>>>>>> most were for it). We can still change it if somebody lays out
> a
> > >>>>>>
> > >>>>>> strong
> > >>>>>>>>>
> > >>>>>>>>> argument.
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> > >>>>>>>>> <matei.zaharia@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> The type alias means your methods can specify either type and
> > >>>
> > >>> they
> > >>>>>>
> > >>>>>> will
> > >>>>>>>>>>
> > >>>>>>>>>> work. It's just another name for the same type. But Scaladocs
> > >>>
> > >>> and
> > >>>>>>
> > >>>>>> such
> > >>>>>>>>>>
> > >>>>>>>>>> will
> > >>>>>>>>>> show DataFrame as the type.
> > >>>>>>>>>>
> > >>>>>>>>>> Matei
> > >>>>>>>>>>
> > >>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> > >>>>>>>>>>
> > >>>>>>>>>> dirceu.semighini@gmail.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Reynold,
> > >>>>>>>>>>> But with type alias we will have the same problem, right?
> > >>>>>>>>>>> If the methods doesn't receive schemardd anymore, we will
> have
> > >>>>>>>>>>> to
> > >>>>>>>>>>> change
> > >>>>>>>>>>> our code to migrade from schema to dataframe. Unless we have
> > >>>
> > >>> an
> > >>>>>>>>>>>
> > >>>>>>>>>>> implicit
> > >>>>>>>>>>> conversion between DataFrame and SchemaRDD
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rxin@databricks.com
> >:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Dirceu,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> That is not possible because one cannot overload return
> > >>>
> > >>> types.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to
> > >>>
> > >>> return
> > >>>>>>
> > >>>>>> some
> > >>>>>>>>>>
> > >>>>>>>>>> type,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> and that type cannot be both SchemaRDD and DataFrame.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called
> > >>>>>>>>>>>> SchemaRDD
> > >>>>>>>>>>>> to
> > >>>>>>>>>>
> > >>>>>>>>>> not
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> break source compatibility for Scala.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> > >>>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and be
> > >>>>>>
> > >>>>>> removed
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> in
> > >>>>>>>>>>
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> release 1.5(+/- 1)  for example, and the new code been
> added
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>
> > >>>>>>>>>> DataFrame?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> With this, we don't impact in existing code for the next
> few
> > >>>>>>>>>>>>> releases.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
> > >>>>>>>>>>>>> <kushal.datta@gmail.com>:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I want to address the issue that Matei raised about the
> > >>>
> > >>> heavy
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> lifting
> > >>>>>>>>>>>>>> required for a full SQL support. It is amazing that even
> > >>>>>>>>>>>>>> after
> > >>>>>>
> > >>>>>> 30
> > >>>>>>>>>>
> > >>>>>>>>>> years
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> research there is not a single good open source columnar
> > >>>>>>
> > >>>>>> database
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>> Vertica. There is a column store option in MySQL, but it
> is
> > >>>>>>>>>>>>>> not
> > >>>>>>>>>>>>>> nearly
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> as
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> sophisticated as Vertica or MonetDB. But there's a true
> > >>>
> > >>> need
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>> such
> > >>>>>>>>>>
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> system. I wonder why so and it's high time to change that.
> > >>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza"
> > >>>>>>>>>>>>>> <sandy.ryza@cloudera.com>
> > >>>>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I
> > >>>
> > >>> like
> > >>>>>>
> > >>>>>> the
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> former
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> slightly better because it's more descriptive.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
> > >>>>>>
> > >>>>>> covers,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> it
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> would
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> be more clear from a user-facing perspective to at least
> > >>>>>>
> > >>>>>> choose a
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> package
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> name for it that omits "sql".
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark
> Schema
> > >>>>>>
> > >>>>>> module
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Spark
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> SQL to rely on, but I imagine that might be too large a
> > >>>>>>>>>>>>>>> change
> > >>>>>>
> > >>>>>> at
> > >>>>>>>>>>
> > >>>>>>>>>> this
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> point?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> -Sandy
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> matei.zaharia@gmail.com>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> (Actually when we designed Spark SQL we thought of
> giving
> > >>>>>>>>>>>>>>>> it
> > >>>>>>>>>>>>>>>> another
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> name,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> like Spark Schema, but we decided to stick with SQL
> since
> > >>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>> was
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> most
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> obvious use case to many users.)
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Matei
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> matei.zaharia@gmail.com>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> While it might be possible to move this concept to
> Spark
> > >>>>>>>>>>>>>>>>> Core
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> long-term,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> supporting structured data efficiently does require
> > >>>
> > >>> quite a
> > >>>>>>
> > >>>>>> bit
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> infrastructure in Spark SQL, such as query planning and
> > >>>>>>
> > >>>>>> columnar
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> storage.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The intent of Spark SQL though is to be more than a SQL
> > >>>>>>>>>>>>>>>> server
> > >>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> it's
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> meant to be a library for manipulating structured data.
> > >>>>>>>>>>>>>>>> Since
> > >>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> possible to build over the core API, it's pretty natural
> > >>>
> > >>> to
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> organize it
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> that way, same as Spark Streaming is a library.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Matei
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
> > >>>>>>
> > >>>>>> koert@tresata.com>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common
> > >>>
> > >>> data
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> used
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> bringing data into Spark from external systems, and
> > >>>
> > >>> used
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> various
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> i agree. this to me also implies it belongs in spark
> > >>>>>>>>>>>>>>>>>> core,
> > >>>>>>
> > >>>>>> not
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> sql
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > >>>>>>>>>>>>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> And in the off chance that anyone hasn't seen it yet,
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> Jan.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 13
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Bay
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Area
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Spark Meetup YouTube contained a wealth of background
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> information
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> idea (mostly from Patrick and Reynold :-).
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> ________________________________
> > >>>>>>>>>>>>>>>>>>> From: Patrick Wendell <pwendell@gmail.com>
> > >>>>>>>>>>>>>>>>>>> To: Reynold Xin <rxin@databricks.com>
> > >>>>>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> > >>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> > >>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail,
> > >>>
> > >>> there
> > >>>>>>
> > >>>>>> will
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1:1
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> correspondence where you can get an RDD to/from a
> > >>>>>>
> > >>>>>> DataFrame.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> rxin@databricks.com>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame
> in
> > >>>>>>>>>>>>>>>>>>>> 1.3,
> > >>>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wanted
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> get the community's opinion.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common
> > >>>
> > >>> data
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> used
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> bringing data into Spark from external systems, and
> > >>>>>>>>>>>>>>>>>>>> used
> > >>>>>>
> > >>>>>> for
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> various
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API.
> > >>>
> > >>> We
> > >>>>>>
> > >>>>>> also
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> expect
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> more
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> more users to be programming directly against
> > >>>
> > >>> SchemaRDD
> > >>>>>>
> > >>>>>> API
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> rather
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> than
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> core RDD API. SchemaRDD, through its less commonly
> > >>>
> > >>> used
> > >>>>>>
> > >>>>>> DSL
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> originally
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> designed for writing test cases, always has the
> > >>>>>>>>>>>>>>>>>>>> data-frame
> > >>>>>>>>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> API.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> In
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 1.3, we are redesigning the API to make the API
> > >>>
> > >>> usable
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>> end
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> users.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name
> > >>>
> > >>> than
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> SchemaRDD.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be
> an
> > >>>>>>>>>>>>>>>>>>>> RDD
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> anymore
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> (even
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> though it would contain some RDD functions like map,
> > >>>>>>>>>>>>>>>>>>>> flatMap,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> etc),
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> calling it Schema*RDD* while it is not an RDD is
> > >>>
> > >>> highly
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> confusing.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Instead.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
> > >>>>>>>>>>>>>>>>>>>> RDD
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> methods.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> My understanding is that very few users program
> > >>>>>>>>>>>>>>>>>>>> directly
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> against
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> SchemaRDD API at the moment, because they are not
> > >>>
> > >>> well
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> documented.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> However,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> oo maintain backward compatibility, we can create a
> > >>>>>>>>>>>>>>>>>>>> type
> > >>>>>>>>>>>>>>>>>>>> alias
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> DataFrame
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> that is still named SchemaRDD. This will maintain
> > >>>>>>>>>>>>>>>>>>>> source
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> compatibility
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Scala. That said, we will have to update all
> existing
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> materials to
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> use
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> DataFrame rather than SchemaRDD.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>
> > ---------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
> > >>>
> > >>> dev-unsubscribe@spark.apache.org
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
> > >>>>>>>>>>>>>>>>>>> dev-help@spark.apache.org
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>
> > ---------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
> > >>>
> > >>> dev-unsubscribe@spark.apache.org
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
> > >>>>>>>>>>>>>>>>>>> dev-help@spark.apache.org
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>
> > ---------------------------------------------------------------------
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> To unsubscribe, e-mail:
> dev-unsubscribe@spark.apache.org
> > >>>>>>>>>>>>>>>> For additional commands, e-mail:
> > >>>
> > >>> dev-help@spark.apache.org
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>> ---------------------------------------------------------------------
> > >>>>>>>>>>
> > >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >>>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>
> > >>>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > >>> For additional commands, e-mail: dev-help@spark.apache.org
> > >>>
> > >>>
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > > For additional commands, e-mail: dev-help@spark.apache.org
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message