spark-dev mailing list archives

From Evan Chan <velvia.git...@gmail.com>
Subject Re: renaming SchemaRDD -> DataFrame
Date Thu, 29 Jan 2015 00:20:30 GMT
Hey guys,

How does this impact the data sources API?  I was planning on using
this for a project.

+1 that many things from spark-sql / DataFrame are universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff in
Spark SQL from being more useful is, at least from previous talks with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that.  It is a
zero-serialisation, high-performance binary vector library, designed
from the outset to be persistent-storage friendly.  Maybe one day
it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and would recreate parts of
Parquet, but I think it's worth it.  LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rxin@databricks.com> wrote:
> Alright I have merged the patch ( https://github.com/apache/spark/pull/4173
> ) since I don't see any strong opinions against it (as a matter of fact
> most were for it). We can still change it if somebody lays out a strong
> argument.
>
> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaharia@gmail.com>
> wrote:
>
>> The type alias means your methods can specify either type and they will
>> work. It's just another name for the same type. But Scaladocs and such will
>> show DataFrame as the type.
>>
>> Matei
>>
>> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> dirceu.semighini@gmail.com> wrote:
>> >
>> > Reynold,
>> > But with type alias we will have the same problem, right?
>> > If the methods don't receive SchemaRDD anymore, we will have to change
>> > our code to migrate from SchemaRDD to DataFrame, unless we have an implicit
>> > conversion between DataFrame and SchemaRDD.
>> >
>> >
>> >
>> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <rxin@databricks.com>:
>> >
>> >> Dirceu,
>> >>
>> >> That is not possible because one cannot overload return types.
>> >>
>> >> SQLContext.parquetFile (and many other methods) needs to return some type,
>> >> and that type cannot be both SchemaRDD and DataFrame.
>> >>
>> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
>> >> break source compatibility for Scala.
>> >>
>> >>
>> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> dirceu.semighini@gmail.com> wrote:
>> >>
>> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>> >>> release 1.5 (+/- 1), for example, with the new code added to DataFrame?
>> >>> With this, we don't impact existing code for the next few releases.
>> >>>
>> >>>
>> >>>
>> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.datta@gmail.com>:
>> >>>
>> >>>> I want to address the issue that Matei raised about the heavy lifting
>> >>>> required for full SQL support. It is amazing that even after 30 years of
>> >>>> research there is not a single good open source columnar database like
>> >>>> Vertica. There is a column store option in MySQL, but it is not nearly as
>> >>>> sophisticated as Vertica or MonetDB. But there's a true need for such a
>> >>>> system. I wonder why that is, and it's high time to change it.
>> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.ryza@cloudera.com>
>> wrote:
>> >>>>
>> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the former
>> >>>>> slightly better because it's more descriptive.
>> >>>>>
>> >>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
>> >>>>> be clearer from a user-facing perspective to at least choose a package
>> >>>>> name for it that omits "sql".
>> >>>>>
>> >>>>> I would also be in favor of adding a separate Spark Schema module for Spark
>> >>>>> SQL to rely on, but I imagine that might be too large a change at this
>> >>>>> point?
>> >>>>>
>> >>>>> -Sandy
>> >>>>>
>> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >>> matei.zaharia@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> (Actually when we designed Spark SQL we thought of giving it another name,
>> >>>>>> like Spark Schema, but we decided to stick with SQL since that was the most
>> >>>>>> obvious use case to many users.)
>> >>>>>>
>> >>>>>> Matei
>> >>>>>>
>> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >>> matei.zaharia@gmail.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> While it might be possible to move this concept to Spark Core long-term,
>> >>>>>>> supporting structured data efficiently does require quite a bit of the
>> >>>>>>> infrastructure in Spark SQL, such as query planning and columnar storage.
>> >>>>>>> The intent of Spark SQL though is to be more than a SQL server -- it's
>> >>>>>>> meant to be a library for manipulating structured data. Since this is
>> >>>>>>> possible to build over the core API, it's pretty natural to organize it
>> >>>>>>> that way, same as Spark Streaming is a library.
>> >>>>>>>
>> >>>>>>> Matei
>> >>>>>>>
>> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <koert@tresata.com>
>> >>>> wrote:
>> >>>>>>>>
>> >>>>>>>> "The context is that SchemaRDD is becoming a common data format used for
>> >>>>>>>> bringing data into Spark from external systems, and used for various
>> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>> >>>>>>>>
>> >>>>>>>> i agree. this to me also implies it belongs in spark core, not sql
>> >>>>>>>>
>> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> >>>>>>>> michaelmalak@yahoo.com.invalid> wrote:
>> >>>>>>>>
>> >>>>>>>>> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
>> >>>>>>>>> Spark Meetup video on YouTube contained a wealth of background information
>> >>>>>>>>> on this idea (mostly from Patrick and Reynold :-).
>> >>>>>>>>>
>> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >>>>>>>>>
>> >>>>>>>>> ________________________________
>> >>>>>>>>> From: Patrick Wendell <pwendell@gmail.com>
>> >>>>>>>>> To: Reynold Xin <rxin@databricks.com>
>> >>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> One thing potentially not clear from this e-mail: there will be a 1:1
>> >>>>>>>>> correspondence where you can get an RDD to/from a DataFrame.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <rxin@databricks.com> wrote:
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>> >>>>>>>>>> get the community's opinion.
>> >>>>>>>>>>
>> >>>>>>>>>> The context is that SchemaRDD is becoming a common data format used for
>> >>>>>>>>>> bringing data into Spark from external systems, and used for various
>> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
>> >>>>>>>>>> more users to be programming directly against the SchemaRDD API rather than
>> >>>>>>>>>> the core RDD API. SchemaRDD, through its less commonly used DSL originally
>> >>>>>>>>>> designed for writing test cases, has always had a data-frame-like API. In
>> >>>>>>>>>> 1.3, we are redesigning the API to make it usable for end users.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> There are two motivations for the renaming:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>> >>>>>>>>>>
>> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>> >>>>>>>>>> though it would contain some RDD functions like map, flatMap, etc.), and
>> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
>> >>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> My understanding is that very few users program directly against the
>> >>>>>>>>>> SchemaRDD API at the moment, because it is not well documented. However,
>> >>>>>>>>>> to maintain backward compatibility, we can create a type alias for DataFrame
>> >>>>>>>>>> that is still named SchemaRDD. This will maintain source compatibility for
>> >>>>>>>>>> Scala. That said, we will have to update all existing materials to use
>> >>>>>>>>>> DataFrame rather than SchemaRDD.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>> ---------------------------------------------------------------------
>> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>> ---------------------------------------------------------------------
>> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>>
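
For readers following the type-alias part of the thread, here is a minimal sketch of the idea in plain Scala. It does not use Spark at all: the DataFrame class, the parquetFile method, and the legacyCount helper below are toy stand-ins invented for illustration, and rows are modelled as a plain Seq rather than an RDD. It only tries to show why a single return type plus an alias for the old name keeps old source compiling, and how exposing the underlying collection through an accessor (here called rdd, echoing DataFrame.rdd from the quoted proposal) differs from being that collection.

```scala
object AliasSketch {
  // Toy stand-in for the renamed type (NOT Spark's real DataFrame).
  // It wraps its rows instead of extending a collection type.
  final class DataFrame(private val rows: Seq[Map[String, Any]]) {
    // A data-frame style operation that returns another DataFrame.
    def select(col: String): DataFrame =
      new DataFrame(rows.map(r => Map(col -> r(col))))

    // Accessor for the underlying data, analogous in spirit to the
    // DataFrame.rdd accessor described in the thread.
    def rdd: Seq[Map[String, Any]] = rows
  }

  // The old name survives as an alias: both names denote the same type,
  // so no implicit conversion is needed between them.
  type SchemaRDD = DataFrame

  // A method can declare only one return type; thanks to the alias,
  // callers may refer to the result by either name.
  // (Hypothetical loader; the path is only carried along as data.)
  def parquetFile(path: String): DataFrame =
    new DataFrame(Seq(Map("path" -> path, "id" -> 1)))

  // "Legacy" code written against the old name keeps compiling.
  def legacyCount(data: SchemaRDD): Int = data.rdd.size

  def main(args: Array[String]): Unit = {
    val df: DataFrame = parquetFile("/tmp/example.parquet")
    val old: SchemaRDD = df // same type, different name
    println(legacyCount(df))
    println(old.select("id").rdd.head("id"))
  }
}
```

Because the alias is resolved at compile time, this addresses source compatibility only, which matches what the thread claims for Scala.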

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

