spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: renaming SchemaRDD -> DataFrame
Date Thu, 29 Jan 2015 21:59:09 GMT
Yes, when a DataFrame is cached in memory, it's stored in an efficient 
columnar format. And you can also easily persist it on disk using 
Parquet, which is also columnar.
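
A minimal sketch of both paths described above, using the 1.3-era Scala API
(assumes an existing SparkContext `sc` and a hypothetical input path; these
method names are per the 1.3 docs and changed in later releases):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)               // sc: existing SparkContext
    val df = sqlContext.parquetFile("events.parquet") // hypothetical input path

    df.cache()   // the first action materializes the in-memory columnar format
    df.count()

    df.saveAsParquetFile("events-out.parquet")        // columnar on disk via Parquet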

Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:
> to me the word DataFrame does come with certain expectations. one of them
> is that the data is stored columnar. in R, data.frame internally uses a list
> of sequences i think, but since lists can have labels it's more like a
> SortedMap[String, Array[_]]. this makes certain operations very cheap (such
> as adding a column).
>
> in Spark the closest thing would be a data structure where per Partition
> the data is also stored columnar. does spark SQL already use something like
> that? Evan mentioned "Spark SQL columnar compression", which sounds like
> it. where can i find that?
>
> thanks
>
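As a toy illustration of the per-partition columnar layout Koert sketches
above (a hypothetical structure for this discussion, not Spark SQL's actual
in-memory format), adding a column only allocates the new array:

    import scala.collection.immutable.TreeMap

    // One partition stored column-wise: column name -> values of that column.
    case class ColumnarPartition(columns: TreeMap[String, Array[Any]]) {
      // Adding a column leaves existing arrays untouched; only the map is updated.
      def withColumn(name: String, values: Array[Any]): ColumnarPartition =
        ColumnarPartition(columns + (name -> values))
    }

    val part  = ColumnarPartition(TreeMap("id" -> Array[Any](1, 2, 3)))
    val part2 = part.withColumn("score", Array[Any](0.5, 0.7, 0.9))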
> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.github@gmail.com> wrote:
>
>> +1.... having proper NA support is much cleaner than using null, at
>> least the Java null.
>>
>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.sparks@gmail.com>
>> wrote:
>>> You've got to be a little bit careful here. "NA" in systems like R or
>>> pandas may have special meaning that is distinct from "null".
>>>
>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>>>
>>>
>>>
>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>> Isn't that just "null" in SQL?
>>>>
>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.github@gmail.com>
>>>> wrote:
>>>>
>>>>> I believe that most DataFrame implementations out there, like Pandas,
>>>>> support the idea of missing values / NA, and some support the idea of
>>>>> Not Meaningful as well.
>>>>>
>>>>> Does Row support anything like that?  That is important for certain
>>>>> applications.  I thought that Row worked by being a mutable object,
>>>>> but haven't looked into the details in a while.
>>>>>
>>>>> -Evan
>>>>>
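For reference, Spark SQL's Row represents a missing field as null rather
than a distinct NA value, consistent with Reynold's point below; a small
sketch against the 1.3-era API (Row.isNullAt is in the public API, but treat
the exact usage here as an assumption):

    import org.apache.spark.sql.Row

    val row = Row(1, null, "alice")  // null stands in for a missing value
    val age =
      if (row.isNullAt(1)) None     // surface the gap as an Option, not a raw null
      else Some(row.getInt(1))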
>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <rxin@databricks.com>
>>>>> wrote:
>>>>>> It shouldn't change the data source API at all because data sources
>>>>>> create RDD[Row], and that gets converted into a DataFrame automatically
>>>>>> (previously to SchemaRDD).
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
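A hedged sketch of the shape of that interface: a relation produces
RDD[Row], and Spark SQL wraps the result for the user (names per the linked
interfaces.scala as of 1.3; treat the details as approximate):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // A toy relation: buildScan produces RDD[Row]; Spark SQL wraps the
    // result in a DataFrame (a SchemaRDD before 1.3) for the user.
    class OnesRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
      override def schema: StructType = StructType(Seq(StructField("x", IntegerType)))
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(1 to 3).map(Row(_))
    }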
>>>>>> One thing that will break the data source API in 1.3 is the location of
>>>>>> types. Types were previously defined in sql.catalyst.types, and have now
>>>>>> moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all
>>>>>> public APIs have first-class classes/objects defined in sql directly.
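A before/after sketch of the import change being described (type names
assumed from the 1.3 docs):

    // Before 1.3 the types lived in a catalyst package (hidden from 1.3 on):
    //   import org.apache.spark.sql.catalyst.types.{StringType, StructField, StructType}
    // From 1.3 on they are public under sql.types:
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(StructField("name", StringType, nullable = true)))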
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.github@gmail.com>
>>>>>> wrote:
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> How does this impact the data sources API?  I was planning on using
>>>>>>> this for a project.
>>>>>>>
>>>>>>> +1 that many things from spark-sql / DataFrame are universally
>>>>>>> desirable and useful.
>>>>>>>
>>>>>>> By the way, one thing that prevents the columnar compression stuff in
>>>>>>> Spark SQL from being more useful is, at least from previous talks with
>>>>>>> Reynold and Michael et al., that the format was not designed for
>>>>>>> persistence.
>>>>>>>
>>>>>>> I have a new project that aims to change that.  It is a
>>>>>>> zero-serialisation, high-performance binary vector library, designed
>>>>>>> from the outset to be persistent-storage friendly.  Maybe one day it
>>>>>>> can replace the Spark SQL columnar compression.
>>>>>>>
>>>>>>> Michael told me this would be a lot of work, and recreates parts of
>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more details.
>>>>>>> -Evan
>>>>>>>
>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <rxin@databricks.com>
>>>>>>> wrote:
>>>>>>>> Alright, I have merged the patch
>>>>>>>> (https://github.com/apache/spark/pull/4173) since I don't see any
>>>>>>>> strong opinions against it (as a matter of fact, most were for it).
>>>>>>>> We can still change it if somebody lays out a strong argument.
>>>>>>>>
>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>>>>>>>> <matei.zaharia@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The type alias means your methods can specify either type and they
>>>>>>>>> will work. It's just another name for the same type. But Scaladocs
>>>>>>>>> and such will show DataFrame as the type.
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>>>
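A self-contained illustration of how such a type alias behaves (stand-in
classes for this sketch, not the actual Spark source):

    class DataFrame                     // stand-in for the real class

    object compat {
      type SchemaRDD = DataFrame        // a second name for the same type
    }
    import compat.SchemaRDD

    def describe(df: SchemaRDD): String = df.toString
    describe(new DataFrame)             // compiles: the two types are identical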
>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
>>>>>>>>>> Reynold,
>>>>>>>>>> But with a type alias we will have the same problem, right?
>>>>>>>>>> If the methods don't receive SchemaRDD anymore, we will have to
>>>>>>>>>> change our code to migrate from SchemaRDD to DataFrame, unless we
>>>>>>>>>> have an implicit conversion between DataFrame and SchemaRDD.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <rxin@databricks.com>:
>>>>>>>>>>
>>>>>>>>>>> Dirceu,
>>>>>>>>>>>
>>>>>>>>>>> That is not possible because one cannot overload return types.
>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to return
>>>>>>>>>>> some type, and that type cannot be both SchemaRDD and DataFrame.
>>>>>>>>>>>
>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called SchemaRDD
>>>>>>>>>>> to not break source compatibility for Scala.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>>>>>>>>>>> dirceu.semighini@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and be
>>>>>>>>>>>> removed in release 1.5 (+/- 1), for example, and the new code be
>>>>>>>>>>>> added to DataFrame? With this, we don't impact existing code for
>>>>>>>>>>>> the next few releases.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
>>>>>>>>>>>> <kushal.datta@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I want to address the issue that Matei raised about the heavy
>>>>>>>>>>>>> lifting required for full SQL support. It is amazing that even
>>>>>>>>>>>>> after 30 years of research there is not a single good open-source
>>>>>>>>>>>>> columnar database like Vertica. There is a column store option in
>>>>>>>>>>>>> MySQL, but it is not nearly as sophisticated as Vertica or
>>>>>>>>>>>>> MonetDB. But there's a true need for such a system. I wonder why
>>>>>>>>>>>>> that is, and it's high time to change it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.ryza@cloudera.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
>>>>>>>>>>>>>> the former slightly better because it's more descriptive.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers,
>>>>>>>>>>>>>> it would be clearer from a user-facing perspective to at least
>>>>>>>>>>>>>> choose a package name for it that omits "sql".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark Schema
>>>>>>>>>>>>>> module for Spark SQL to rely on, but I imagine that might be too
>>>>>>>>>>>>>> large a change at this point?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Sandy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>>>>>>>>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (Actually when we designed Spark SQL we thought of giving it
>>>>>>>>>>>>>>> another name, like Spark Schema, but we decided to stick with
>>>>>>>>>>>>>>> SQL since that was the most obvious use case to many users.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>>>>>>>>>>>>>>>> matei.zaharia@gmail.com> wrote:
>>>>>>>>>>>>>>>> While it might be possible to move this concept to Spark Core
>>>>>>>>>>>>>>>> long-term, supporting structured data efficiently does require
>>>>>>>>>>>>>>>> quite a bit of the infrastructure in Spark SQL, such as query
>>>>>>>>>>>>>>>> planning and columnar storage. The intent of Spark SQL, though,
>>>>>>>>>>>>>>>> is to be more than a SQL server -- it's meant to be a library
>>>>>>>>>>>>>>>> for manipulating structured data. Since this is possible to
>>>>>>>>>>>>>>>> build over the core API, it's pretty natural to organize it
>>>>>>>>>>>>>>>> that way, same as Spark Streaming is a library.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
>>>>>>>>>>>>>>>>> koert@tresata.com> wrote:
>>>>>>>>>>>>>>>>> "The context is that
SchemaRDD is becoming a common
>> data
>>>>>>>>>>>>>>>>> format
>>>>>>>>>>>> used
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> bringing data into Spark
from external systems, and
>> used
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>> various
>>>>>>>>>>>>>>>>> components of Spark,
e.g. MLlib's new pipeline API."
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> i agree. this to me also
implies it belongs in spark
>>>>>>>>>>>>>>>>> core,
>>>>> not
>>>>>>>>>>>> sql
>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015
at 6:11 PM, Michael Malak <
>>>>>>>>>>>>>>>>> michaelmalak@yahoo.com.invalid>
wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And on the off chance that anyone hasn't seen it yet, the
>>>>>>>>>>>>>>>>>> Jan. 13 Bay Area Spark Meetup YouTube video contained a
>>>>>>>>>>>>>>>>>> wealth of background information on this idea (mostly from
>>>>>>>>>>>>>>>>>> Patrick and Reynold :-).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>> From: Patrick Wendell <pwendell@gmail.com>
>>>>>>>>>>>>>>>>>> To: Reynold Xin <rxin@databricks.com>
>>>>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail: there
>>>>>>>>>>>>>>>>>> will be a 1:1 correspondence where you can get an RDD
>>>>>>>>>>>>>>>>>> to/from a DataFrame.
>>>>>>>>>>>>>>>>>>
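A hedged sketch of that round trip with the 1.3-era API (createDataFrame and
.rdd per the 1.3 docs; `sc` is an existing SparkContext):

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    val rdd    = sc.parallelize(Seq(Row(1), Row(2)))
    val schema = StructType(Seq(StructField("x", IntegerType)))

    val df   = sqlContext.createDataFrame(rdd, schema)  // RDD -> DataFrame
    val back = df.rdd                                   // DataFrame -> RDD[Row]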
>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>>>>>>>>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3,
>>>>>>>>>>>>>>>>>>> and wanted to get the community's opinion.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common data
>>>>>>>>>>>>>>>>>>> format used for bringing data into Spark from external
>>>>>>>>>>>>>>>>>>> systems, and used for various components of Spark, e.g.
>>>>>>>>>>>>>>>>>>> MLlib's new pipeline API. We also expect more and more
>>>>>>>>>>>>>>>>>>> users to be programming directly against the SchemaRDD API
>>>>>>>>>>>>>>>>>>> rather than the core RDD API. SchemaRDD, through its less
>>>>>>>>>>>>>>>>>>> commonly used DSL originally designed for writing test
>>>>>>>>>>>>>>>>>>> cases, has always had the data-frame-like API. In 1.3, we
>>>>>>>>>>>>>>>>>>> are redesigning the API to make it usable for end users.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>>>>>>>>>>>>>>>>>>> SchemaRDD.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>>>>>>>>>>>>>>>>>>> anymore (even though it would contain some RDD functions
>>>>>>>>>>>>>>>>>>> like map, flatMap, etc.), and calling it Schema*RDD* while
>>>>>>>>>>>>>>>>>>> it is not an RDD is highly confusing. Instead,
>>>>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
>>>>>>>>>>>>>>>>>>> methods.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My understanding is that very few users program directly
>>>>>>>>>>>>>>>>>>> against the SchemaRDD API at the moment, because it is not
>>>>>>>>>>>>>>>>>>> well documented. However, to maintain backward
>>>>>>>>>>>>>>>>>>> compatibility, we can create a type alias for DataFrame
>>>>>>>>>>>>>>>>>>> that is still named SchemaRDD. This will maintain source
>>>>>>>>>>>>>>>>>>> compatibility for Scala. That said, we will have to update
>>>>>>>>>>>>>>>>>>> all existing materials to use DataFrame rather than
>>>>>>>>>>>>>>>>>>> SchemaRDD.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

