spark-dev mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Structured streaming use of DataFrame vs Datasource
Date Thu, 16 Jun 2016 21:22:56 GMT
It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.
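
To make that concrete, here is a minimal sketch, assuming Spark 2.x or later
on the classpath and a local session. The Person case class, the object name,
and the sample values are made up for illustration. It checks that going from
Dataset[T] to DataFrame leaves the schema unchanged and that the typed view
can be recovered afterwards.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object SchemaPreservedSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session is acceptable for the demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-preserved-sketch")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

    // toDF() drops only the static type parameter T, not the schema.
    val df: DataFrame = people.toDF()

    // DataFrame is an alias for Dataset[Row], so this assignment compiles.
    val rows: Dataset[Row] = df

    // Column names and types are identical on both sides.
    assert(rows.schema == people.schema)

    // The typed view can be rebuilt from the schema-carrying DataFrame.
    val typedAgain: Dataset[Person] = df.as[Person]
    assert(typedAgain.collect().toSet == people.collect().toSet)

    // What T buys you is checking at compile time:
    //   people.map(_.agee)   // rejected by the Scala compiler
    //   df.select("agee")    // compiles, fails later with an AnalysisException

    spark.stop()
  }
}
```

The two commented-out lines show the trade-off: the typed API reports a
misspelled field at compile time, while the untyped API reports it only when
the query is analyzed.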

On Thu, Jun 16, 2016 at 2:05 PM, Cody Koeninger <cody@koeninger.org> wrote:

> I'm clear on what a type alias is.  My question is more that moving
> from e.g. Dataset[T] to Dataset[Row] involves throwing away
> information.  Reading through code that uses the DataFrame alias, it's
> a little hard for me to know when that's intentional or not.
>
>
> On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
> <tathagata.das1565@gmail.com> wrote:
> > There are different ways to view this. If it's confusing to think of the
> > Source API returning DataFrames, it's equivalent to thinking that you are
> > returning a Dataset[Row], and DataFrame is just a shorthand. And
> > DataFrame/Dataset[Row] is to Dataset[String] what Java Array[Object] is
> > to Array[String]. DataFrame is more general in a way, as every Dataset can
> > be boiled down to a DataFrame. So to keep the Source APIs general (and also
> > source-compatible with 1.x), they return DataFrame.
> >
> > On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <cody@koeninger.org> wrote:
> >>
> >> Is this really an internal / external distinction?
> >>
> >> For a concrete example, Source.getBatch seems to be a public
> >> interface, but returns DataFrame.
> >>
> >> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
> >> <tathagata.das1565@gmail.com> wrote:
> >> > DataFrame is a type alias of Dataset[Row], so externally it seems like
> >> > Dataset is the main type and DataFrame is a derivative type.
> >> > However, internally, since everything is processed as Rows, everything
> >> > uses DataFrames. The classes used as the element type of a Dataset are
> >> > converted to rows internally for processing. Therefore, internally,
> >> > DataFrame is the "main" type that is used.
> >> >
> >> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <cody@koeninger.org>
> >> > wrote:
> >> >>
> >> >> Sorry, meant DataFrame vs Dataset
> >> >>
> >> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <cody@koeninger.org>
> >> >> wrote:
> >> >> > Is there a principled reason why sql.streaming.* and
> >> >> > sql.execution.streaming.* are making extensive use of DataFrame
> >> >> > instead of Datasource?
> >> >> >
> >> >> > Or is that just a holdover from code written before the move /
> >> >> > type alias?
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> >> For additional commands, e-mail: dev-help@spark.apache.org
> >> >>
> >> >
> >
> >
>
