spark-issues mailing list archives

From "Don Drake (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
Date Tue, 07 Feb 2017 16:29:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Don Drake reopened SPARK-19477:
-------------------------------

I'm struggling with this answer.

I thought the point of Datasets was to have a strongly typed definition, rather than the more loosely typed DataFrame.

Why does it matter if I use relational or typed methods to access it?

It works if I call a map() against it:

{code}
scala> ds.map(x => x).take(1)
res7: Array[F] = Array(F(a,b,c))
{code}
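
For comparison, here is a minimal sketch (not output from the session above, assuming the same ds = df.as[F] from the description with spark.implicits._ in scope) of the distinction I mean: a relational projection still sees c4, while the typed path goes through the F encoder and drops it.

{code}
// Hedged sketch, not a transcript from this session.
// Relational (untyped) access: c4 is still in the underlying plan,
// as ds.show above suggests, so this projection succeeds.
ds.select("c4").show(1)

// Typed access: each row is decoded into F first, so c4 is gone.
ds.map(_.f1).show(1)
{code}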

But the real problem I'm having is that when I attempt to save the Dataset, the schema is ignored:

{code}
scala> ds.write.parquet("a")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

scala> val ds2 = spark.read.parquet("a").as[F]
ds2: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds2.printSchema
root
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- f3: string (nullable = true)
 |-- c4: string (nullable = true)
{code}

IMHO, the c4 column should not have been saved.
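
As a workaround sketch (untested here, assuming Spark 2.1.0 and the F case class from the description; the output paths are just illustrative), forcing the Dataset back through its encoder before writing should drop the extra column, since the typed round trip only rebuilds the case-class fields:

{code}
// map(identity) decodes each row into F and re-encodes it,
// so only f1, f2, f3 should survive into the written files.
// "a_typed" / "a_selected" are illustrative paths, not from the session above.
ds.map(identity).write.parquet("a_typed")

// Alternatively, project the case-class columns explicitly before writing.
ds.select("f1", "f2", "f3").write.parquet("a_selected")
{code}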



> [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19477
>                 URL: https://issues.apache.org/jira/browse/SPARK-19477
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a DataFrame that had extra columns, the columns not in the case class were dropped from the Dataset.
> For example, in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> +---+---+---+
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
