spark-issues mailing list archives

From "Carlos Bribiescas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
Date Mon, 23 Oct 2017 15:48:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Bribiescas updated SPARK-22335:
--------------------------------------
    Priority: Major  (was: Minor)

> Union for DataSet uses column order instead of types for union
> --------------------------------------------------------------
>
>                 Key: SPARK-22335
>                 URL: https://issues.apache.org/jira/browse/SPARK-22335
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Carlos Bribiescas
>
> I see that union uses column order for a DataFrame. To me this is "fine", since DataFrames
> aren't typed. However, for a Dataset, which is supposed to be strongly typed, it actually
> gives the wrong result: if you access the members by name, you get the values by column
> position instead. Here is a reproducible case on 2.2.0.
> {code:java}
>   // Assumes spark-shell, where sc and spark.implicits._ are already in scope.
>   case class AB(a: String, b: String)
>   val abDf = sc.parallelize(List(("aThing", "bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing", "aThing"))).toDF("b", "a")
>
>   // As the linked ticket states, this is "Not a problem".
>   abDf.union(baDf).show()
>
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>
>   // Wrong result: a Dataset[AB] should be mapped by type, not by column order.
>   abDs.union(baDs).show()
>
>   // Wrong result, for the same reason.
>   abDs.union(baDs).map(_.a).show()
>
>   // This also gives the wrong result.
>   abDs.union(baDs).rdd.take(2)
>
>   // However, this gives the correct result, even though the columns are out of order.
>   baDs.map(_.a).show()
>
>   // This is correct too.
>   abDs.map(_.a).show()
>
>   // The workaround from the linked issue, slightly modified.
>   // It seems wrong, since a Dataset is supposed to be strongly typed.
>   baDs.select("a", "b").as[AB].union(abDs).show()
>
>   // This, however, gives the correct result, which is logically inconsistent behavior.
>   baDs.rdd.toDF().as[AB].union(abDs).show()
> {code}
> So it's inconsistent and, IMO, a bug. I'm also not sure that the suggested workaround is
> really fair, since I'm supposed to be getting values of type `AB`. More importantly, I think
> the issue is bigger when you consider that it happens even when you read from Parquet (as you
> would expect), and that it's inconsistent when going to/from an RDD.
> I imagine it's just converting to a typed Dataset lazily rather than up front. So either that
> typing could be prioritized to happen before the union, or the union of DataFrames could be
> done with column order taken into account. Again, this is speculation.
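
The second option speculated about above (a union that takes column order into account) can be approximated in user code. The following is a minimal sketch, not Spark's implementation and not a fix proposed in the ticket: `unionAlignedByName` and `UnionByNameSketch` are hypothetical names, and the local `SparkSession` setup is an assumption made only so the example is self-contained. It assumes both Datasets expose the same set of top-level column names, without dots or other special characters.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Top-level case class so Spark can derive an Encoder for it.
case class AB(a: String, b: String)

object UnionByNameSketch {
  // Hypothetical helper (not a Spark API): reorder the columns of `right`
  // to match the column order of `left` by name, then apply the ordinary
  // positional union.
  def unionAlignedByName[T: Encoder](left: Dataset[T], right: Dataset[T]): Dataset[T] = {
    val aligned = right.select(left.columns.map(col): _*).as[T]
    left.union(aligned)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-22335-sketch")
      .getOrCreate()
    import spark.implicits._

    val abDs = Seq(("aThing", "bThing")).toDF("a", "b").as[AB]
    val baDs = Seq(("bThing", "aThing")).toDF("b", "a").as[AB]

    // With the columns aligned first, field `a` should hold "aThing" in both rows.
    unionAlignedByName(abDs, baDs).show()

    spark.stop()
  }
}
```

This is essentially the `select("a", "b")` workaround from the reproduction, generalized so the column list is taken from the left-hand Dataset instead of being written out by hand. Spark releases from 2.3.0 onward also provide `Dataset.unionByName`, which resolves columns by name rather than by position; it is not available in the affected version 2.2.0.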



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

