spark-issues mailing list archives

From "Michael Armbrust (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
Date Fri, 19 Dec 2014 21:04:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254016#comment-14254016 ]

Michael Armbrust commented on SPARK-4564:
-----------------------------------------

It is, however, consistent with SQL, where GROUP BY expressions are only included in the
output if they are part of the SELECT clause.  Since the goal here is to provide programmatic
SQL, I'm inclined to stick with the current semantics.  Changing this would also be a fairly
major breaking change to the API if people were dependent on the position of columns in the result.

> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4564
>                 URL: https://issues.apache.org/jira/browse/SPARK-4564
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>         Environment: Mac OSX, local mode, but should hold true for all environments
>            Reporter: Dean Wampler
>
> In the following example, I would expect the "grouped" schema to contain two fields, the String name and the Long count, but it only contains the Long count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three",   1),
>   Record("three",   2),
>   Record("two",     3),
>   Record("three",   4),
>   Record("two",     5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> //  |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}
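
Under the semantics described in the comment above, the second parameter list of groupBy plays the role of the SELECT clause. A minimal sketch of how the grouping column can be kept in the output, assuming the Spark 1.1 shell (where `sc` already exists) and the same Record data as in the report; the value names `countsOnly` and `withKey` are illustrative only:

```scala
// Sketch for the Spark 1.1 shell; assumes `sc` (a SparkContext) is already defined.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.expressions._

val sqlc = new SQLContext(sc)
import sqlc._  // brings createSchemaRDD and the Symbol-to-expression implicits into scope

case class Record(name: String, n: Int)

// Convert explicitly so that groupBy below is SchemaRDD.groupBy, not RDD.groupBy.
val recs = createSchemaRDD(sc.parallelize(List(
  Record("three", 1), Record("three", 2), Record("two", 3),
  Record("three", 4), Record("two", 5))))

// Analogous to SELECT COUNT(n) ... GROUP BY name:
// only the aggregate is projected, as in the report.
val countsOnly = recs.groupBy('name)(Count('n) as 'count)

// Analogous to SELECT name, COUNT(n) ... GROUP BY name:
// the grouping expression is repeated in the second parameter list,
// so it becomes the first column of the result.
val withKey = recs.groupBy('name)('name, Count('n) as 'count)

withKey.printSchema                   // schema should now list name, then count
withKey.collect().foreach(println)    // one row per distinct name
```

Repeating the grouping expression keeps the existing behaviour intact for callers who rely on column positions, which is the compatibility concern raised in the comment.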



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
