spark-issues mailing list archives

From "Herman van Hovell (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (SPARK-18358) Multiple Aggregation Using 'countDistinct' and 'first' result in error
Date Tue, 22 Nov 2016 16:13:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-18358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell closed SPARK-18358.
-------------------------------------
       Resolution: Duplicate
    Fix Version/s: 2.0.2

> Multiple Aggregation Using 'countDistinct' and 'first' result in error 
> -----------------------------------------------------------------------
>
>                 Key: SPARK-18358
>                 URL: https://issues.apache.org/jira/browse/SPARK-18358
>             Project: Spark
>          Issue Type: Bug
>         Environment: Mac OS X 10.9.5
> Apache Spark 2.0.1
> Hadoop 1.4
>            Reporter: Chris Nasrallah
>             Fix For: 2.0.2
>
>
> Using pyspark, when I attempt to perform multiple aggregations on the same groupBy object
> using the functions 'first' and 'countDistinct', it results in a Py4JJavaError.
> {code:borderStyle=solid}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sfn
> spark = SparkSession.builder.master('local').getOrCreate()
> df = spark.createDataFrame([
>         (1, 'a', 'z'),
>         (1, 'b', 'x'),
>         (1, 'a', 'y'),
>         (1, 'a', 'x'),
>         (2, 'b', 'z'),
>         (2, 'b', 'z')
>     ], ['id', 'var1', 'var2'])
> ## Using two 'first' and one 'countDistinct' aggregations works
> df.groupby('id')    \
>         .agg(sfn.first('var1'),  \
>                 sfn.first('var2'),  \
>                 sfn.countDistinct('var1')).show()
>                          
> ## Using one 'max' with both 'countDistinct' works:
> df.groupby('id')    \
>          .agg(sfn.max('var2'),                \
>                  sfn.countDistinct('var1'),   \
>                  sfn.countDistinct('var2')).show()
> ## But using both 'countDistinct' with at least one 'first' crashes
> df.groupby('id')    \
>         .agg(sfn.first('var1'),   \
>                 sfn.first('var2'),   \
>                 sfn.countDistinct('var1'), \
>                 sfn.countDistinct('var2')) \
>         .show()
> {code}
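
For reference, here is a minimal pure-Python sketch of what the third (failing) aggregation would be expected to return for the sample rows above. It is not Spark code and says nothing about how Spark implements these functions. The helper name expected_aggregates is hypothetical, and the sketch assumes rows are seen in the order listed. Spark's 'first' gives no such ordering guarantee without an explicit sort, so the 'first' columns may differ on a real cluster.

```python
from collections import OrderedDict

# Sample rows copied from the report: (id, var1, var2)
ROWS = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregates(rows):
    """Hypothetical helper: per id, return
    (id, first(var1), first(var2), countDistinct(var1), countDistinct(var2)).

    Assumes rows are processed in list order; this is an assumption of the
    sketch, not a guarantee Spark makes for 'first'.
    """
    groups = OrderedDict()
    for row_id, var1, var2 in rows:
        # The first row seen for an id fixes the 'first' values.
        group = groups.setdefault(row_id, {
            'first_var1': var1,
            'first_var2': var2,
            'distinct_var1': set(),
            'distinct_var2': set(),
        })
        group['distinct_var1'].add(var1)
        group['distinct_var2'].add(var2)
    return [
        (row_id,
         group['first_var1'],
         group['first_var2'],
         len(group['distinct_var1']),
         len(group['distinct_var2']))
        for row_id, group in groups.items()
    ]


if __name__ == '__main__':
    for row in expected_aggregates(ROWS):
        print(row)
```

Under the ordering assumption this yields (1, 'a', 'z', 2, 3) for id 1 and (2, 'b', 'z', 1, 1) for id 2, which can serve as a sanity check against the output of the pyspark query on a release where it no longer fails.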



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
