spark-issues mailing list archives

From "Huaxin Gao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable
Date Fri, 13 Oct 2017 22:32:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204281#comment-16204281
] 

Huaxin Gao commented on SPARK-22271:
------------------------------------

I looked at the code. Average.scala has:
```
  override lazy val evaluateExpression = child.dataType match {
    case DecimalType.Fixed(p, s) =>
      // increase the precision and scale to prevent precision loss
      val dt = DecimalType.bounded(p + 14, s + 4)
      Cast(Cast(sum, dt) / Cast(count, dt), resultType)
    ......
  }
```
With Shafique's test data, dt has precision 38 and scale 36, and count is 299. Cast(count, dt)
sets the scale to 36, so the value needs precision 39 (3 integer digits plus 36 fractional
digits). That exceeds the maximum precision of 38 and causes an overflow.
I have a fix and will submit a PR soon.
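
The arithmetic behind this can be checked without Spark. The following is a minimal sketch, not the Spark implementation: `DecimalOverflowSketch`, `bounded`, and `requiredPrecision` are hypothetical names introduced here, and the cap of 38 is assumed to match Spark's maximum decimal precision and scale.
```scala
// Standalone sketch of the precision arithmetic; no Spark dependency.
// Assumption: 38 matches Spark's maximum decimal precision and scale.
object DecimalOverflowSketch {
  val MaxPrecision = 38

  // Assumed to mirror the capping done by DecimalType.bounded(precision, scale).
  def bounded(precision: Int, scale: Int): (Int, Int) =
    (math.min(precision, MaxPrecision), math.min(scale, MaxPrecision))

  // Total digits needed to hold `value` once it carries `scale` fractional digits.
  def requiredPrecision(value: Long, scale: Int): Int =
    new java.math.BigDecimal(value).setScale(scale).precision()

  def main(args: Array[String]): Unit = {
    // The reported column is decimal(38,32), so dt = bounded(38 + 14, 32 + 4).
    val (p, s) = bounded(38 + 14, 32 + 4)
    val needed = requiredPrecision(299L, s)
    println(s"dt = decimal($p,$s), count needs precision $needed")
    println(s"overflow: ${needed > p}")
  }
}
```
Under these assumptions, the count of 299 needs 39 digits at scale 36 and does not fit in 38. If the rounded column in the report has scale 31, dt would have scale 35 and 299 would need exactly 38 digits, which is consistent with bround(..., 31) producing a mean while 32 still gives null.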

> Describe results in "null" for the value of "mean" of a numeric variable
> ------------------------------------------------------------------------
>
>                 Key: SPARK-22271
>                 URL: https://issues.apache.org/jira/browse/SPARK-22271
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: 
>            Reporter: Shafique Jamal
>            Priority: Minor
>         Attachments: decimalNumbers.zip
>
>
> Please excuse me if this issue was addressed already - I was unable to find it.
> Calling .describe().show() on my dataframe results in a value of null for the row "mean":
> {noformat}
> val foo = spark.read.parquet("decimalNumbers.parquet")        
> foo.select(col("numericvariable")).describe().show()
> foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)]
> +-------+--------------------+
> |summary|     numericvariable|
> +-------+--------------------+
> |  count|                 299|
> |   mean|                null|
> | stddev|  0.2376438793946738|
> |    min|0.037815489727642...|
> |    max|2.138189366554511...|
> +-------+--------------------+
> {noformat}
> But all of the rows for this column seem OK (I can attach a parquet file). When I round the column, however, all is fine:
> {noformat}
> foo.select(bround(col("numericvariable"), 31)).describe().show()
> +-------+---------------------------+
> |summary|bround(numericvariable, 31)|
> +-------+---------------------------+
> |  count|                        299|
> |   mean|       0.139522503183236...|
> | stddev|         0.2376438793946738|
> |    min|       0.037815489727642...|
> |    max|       2.138189366554511...|
> +-------+---------------------------+
> {noformat}
> Rounding to 32, however, also gives null.


