spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] skambha commented on pull request #29125: [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value
Date Thu, 06 Aug 2020 21:54:47 GMT

skambha commented on pull request #29125:
URL: https://github.com/apache/spark/pull/29125#issuecomment-670210823


   IIUC, the solutions you mention were also discussed earlier and were not accepted by you.
   If you do not want to revert this backport, then I hope you agree it is critical to fix it
   so that users do not run into this correctness issue. Please feel free to go ahead with
   the option you prefer.
   
   I have raised these issues before; I will summarize them below and also record them in the JIRA.
 
   
   The important issue is that we should not return incorrect results. In general, it is not
   good practice to backport a change to a stable branch when it causes more queries to return
   incorrect results.
   
   Just to reiterate:
   
   1. This PR, which backports the UnsafeRow fix, causes queries to return incorrect results
   on the v2.4.x and v3.0.x lines. The change by itself has unsafe side effects and leads to
   incorrect results being returned.
   2. It does not matter whether whole-stage codegen is on or off, or whether ANSI mode is on
   or off; in every combination more queries return incorrect results (see the overflow
   arithmetic sketched after this list):
   ```scala
   scala> import org.apache.spark.sql.functions._   // needed in spark-shell for expr, lit, sum
   import org.apache.spark.sql.functions._

   scala> val decStr = "1" + "0" * 19
   decStr: String = 10000000000000000000

   scala> // 1 row from the first range plus 11 from the second: 12 rows in total
   scala> val d3 = spark.range(0, 1, 1, 1).union(spark.range(0, 11, 1, 1))
   d3: org.apache.spark.sql.Dataset[Long] = [id: bigint]

   scala> val d5 = d3.select(expr(s"cast('$decStr' as decimal(38, 18)) as d"), lit(1).as("key")).groupBy("key").agg(sum($"d").alias("sumd")).select($"sumd")
   d5: org.apache.spark.sql.DataFrame = [sumd: decimal(38,18)]

   scala> d5.show(false)   // <-- INCORRECT RESULT: the true sum (1.2e20) overflows decimal(38,18)
   +---------------------------------------+
   |sumd                                   |
   +---------------------------------------+
   |20000000000000000000.000000000000000000|
   +---------------------------------------+
   ```
   3. Incorrect results are very serious, and it is not good for Spark users to run into them
   in common operations like sum.
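
   To make the overflow concrete, here is a minimal sketch in plain Scala (no Spark needed;
   the helper names are mine, and the expected-behavior comments describe Spark's documented
   decimal-overflow semantics, not this PR's code):

   ```scala
   // Overflow arithmetic behind the reproduction above.
   // decimal(38,18) keeps 20 integral digits, so its largest value is just below 10^20.
   val addend   = BigDecimal("1" + "0" * 19)   // 10^19, the value cast in each row
   val rowCount = 12                           // range(0,1) has 1 row, range(0,11) has 11
   val trueSum  = addend * rowCount            // 1.2 * 10^20

   val maxDecimal38x18 = BigDecimal("9" * 20 + "." + "9" * 18)  // largest decimal(38,18) value
   assert(trueSum > maxDecimal38x18)           // the true sum overflows decimal(38,18)

   // With spark.sql.ansi.enabled=false (the default) an overflowing sum should come back
   // as null, and with ANSI mode on it should raise an error; silently returning
   // 20000000000000000000 (2 * 10^19) is wrong under both settings.
   ```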
      

