spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] attilapiros commented on a change in pull request #26016: [SPARK-24914][SQL] New statistic to improve data size estimate for columnar storage formats
Date Wed, 20 Nov 2019 19:43:15 GMT
attilapiros commented on a change in pull request #26016: [SPARK-24914][SQL] New statistic
to improve data size estimate for columnar storage formats
URL: https://github.com/apache/spark/pull/26016#discussion_r348708931
 
 

 ##########
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
 ##########
 @@ -1186,10 +1186,17 @@ private[hive] object HiveClientImpl {
     // return None.
     // In Hive, when statistics gathering is disabled, `rawDataSize` and `numRows` is always
     // zero after INSERT command. So they are used here only if they are larger than zero.
+    val deserFactor = properties.get(STATISTICS_DESER_FACTOR).map(_.toInt)
     if (totalSize.isDefined && totalSize.get > 0L) {
-      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
+      Some(CatalogStatistics(
+        sizeInBytes = totalSize.get,
+        deserFactor = deserFactor,
+        rowCount = rowCount.filter(_ > 0)))
     } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
-      Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
+      Some(CatalogStatistics(
+        sizeInBytes = rawDataSize.get,
 
 Review comment:
   In this case (when only `rawDataSize` is defined) I will set `deserFactor` to `None`
to avoid the extra scaling, since `rawDataSize` is already the "approximate size of data in memory".

   The Hive 1.2 value you are referring to is probably a Hive bug.
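
   A minimal sketch of the branching described above, not the actual patch. `Stats` and
`StatsSketch.readStats` are hypothetical, simplified stand-ins for `CatalogStatistics` and
the statistics-reading code in `HiveClientImpl`, and any further branches of the real
method are omitted. The idea is that the factor is kept only when the on-disk `totalSize`
is the source of `sizeInBytes`:

```scala
// Hypothetical, simplified stand-in for CatalogStatistics (assumption:
// only the three fields visible in the diff are modelled here).
final case class Stats(
    sizeInBytes: BigInt,
    deserFactor: Option[Int] = None,
    rowCount: Option[BigInt] = None)

object StatsSketch {
  // Hypothetical helper mirroring the two branches shown in the diff.
  def readStats(
      totalSize: Option[BigInt],
      rawDataSize: Option[BigInt],
      rowCount: Option[BigInt],
      deserFactor: Option[Int]): Option[Stats] = {
    // Row counts of zero are treated as "unknown", as in the diff.
    val rows = rowCount.filter(_ > 0)
    if (totalSize.exists(_ > 0)) {
      // totalSize is the size on disk, so the factor still applies.
      Some(Stats(totalSize.get, deserFactor, rows))
    } else if (rawDataSize.exists(_ > 0)) {
      // rawDataSize already approximates the in-memory size:
      // drop the factor to avoid scaling twice.
      Some(Stats(rawDataSize.get, None, rows))
    } else {
      None
    }
  }
}
```

   With this shape, a table that reports only `rawDataSize` yields statistics whose
`deserFactor` is `None`, even if a factor is stored in the table properties.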


