spark-reviews mailing list archives

From: CodingCat <...@git.apache.org>
Subject: [GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Date: Mon, 08 Jan 2018 05:23:47 GMT
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076999
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is
propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the
" +
    +      "in-disk and in-memory size of data is significantly different, users can adjust
this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --
    
    It's not necessarily the case that the Parquet size is always smaller than the in-memory size. For example, with some simple datasets (like the one used in the test), Parquet's overhead makes the overall on-disk size larger than the in-memory size.
    
    But with the TPCDS dataset, I observed that the Parquet size is much smaller than the in-memory size.
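    
    As a rough illustration (a minimal sketch, not part of the patch, assuming the config name spark.sql.sources.compressionFactor as shown in the diff above, which may still change during review), this is how a user could raise the factor when on-disk Parquet is much smaller than the in-memory representation:
    
    // Minimal sketch; the config name/default follow the diff above and may change.
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("compression-factor-sketch")
      .master("local[*]")
      .getOrCreate()
    
    // For heavily compressed Parquet data (e.g. TPCDS), the on-disk size can
    // understate the in-memory size, so a factor > 1 inflates the estimated
    // stats the planner uses (e.g. for broadcast-join decisions).
    spark.conf.set("spark.sql.sources.compressionFactor", "5.0")
    
    // Hypothetical path, for illustration only.
    val df = spark.read.parquet("/path/to/tpcds/store_sales")
    df.createOrReplaceTempView("store_sales")
    spark.sql("SELECT count(*) FROM store_sales").explain()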


---
