From: CodingCat
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Date: Mon, 8 Jan 2018 05:23:47 +0000 (UTC)

Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076999

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
           .booleanConf
           .createWithDefault(false)
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    +      "in-disk and in-memory size of data is significantly different, users can adjust this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --

    It is not necessarily the case that the Parquet size is always smaller than the in-memory size. For a simple dataset (like the one used in the test), Parquet's overhead makes the on-disk size larger than the in-memory size; with the TPC-DS dataset, however, I observed that the Parquet size is much smaller than the in-memory size.
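
For context, a minimal sketch of how a user might set such a factor, assuming the key stays "spark.sql.sources.compressionFactor" as written in the diff above; the final key name, default, and behavior may change before the PR is merged, and the app name, file paths, and the value 8.0 below are illustrative only:

    // Sketch only: the config key comes from the diff above and may change
    // before merge; the paths and the 8.0 factor are placeholders.
    import org.apache.spark.sql.SparkSession

    object CompressionFactorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("compression-factor-sketch")
          .master("local[*]")
          // If Parquet data decompresses to roughly 8x its on-disk size, scaling
          // the file-based size estimate up keeps the optimizer from, for example,
          // broadcasting a relation that is small on disk but large in memory.
          .config("spark.sql.sources.compressionFactor", "8.0")
          .getOrCreate()

        val fact = spark.read.parquet("/tmp/warehouse/fact") // placeholder path
        val dim = spark.read.parquet("/tmp/warehouse/dim")   // placeholder path

        // The join strategy (broadcast vs. sort-merge) is chosen from the
        // estimated relation sizes, which this proposal scales by the factor.
        fact.join(dim, "key").explain()

        spark.stop()
      }
    }

Because the optimizer derives its size estimate from on-disk file size, a heavily compressed Parquet table can look small enough to broadcast even though its in-memory footprint is much larger (and, as noted above, the opposite can hold for small datasets where Parquet overhead dominates); scaling the estimate by a user-supplied factor is a coarse but cheap way to correct for either case.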