From: CodingCat
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #20072: [SPARK-22790][SQL] add a configurable factor to d...
Date: Mon, 8 Jan 2018 05:23:47 +0000 (UTC)

Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20072#discussion_r160076999

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -263,6 +263,17 @@ object SQLConf {
           .booleanConf
           .createWithDefault(false)
    +  val DISK_TO_MEMORY_SIZE_FACTOR = buildConf(
    +    "spark.sql.sources.compressionFactor")
    +    .internal()
    +    .doc("The result of multiplying this factor with the size of data source files is propagated " +
    +      "to serve as the stats to choose the best execution plan. In the case where the " +
    +      "in-disk and in-memory size of data is significantly different, users can adjust this " +
    +      "factor for a better choice of the execution plan. The default value is 1.0.")
    +    .doubleConf
    +    .checkValue(_ > 0, "the value of fileDataSizeFactor must be larger than 0")
    --- End diff --

    It is not necessarily the case that the Parquet size is always smaller than the in-memory size. For a simple dataset (like the one used in the test), Parquet's overhead makes the on-disk size larger than the in-memory size; with the TPC-DS dataset, however, I observed that the Parquet size is much smaller than the in-memory size.
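
For context, a minimal sketch of how a user might set such a factor, assuming the key stays "spark.sql.sources.compressionFactor" as written in the diff above; the final key name, default, and behavior may change before the PR is merged, and the app name, file paths, and the value 8.0 below are illustrative only:

    // Sketch only: the config key comes from the diff above and may change
    // before merge; the paths and the 8.0 factor are placeholders.
    import org.apache.spark.sql.SparkSession

    object CompressionFactorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("compression-factor-sketch")
          .master("local[*]")
          // If Parquet data decompresses to roughly 8x its on-disk size, scaling
          // the file-based size estimate up keeps the optimizer from, for example,
          // broadcasting a relation that is small on disk but large in memory.
          .config("spark.sql.sources.compressionFactor", "8.0")
          .getOrCreate()

        val fact = spark.read.parquet("/tmp/warehouse/fact") // placeholder path
        val dim = spark.read.parquet("/tmp/warehouse/dim")   // placeholder path

        // The join strategy (broadcast vs. sort-merge) is chosen from the
        // estimated relation sizes, which this proposal scales by the factor.
        fact.join(dim, "key").explain()

        spark.stop()
      }
    }

Because the optimizer derives its size estimate from on-disk file size, a heavily compressed Parquet table can look small enough to broadcast even though its in-memory footprint is much larger (and, as noted above, the opposite can hold for small datasets where Parquet overhead dominates); scaling the estimate by a user-supplied factor is a coarse but cheap way to correct for either case.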