Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Sat, 4 Nov 2017 18:13:01 +0000 (UTC)
From: "Xiao Li (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13115441.1509563526000.151367.1509819181048@Atlassian.JIRA>
In-Reply-To: <JIRA.13115441.1509563526000@Atlassian.JIRA>
References: <JIRA.13115441.1509563526000@Atlassian.JIRA> <JIRA.13115441.1509563526238@jira-lw-us.apache.org>
Subject: [jira] [Reopened] (SPARK-22411) Heuristic to combine splits in
 DataSourceScanExec isn't accurate when dynamic allocation is enabled
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Sat, 04 Nov 2017 18:13:07 -0000


     [ https://issues.apache.org/jira/browse/SPARK-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reopened SPARK-22411:
-----------------------------

> Heuristic to combine splits in DataSourceScanExec isn't accurate when dynamic allocation is enabled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22411
>                 URL: https://issues.apache.org/jira/browse/SPARK-22411
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Vinitha Reddy Gankidi
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Major
>             Fix For: 2.3.0
>
>
> The heuristic to calculate the maxSplitSize in DataSourceScanExec is as follows:
> https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
> Default parallelism in this case is the total number of cores across all executors registered for this application. This works well with static allocation, but with dynamic allocation enabled this value is usually one at the time of split calculation (with the default config, min and initial executors are both zero). This heuristic was introduced in SPARK-14582.
> With dynamic allocation enabled, it is confusing to tune the split size with this heuristic. It is better to ignore bytesPerCore and use the value of 'spark.sql.files.maxPartitionBytes' as the max split size.
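
To illustrate why the computed split size depends on how many cores are registered at planning time, here is a minimal sketch of the heuristic as I understand it from the linked Spark source. It is written in Python rather than Scala for brevity and is not the actual Spark code. The function names, the per-file open cost term (spark.sql.files.openCostInBytes), and the default values of 128 MiB and 4 MiB are assumptions not stated in the issue text above.

```python
MIB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MIB,   # assumed default of spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * MIB):     # assumed default of spark.sql.files.openCostInBytes
    """Sketch of the existing heuristic: cap the split size by bytes per core."""
    # Each file is padded with the estimated cost of opening it.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def proposed_max_split_bytes(max_partition_bytes=128 * MIB):
    """Sketch of the proposal in this issue: ignore bytesPerCore entirely."""
    return max_partition_bytes


if __name__ == "__main__":
    files = [256 * MIB] * 4  # hypothetical input: four 256 MiB files

    # Parallelism of 1 stands in for dynamic allocation before executors register;
    # 200 stands in for a statically allocated cluster.
    for parallelism in (1, 200):
        split = max_split_bytes(files, parallelism)
        print(f"parallelism={parallelism}: maxSplitBytes={split}")

    print(f"proposed: maxSplitBytes={proposed_max_split_bytes()}")
```

With the same input files, the existing heuristic returns a different split size depending only on the parallelism seen at planning time, while the proposed rule depends only on the configured value.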

