Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Sat, 8 Apr 2017 10:17:41 +0000 (UTC)
From: "teobar (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.12931468.1452865510000.240944.1491646661795@Atlassian.JIRA>
In-Reply-To: <JIRA.12931468.1452865510000@Atlassian.JIRA>
References: <JIRA.12931468.1452865510000@Atlassian.JIRA> <JIRA.12931468.1452865510139@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-12837) Spark driver requires large memory
 space for serialized results even there are no data collected to the driver
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Sat, 08 Apr 2017 10:17:47 -0000


    [ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961762#comment-15961762 ] 

teobar commented on SPARK-12837:
--------------------------------

Sorry for not posting this earlier; I forgot my password and had not gone through the recovery steps.
Anyway, the workaround I used in 1.6 was to pass the following extra settings when submitting my Spark application:
{code}
--conf spark.driver.maxResultSize=0
--driver-memory 10g
{code}
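
For context, here is a minimal sketch of how these two flags might fit into a complete spark-submit call. The class name com.example.MyApp and the jar my-app.jar are hypothetical placeholders, not names from my application. Setting spark.driver.maxResultSize to 0 removes the limit entirely, which is why the driver memory is raised at the same time.

```shell
#!/bin/sh
# Sketch only: com.example.MyApp and my-app.jar are hypothetical placeholders.
# spark.driver.maxResultSize=0 disables the cap on total serialized task results,
# so the driver heap is increased as well to leave room for them.
DRIVER_OPTS="--conf spark.driver.maxResultSize=0 --driver-memory 10g"
SUBMIT_CMD="bin/spark-submit $DRIVER_OPTS --class com.example.MyApp my-app.jar"

# Print the assembled command for inspection instead of running it.
echo "$SUBMIT_CMD"
```

Because the limit is switched off, an oversized result can exhaust the driver heap instead of failing fast with the maxResultSize error, so a finite value slightly above the reported total may be a safer choice where it suffices.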


> Spark driver requires large memory space for serialized results even there are no data collected to the driver
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12837
>                 URL: https://issues.apache.org/jira/browse/SPARK-12837
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Tien-Dung LE
>            Assignee: Wenchen Fan
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> Executing a SQL statement with a large number of partitions requires a large amount of driver memory even when there are no requests to collect data back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark-shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the following code
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
> {code}
> The error message is:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
> {code}

