spark-issues mailing list archives

From "George George (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-31635) Spark SQL Sort fails when sorting big data points
Date Mon, 04 May 2020 13:16:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George George updated SPARK-31635:
----------------------------------
    Description: 
 Please have a look at the example below: 
{code:java}
case class Point(x:Double, y:Double)
case class Nested(a: Long, b: Seq[Point])
val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
test.toDF().as[Nested].sort("a").take(1)
{code}
 *Sorting* big data objects with the *Spark DataFrame* API fails with the following exception: 
{code:java}
2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
[Stage 0:======>                                                 (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
{code}
However, the same job using the *RDD API* works and no exception is thrown: 
{code:java}
case class Point(x:Double, y:Double)
case class Nested(a: Long, b: Seq[Point])
val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
test.sortBy(_.a).take(1)
{code}
For both code snippets we started the Spark shell with exactly the same arguments:
{code:java}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
{code}
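As a stopgap (not a fix for the underlying behavior), the cap can be raised or disabled. Per the Spark configuration docs, setting {{spark.driver.maxResultSize}} to 0 removes the limit entirely; the values below are illustrative only, and a larger cap trades the abort for possible driver OOMs:
{code:java}
# Sketch: relax the driver result-size cap (illustrative values, not a recommendation).
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=2g"

# Per the Spark configuration docs, 0 disables the limit entirely --
# the driver may then run out of memory instead of aborting the job.
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=0"
{code}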
Even if we increase spark.driver.maxResultSize, the executors are still killed for our use case. Interestingly, the problem does not occur when using the RDD API directly. *This looks like a bug in the DataFrame sort: is it shuffling too much data to the driver?* 
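A back-of-envelope estimate (mine, not from the report) shows why only a handful of tasks is enough to trip the 100 MB cap: each Nested row carries 250,000 points of two doubles each, so the raw payload alone is about 3.8 MiB per row, before any serialization overhead.
{code:java}
// Rough size estimate for the example data. Assumption: ~16 bytes of raw
// payload per Point (two 8-byte doubles), ignoring object and serialization
// overhead, which in practice roughly doubles the per-task size seen in the log.
object ResultSizeEstimate {
  val rowsTotal     = 100L      // 100 Nested rows, one per partition/task
  val pointsPerRow  = 250000L
  val bytesPerPoint = 16L       // Point(x: Double, y: Double)
  val bytesPerRow   = pointsPerRow * bytesPerPoint // 4,000,000 bytes ~ 3.8 MiB
  val totalBytes    = rowsTotal * bytesPerRow      // 400,000,000 bytes ~ 381 MiB

  def main(args: Array[String]): Unit = {
    println(f"per row: ${bytesPerRow / 1048576.0}%.1f MiB, total: ${totalBytes / 1048576.0}%.1f MiB")
  }
}
{code}
If the sorted partitions are materialized back to the driver, a few dozen of the 100 tasks already exceed a 100 MB cap on raw payload alone; with serialization overhead this is consistent with the 13-14 tasks reported in the log above.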

Note: this is a small example and I reduced spark.driver.maxResultSize to a smaller value for demonstration; in our real application I tried setting it to 8 GB, but as mentioned above the job was still killed. 

 


> Spark SQL Sort fails when sorting big data points
> -------------------------------------------------
>
>                 Key: SPARK-31635
>                 URL: https://issues.apache.org/jira/browse/SPARK-31635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: George George
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

