Date: Mon, 4 May 2020 13:16:00 +0000 (UTC)
From: "George George (Jira)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-31635) Spark SQL Sort fails when sorting big data points

     [ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George George updated SPARK-31635:
----------------------------------
    Description: 
Please have a look at the example below:

{code:java}
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

test.toDF().as[Nested].sort("a").take(1)
{code}

*Sorting* big data objects using the *Spark DataFrame* API fails with the following exception:

{code:java}
2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
[Stage 0:======>                             (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
{code}

However, using the *RDD API* works and no exception is thrown:

{code:java}
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

test.sortBy(_.a).take(1)
{code}

For both code snippets we started the spark shell with exactly the same arguments:

{code:java}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
{code}

Even if we increase spark.driver.maxResultSize, the executors still get killed for our use case. The interesting thing is that when using the RDD API directly the problem is not there.
*Looks like there is a bug in the DataFrame sort, because it is shuffling too much data to the driver?*

Note: this is a small example and I reduced spark.driver.maxResultSize to a smaller value, but in our application I've tried setting it to 8GB and, as mentioned above, the job was still killed.
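One way to narrow down where the data is being pulled to the driver is to compare how Spark plans the two paths. The sketch below is only illustrative and was not part of the runs above: it assumes a running spark-shell session (so the spark SparkSession is in scope) and substitutes limit(1) for take(1) purely so the physical plan can be printed without executing the job.

{code:java}
// Illustrative sketch only, not something tried in the report above.
// Assumes a spark-shell session, i.e. `spark` and its implicits are already available.
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

// Print the parsed, analyzed, optimized and physical plans of the DataFrame sort.
// limit(1) stands in for take(1) here so that nothing is actually executed.
test.toDF().as[Nested].sort("a").limit(1).explain(true)

// The RDD path shown above for comparison: after sortBy, take(1) only fetches rows
// from the first partition(s), so relatively little data reaches the driver.
test.sortBy(_.a).take(1)
{code}

If the DataFrame plan shows whole task results being collected to the driver before the limit is applied, that would be consistent with the maxResultSize failure above.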
> Spark SQL Sort fails when sorting big data points
> -------------------------------------------------
>
>                 Key: SPARK-31635
>                 URL: https://issues.apache.org/jira/browse/SPARK-31635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: George George
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)