Date: Mon, 4 May 2020 13:16:00 +0000 (UTC)
From: "George George (Jira)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-31635) Spark SQL Sort fails when sorting big data points

     [ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George George updated SPARK-31635:
----------------------------------
    Description: 
Please have a look at the example below:

{code:java}
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

test.toDF().as[Nested].sort("a").take(1)
{code}

*Sorting* big data objects using the *Spark DataFrame* API fails with the following exception:

{code:java}
2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
[Stage 0:======>                             (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
{code}

However, using the *RDD API* works and no exception is thrown:

{code:java}
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

test.sortBy(_.a).take(1)
{code}

For both code snippets we started the spark shell with exactly the same arguments:

{code:java}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
{code}

Even if we increase spark.driver.maxResultSize, the executors still get killed for our use case. The interesting thing is that when using the RDD API directly the problem is not there.
*Looks like there is a bug in the DataFrame sort, because it is shuffling too much data to the driver?*

Note: this is a small example and I reduced spark.driver.maxResultSize to a smaller value, but in our application I've tried setting it to 8GB and, as mentioned above, the job was still killed.
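One way to narrow down where the data is being pulled to the driver is to compare how Spark plans the two paths. The sketch below is only illustrative and was not part of the runs above: it assumes a running spark-shell session (so the spark SparkSession is in scope) and substitutes limit(1) for take(1) purely so the physical plan can be printed without executing the job.

{code:java}
// Illustrative sketch only, not something tried in the report above.
// Assumes a spark-shell session, i.e. `spark` and its implicits are already available.
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

// Print the parsed, analyzed, optimized and physical plans of the DataFrame sort.
// limit(1) stands in for take(1) here so that nothing is actually executed.
test.toDF().as[Nested].sort("a").limit(1).explain(true)

// The RDD path shown above for comparison: after sortBy, take(1) only fetches rows
// from the first partition(s), so relatively little data reaches the driver.
test.sortBy(_.a).take(1)
{code}

If the DataFrame plan shows whole task results being collected to the driver before the limit is applied, that would be consistent with the maxResultSize failure above.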
> Spark SQL Sort fails when sorting big data points
> -------------------------------------------------
>
>                 Key: SPARK-31635
>                 URL: https://issues.apache.org/jira/browse/SPARK-31635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: George George
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)