spark-issues mailing list archives

From "Stephan Reiling (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
Date Fri, 04 Aug 2017 11:20:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114255#comment-16114255 ]

Stephan Reiling edited comment on SPARK-21595 at 8/4/17 11:19 AM:
------------------------------------------------------------------

I have tried out a couple of settings for spark.sql.windowExec.buffer.spill.threshold and
have now settled on 4M rows as the default in my workflows. This gives about the same
behavior as Spark 2.1, but the right value depends on the amount of Spark memory and the
size of the rows in the dataframe.
I am not in favor of introducing another threshold for this. If the spilling is only delayed,
but then still happens with the low threshold of 4096 rows, in my case this would still spill
110k files to disk and could still cause a "too many open files" exception (right?).
Just looking at the spilling behavior, it would be better if the value specified an amount of
memory rather than a number of rows. So instead of 4096 rows, it would specify, say, 500MB of
memory and then spill chunks of 500MB to disk; how many rows that corresponds to would change
case by case.
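A minimal sketch of how a row threshold like this could be derived from a memory budget and an estimated row size, and then applied as a runtime SQL conf (the 500MB budget and 128-byte average row size are assumptions for illustration):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed numbers for illustration: a 500MB spill budget and ~128 bytes per row.
spill_budget_bytes = 500 * 1024 * 1024
avg_row_size_bytes = 128

# The threshold is row-based, so translate the memory budget into a row count.
row_threshold = spill_budget_bytes // avg_row_size_bytes  # 4,194,304 rows for these numbers

# Set the threshold before running the window query.
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", str(row_threshold))
{code}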


> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21595
>                 URL: https://issues.apache.org/jira/browse/SPARK-21595
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, PySpark
>    Affects Versions: 2.2.0
>         Environment: pyspark on linux
>            Reporter: Stephan Reiling
>            Priority: Minor
>              Labels: documentation, regression
>
> My pyspark code has the following statement:
> {code:python}
> # imports inferred from the aliases used below
> from pyspark.sql import functions as sqlf
> from pyspark.sql.window import Window
>
> # assign row key for tracking
> df = df.withColumn(
>     'association_idx',
>     sqlf.row_number().over(
>         Window.orderBy('uid1', 'uid2')
>     )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe, so this creates one large window over which the whole dataframe is sorted.
> In Spark 2.1 this works without problems; in Spark 2.2 it fails with either an out-of-memory exception or a "too many open files" exception, depending on the memory settings (which is what I tried adjusting first to fix this).
> Monitoring the blockmgr directory, I see that Spark 2.1 creates 152 files, while Spark 2.2 creates more than 110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> That message allowed me to track down the issue.
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold	2097152
> {code}
> I got it to work again and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but took a performance
hit due to the excessive spilling when using the default of 4096.
> I think to make it easier to track down these issues this config variable should be included
in the configuration documentation. 
> Maybe 4096 is too small of a default value?




