spark-issues mailing list archives

From "Tejas Patil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
Date Thu, 03 Aug 2017 22:37:02 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113622#comment-16113622 ]

Tejas Patil commented on SPARK-21595:
-------------------------------------

[~hvanhovell] : I am fine with either of the options you mentioned.

One more option: right now, both the switch from the in-memory buffer to `UnsafeExternalSorter` and `UnsafeExternalSorter` spilling to disk are controlled by a single threshold. If we de-couple those two using separate thresholds, the "spill on memory pressure" behavior will be achieved: the in-memory threshold can be kept small, while keeping the disk-spill threshold higher will avoid excessive disk spills. This is a fairly simple change to make. What do you think?
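To illustrate the decoupling idea, here is a minimal sketch (not Spark's actual code; the class name, threshold names, and values are all illustrative assumptions): the buffer hands rows to an external sorter at a small threshold, but only pays the cost of a disk spill at a separate, much larger threshold.

```python
IN_MEMORY_THRESHOLD = 4096      # assumed: switching sorters is cheap, no disk I/O
DISK_SPILL_THRESHOLD = 2097152  # assumed: disk spills are expensive, so spill late

class WindowBufferSketch:
    """Toy model of a window buffer with decoupled thresholds."""

    def __init__(self):
        self.rows = []
        self.using_external_sorter = False
        self.disk_spills = 0

    def add(self, row):
        self.rows.append(row)
        if not self.using_external_sorter and len(self.rows) >= IN_MEMORY_THRESHOLD:
            # Switch data structures; nothing is written to disk yet.
            self.using_external_sorter = True
        if self.using_external_sorter and len(self.rows) >= DISK_SPILL_THRESHOLD:
            # Only now pay the cost of a spill file.
            self.rows.clear()
            self.disk_spills += 1
```

Under this sketch, a partition of 100,000 rows would switch to the external sorter once but never touch disk, whereas a single shared 4096 threshold forces a spill for every 4096 rows.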

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21595
>                 URL: https://issues.apache.org/jira/browse/SPARK-21595
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, PySpark
>    Affects Versions: 2.2.0
>         Environment: pyspark on linux
>            Reporter: Stephan Reiling
>            Priority: Minor
>              Labels: documentation, regression
>
> My pyspark code has the following statement:
> {code:python}
> import pyspark.sql.functions as sqlf
> from pyspark.sql.window import Window
>
> # assign row key for tracking
> df = df.withColumn(
>         'association_idx',
>         sqlf.row_number().over(
>             Window.orderBy('uid1', 'uid2')
>         )
>     )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates one large window for the whole dataframe to sort over.
> In Spark 2.1 this works without problem; in Spark 2.2 it fails with either an out-of-memory exception or a "too many open files" exception, depending on the memory settings (which is what I tried adjusting first to fix this).
> Monitoring the blockmgr, I see that Spark 2.1 creates 152 files, while Spark 2.2 creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> This allowed me to track down the issue.
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold	2097152
> {code}
> I got it to work again, with the same performance as Spark 2.1.
> I also have workflows using window functions that do not fail, but they took a performance hit due to the excessive spilling with the default of 4096.
> To make issues like this easier to track down, I think this config variable should be included in the configuration documentation.
> Maybe 4096 is too small a default value?
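For reference, the reporter's workaround can also be applied when building the session. A minimal PySpark sketch, assuming the key from the report above and the standard session-builder pattern (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Raise the window-spill threshold from its 4096 default, as in the report above.
spark = (
    SparkSession.builder
    .appName("window-spill-threshold")  # hypothetical app name
    .config("spark.sql.windowExec.buffer.spill.threshold", 2097152)
    .getOrCreate()
)
```

The same setting can be passed as `--conf spark.sql.windowExec.buffer.spill.threshold=2097152` on spark-submit or placed in spark-defaults.conf.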



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

