spark-issues mailing list archives

From "Jurriaan Pruis (JIRA)" <>
Subject [jira] [Commented] (SPARK-16753) Spark SQL doesn't handle skewed dataset joins properly
Date Thu, 28 Jul 2016 08:00:35 GMT


Jurriaan Pruis commented on SPARK-16753:

I've set the following options:
spark.memory.offHeap.enabled	true
spark.memory.offHeap.size	1500M

Still getting errors like:
org.apache.spark.shuffle.FetchFailedException: Too large frame: 2690834287
java.lang.OutOfMemoryError: Unable to acquire 400 bytes of memory, got 0
ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Container
marked as failed: container_1469605743211_0916_01_000052 on host:
Exit status: 52. Diagnostics: Exception from container-launch.
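
For scale, the frame size in the first error can be compared against a 2 GiB ceiling. This is a rough sketch that assumes the limit on a single fetched frame is Java's `Integer.MAX_VALUE` (2^31 - 1 bytes), which is my understanding of the check behind the "Too large frame" message in these Spark versions. The variable names are illustrative only.

```python
# Frame size reported by the FetchFailedException above, in bytes.
frame_size = 2690834287

# Assumed ceiling: Java's Integer.MAX_VALUE (2**31 - 1 bytes).
max_frame_size = 2**31 - 1

excess = frame_size - max_frame_size
frame_gib = frame_size / 2**30

print(f"frame: {frame_gib:.2f} GiB, over the assumed limit by {excess} bytes")
```

If that assumption holds, a single shuffle block of about 2.5 GiB for one partition is what trips the error, which is consistent with a heavily skewed join key.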

Metrics for the task that hit the OutOfMemoryError:
Memory spill: 1808.1 MB
Shuffle Read MB/Records: 	656.9 MB / 5040711

There were tasks with way more memory spill, see screenshot.

> Spark SQL doesn't handle skewed dataset joins properly
> ------------------------------------------------------
>                 Key: SPARK-16753
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>            Reporter: Jurriaan Pruis
> I'm having issues joining a skewed 1-billion-row dataframe with multiple dataframes
ranging in size from 100,000 to 10 million rows. This means some of the joins
(about half of them) can be done using broadcast, but not all.
> Because the data in the large dataframe is skewed, we get out-of-memory errors in the
executors or errors like: `org.apache.spark.shuffle.FetchFailedException: Too large frame`.
> We tried a lot of things, like broadcast joining the skewed rows separately and unioning
them with the dataset containing the sort merge joined data. This works perfectly when doing
one or two joins, but when doing 10 joins like this the query planner gets confused (see [SPARK-15326]).
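
The split-and-union workaround described above can be sketched in plain Python. This is not the reporter's code and uses no Spark API: `broadcast_join`, `sort_merge_join`, `skew_aware_join`, and `skew_threshold` are hypothetical names, and rows are assumed to be `(key, value)` tuples with comparable, non-NULL keys.

```python
from collections import Counter, defaultdict


def broadcast_join(left_rows, right_rows):
    """Hash join: build a lookup table from the small side, probe with the big side."""
    table = defaultdict(list)
    for key, right_value in right_rows:
        table[key].append(right_value)
    return [
        (key, left_value, right_value)
        for key, left_value in left_rows
        for right_value in table.get(key, [])
    ]


def sort_merge_join(left_rows, right_rows):
    """Inner join by sorting both sides on the key and merging them."""
    left = sorted(left_rows, key=lambda row: row[0])
    right = sorted(right_rows, key=lambda row: row[0])
    joined, start = [], 0
    for key, left_value in left:
        # Skip right rows whose key is smaller than the current left key.
        while start < len(right) and right[start][0] < key:
            start += 1
        pos = start
        while pos < len(right) and right[pos][0] == key:
            joined.append((key, left_value, right[pos][1]))
            pos += 1
    return joined


def skew_aware_join(big_rows, small_rows, skew_threshold):
    """Join skewed keys with a broadcast-style join, the rest with sort merge."""
    counts = Counter(key for key, _ in big_rows)
    hot_keys = {key for key, n in counts.items() if n > skew_threshold}

    hot_big = [row for row in big_rows if row[0] in hot_keys]
    cold_big = [row for row in big_rows if row[0] not in hot_keys]
    hot_small = [row for row in small_rows if row[0] in hot_keys]
    cold_small = [row for row in small_rows if row[0] not in hot_keys]

    # Union of the two partial results.
    return broadcast_join(hot_big, hot_small) + sort_merge_join(cold_big, cold_small)


big = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (3, "e")]
small = [(1, "x"), (2, "y"), (4, "z")]
result = skew_aware_join(big, small, skew_threshold=2)
print(result)
```

Because the two branches cover disjoint key sets, the union holds the same rows as a single join over everything. Each additional joined table repeats this split, which is presumably where the plan grows large with 10 joins.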
> As most of the rows are skewed on the NULL value, we use a hack where we put unique values
in those NULL columns so the data is properly distributed over all partitions. This works
fine for NULL values, but since this table is growing rapidly and we have skewed data for
non-NULL values as well, this isn't a full solution to the problem.
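
The effect of the NULL hack can be illustrated with a small plain-Python simulation. It is only a sketch: `partition_of` is a made-up stand-in for a hash partitioner (not Spark's actual hash), `fill_null_keys` and the `__null_` prefix are hypothetical, and the row counts are invented for the demonstration.

```python
import hashlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's own hash)."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def fill_null_keys(rows, prefix="__null_"):
    """Give every None key its own placeholder so the rows spread out.

    Assumes no real key starts with `prefix`, so placeholders never match
    anything on the other side of the join, just as NULL would not.
    """
    return [
        (f"{prefix}{i}" if key is None else key, value)
        for i, (key, value) in enumerate(rows)
    ]


def partition_sizes(rows, num_partitions):
    """Count how many rows land in each partition."""
    return Counter(partition_of(key, num_partitions) for key, _ in rows)


# 1,000 rows with a NULL key plus 100 rows with distinct real keys.
rows = [(None, i) for i in range(1000)] + [(k, k) for k in range(100)]

before = partition_sizes(rows, num_partitions=8)
after = partition_sizes(fill_null_keys(rows), num_partitions=8)

print("largest partition before:", max(before.values()))
print("largest partition after: ", max(after.values()))
```

All NULL-keyed rows hash to the same partition before the fill and spread across partitions afterwards. The same trick does not carry over directly to skewed non-NULL keys, because those rows still have to find their matches on the other side of the join.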
> Right now this specific Spark task runs well 30% of the time, and it's getting worse
because of the increasing amount of data.
> How should these kinds of joins be approached using Spark? It seems weird that I can't find
proper solutions for this problem, or other people having the same kind of issues, when Spark
profiles itself as a large-scale data processing engine. Joins on big datasets should be
something Spark has no problem with out of the box.
