spark-issues mailing list archives

From "Herman van Hovell (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks
Date Fri, 25 Nov 2016 17:09:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696321#comment-15696321 ]

Herman van Hovell edited comment on SPARK-17788 at 11/25/16 5:09 PM:
---------------------------------------------------------------------

That is fair. The solution is not that straightforward, TBH:
- Always add some kind of tie-breaking value to the range. This could be random, but I'd rather
add something like monotonically_increasing_id(). This always incurs some cost.
- Only add a tie-breaker when you have (or suspect) skew. Here we need to add some heavy-hitter
algorithm, which is potentially much more resource intensive than reservoir sampling. The other
issue is that when we suspect skew, we would need to scan the data again (which would bring the
total number of scans to 3).

So I would be slightly in favor of option 1, plus a flag to disable it (see the sketch below).
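
A minimal sketch of what option 1 might look like, expressed as user-level code rather than the
planner change it would actually be (the table name comes from the report below; the column name
and app name are placeholders):

// Hedged sketch of option 1: append monotonically_increasing_id() as a tie-breaker,
// so rows that share the same sort-key value can still be split across range partitions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val spark = SparkSession.builder().appName("tie-breaker-sketch").getOrCreate()
val field = "my_double_field"  // hypothetical column name

val sorted = spark.table("MY_TABLE")
  .withColumn("tie_breaker", monotonically_increasing_id())
  .orderBy(col(field).desc, col("tie_breaker"))

The extra column changes the boundaries the range sampler picks, so a heavily repeated value no
longer forces all of its rows into a single partition; the cost is one extra long column carried
through the sort, which is why a flag to disable it would be useful.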



> RangePartitioner results in few very large tasks and many small to empty tasks 
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17788
>                 URL: https://issues.apache.org/jira/browse/SPARK-17788
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>            Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in Spark (~140GB
for the entire table, this single field is a Double, ~1.4B records) and look at the sorted
output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than
17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner trying to
create equal ranges. [1]
> [1] https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html

>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of the data,
which roughly equates to 1 billion records); other values in the dataset are as high as 2000.
With the RangePartitioner trying to create equal ranges, some tasks are becoming almost empty
while others are extremely large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. I hope one
of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
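
A rough way to see the imbalance described above (sketched on a small sample, since the full
1.4B-row sort already fails with the page-allocation error; the column name is a placeholder) is
to count the rows that land in each output partition. If many rows share the exact same value,
range boundaries cannot split them and a few partitions end up holding most of the data:

// Hedged sketch: count rows per partition of a sorted sample to observe the skew
// that range partitioning produces on a heavily repeated Double column.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partition-skew-check").getOrCreate()
val field = "my_double_field"  // hypothetical column name

val sortedSample = spark.table("MY_TABLE")
  .select(field)
  .sample(withReplacement = false, fraction = 0.01)
  .orderBy(col(field).desc)

// One (partitionId, rowCount) pair per output partition of the sort.
val perPartition = sortedSample.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()

perPartition.sortBy(-_._2).take(10).foreach { case (idx, n) =>
  println(s"partition $idx: $n rows")
}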



