spark-issues mailing list archives

From "Mark Hamstra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
Date Tue, 27 May 2014 16:40:03 GMT

    [ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009875#comment-14009875 ]

Mark Hamstra commented on SPARK-983:
------------------------------------

I'm hoping these can be kept orthogonal, but it is worth noting the existence
of https://issues.apache.org/jira/browse/SPARK-1021 and the fact that sortByKey, as it currently
exists, breaks Spark's "transformations of RDDs are lazy" contract.  I'm currently working
on that issue, and its fix will undoubtedly require at least some merge work to be compatible
with the resolution of this issue.

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions that creates a buffer to
> hold the entire partition and then sorts it. This causes an OOM if an entire partition cannot
> fit in memory, which is especially problematic for skewed data. Rather than OOMing, the behavior
> should be similar to the [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
> where we fall back to disk if we detect memory pressure.
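
The spill-to-disk behavior requested above can be illustrated with a generic external merge sort. The following is a minimal Python sketch, not Spark's implementation and not the ExternalAppendOnlyMap code: the function name external_sort_by_key, the helpers _spill and _read_run, and the max_in_memory parameter are all hypothetical, and a simple record count stands in for real memory-pressure detection.

```python
import heapq
import pickle
import tempfile
from typing import Any, Iterable, Iterator, List, Tuple

Pair = Tuple[Any, Any]


def _spill(sorted_run: List[Pair]):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for pair in sorted_run:
        pickle.dump(pair, f)
    f.seek(0)
    return f


def _read_run(f) -> Iterator[Pair]:
    """Stream pairs back from a spilled run, one at a time."""
    try:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
    finally:
        f.close()


def external_sort_by_key(records: Iterable[Pair],
                         max_in_memory: int = 100_000) -> Iterator[Pair]:
    """Sort (key, value) pairs by key, holding at most max_in_memory pairs.

    Assumption: a record-count threshold approximates "memory pressure";
    a real system would estimate the buffer's size in bytes instead.
    """
    runs = []
    buffer: List[Pair] = []
    for pair in records:
        buffer.append(pair)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it and spill it to disk as one run.
            buffer.sort(key=lambda kv: kv[0])
            runs.append(_spill(buffer))
            buffer = []
    # The final partial buffer stays in memory and joins the merge directly.
    buffer.sort(key=lambda kv: kv[0])
    streams = [_read_run(f) for f in runs] + [iter(buffer)]
    # k-way merge of the sorted runs; only one pair per run is resident.
    return heapq.merge(*streams, key=lambda kv: kv[0])
```

When the input fits within max_in_memory, no file is written and the function reduces to an ordinary in-memory sort, which matches the issue's intent of going to disk only when needed.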



--
This message was sent by Atlassian JIRA
(v6.2#6252)
