spark-issues mailing list archives

From "Andrew Ash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
Date Wed, 09 Apr 2014 14:34:14 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964207#comment-13964207 ]

Andrew Ash commented on SPARK-1021:
-----------------------------------

https://github.com/ash211/spark/commit/a62e828234d5b69585495593730032f2877932ae

> sortByKey() launches a cluster job when it shouldn't
> ----------------------------------------------------
>
>                 Key: SPARK-1021
>                 URL: https://issues.apache.org/jira/browse/SPARK-1021
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Andrew Ash
>              Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the documentation, but it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happened to have the same rangeBounds(), but it seems unlikely that this would really harm performance since it's probably unlikely that the range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the discussion and testing.
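
For readers following the proposal quoted above, here is a minimal sketch in Scala that does not depend on Spark. It only illustrates the two ideas in the quote: computing the bounds in a lazy val, and an equals() that compares the partition count, the RDD id, and the sort ordering without touching the bounds. FakeRdd, sampleSorted, jobsLaunched, and LazyRangePartitioner are hypothetical names made up for this illustration; they are not Spark classes, and the bound selection is a simplification rather than what Partitioner.scala does.

```scala
// Hypothetical stand-in for an RDD (NOT Spark's RDD class).
// `id` identifies the dataset; sampleSorted() models work that would
// launch a cluster job, and jobsLaunched counts how often that happens.
final class FakeRdd(val id: Int, data: Seq[Int]) {
  var jobsLaunched = 0
  def sampleSorted(): Seq[Int] = {
    jobsLaunched += 1
    data.sorted
  }
}

// Hypothetical sketch of a range partitioner with deferred bounds.
final class LazyRangePartitioner(
    val partitions: Int,
    val rdd: FakeRdd,
    val ascending: Boolean = true) {

  // lazy val: the body runs on first access, not at construction time.
  lazy val rangeBounds: Seq[Int] = {
    val sorted = rdd.sampleSorted()
    if (sorted.isEmpty || partitions <= 1) Seq.empty
    else (1 until partitions).map { i =>
      sorted(math.min(sorted.length - 1, i * sorted.length / partitions))
    }
  }

  // First use of rangeBounds happens here, i.e. when keys are placed.
  def getPartition(key: Int): Int = {
    val idx = rangeBounds.indexWhere(key <= _)
    val p = if (idx < 0) rangeBounds.length else idx
    if (ascending) p else rangeBounds.length - p
  }

  // Equality uses only cheap fields, so it never forces rangeBounds.
  override def equals(other: Any): Boolean = other match {
    case that: LazyRangePartitioner =>
      partitions == that.partitions &&
        rdd.id == that.rdd.id &&
        ascending == that.ascending
    case _ => false
  }

  override def hashCode(): Int = (partitions, rdd.id, ascending).hashCode()
}
```

In this sketch, constructing two partitioners over the same FakeRdd and comparing them leaves jobsLaunched at 0; the counter only moves to 1 on the first getPartition call and stays there afterwards. The trade-off is the one the quote already names: two partitioners built on different RDDs compare as unequal even if their bounds happen to match.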



--
This message was sent by Atlassian JIRA
(v6.2#6252)
