spark-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From r...@apache.org
Subject [6/6] git commit: Merge pull request #106 from pwendell/master
Date Fri, 25 Oct 2013 00:08:54 GMT
Merge pull request #106 from pwendell/master

Add a `repartition` operator.

This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:

1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.

Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.

2. If a user has input data where the number of partitions is not known. E.g.

> sc.textFile("some file").coalesce(50)....

This is both vague semantically (am I growing or shrinking this RDD) but also,
may not work correctly if the base RDD has fewer than 50 partitions.

The new operator forces shuffles every time, so it will always produce exactly
the number of new partitions. It also throws an exception rather than silently
not-working if a bad input is passed.

I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/99ad4a61
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/99ad4a61
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/99ad4a61

Branch: refs/heads/master
Commit: 99ad4a613a859d1ea246829b3681b3f30fa92527
Parents: 1dc776b 39f6f75
Author: Reynold Xin <rxin@apache.org>
Authored: Thu Oct 24 17:08:39 2013 -0700
Committer: Reynold Xin <rxin@apache.org>
Committed: Thu Oct 24 17:08:39 2013 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/rdd/RDD.scala   | 13 +++++
 .../scala/org/apache/spark/rdd/RDDSuite.scala   | 20 +++++++
 docs/streaming-programming-guide.md             |  4 ++
 .../org/apache/spark/streaming/DStream.scala    |  7 +++
 .../apache/spark/streaming/JavaTestUtils.scala  |  3 +-
 .../spark/streaming/BasicOperationsSuite.scala  | 38 ++++++++++++
 .../spark/streaming/CheckpointSuite.scala       |  4 +-
 .../apache/spark/streaming/TestSuiteBase.scala  | 61 ++++++++++++++++++--
 8 files changed, 140 insertions(+), 10 deletions(-)
----------------------------------------------------------------------



Mime
View raw message