spark-commits mailing list archives

From ma...@apache.org
Subject git commit: [SPARK-1468] Modify the partition function used by partitionBy.
Date Tue, 03 Jun 2014 20:31:41 GMT
Repository: spark
Updated Branches:
  refs/heads/branch-0.9 e03af416c -> 41e7853fc


[SPARK-1468] Modify the partition function used by partitionBy.

Make partitionBy use a tweaked version of hash as its default partition function,
since Python's built-in hash function does not consistently assign the same value
to None across Python processes.

Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468
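
The root cause is easiest to see outside Spark. Below is a minimal standalone
sketch (not part of the commit; portable_hash and numPartitions are illustrative
names) that compares hash(None) across two fresh interpreter processes and then
shows how the tweaked default pins None to a fixed bucket. On CPython builds
where hash(None) falls back to the object's memory address, the first check can
print False.

----------------------------------------------------------------------
import subprocess
import sys

# Ask two fresh Python processes for hash(None). On interpreters where
# hash(None) is derived from the object's address (which can shift, e.g.
# under address-space layout randomization), the two answers may differ.
outputs = [
    subprocess.check_output([sys.executable, "-c", "print(hash(None))"])
    for _ in range(2)
]
print(outputs[0] == outputs[1])  # may be False across processes

# The fix's default partition function pins None to bucket 0 and
# delegates every other key to the built-in hash:
portable_hash = lambda x: 0 if x is None else hash(x)

numPartitions = 4
print(portable_hash(None) % numPartitions)  # always 0, in every process
----------------------------------------------------------------------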

Author: Erik Selin <erik.selin@jadedpixel.com>

Closes #371 from tyro89/consistent_hashing and squashes the following commits:

201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default
partition function, since Python's built-in hash function does not consistently
assign the same value to None across Python processes.

(cherry picked from commit 8edc9d0330c94b50e01956ae88693cff4e0977b2)
Signed-off-by: Matei Zaharia <matei@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e7853f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e7853f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e7853f

Branch: refs/heads/branch-0.9
Commit: 41e7853fceee444d25e449188fb2bc321925fe46
Parents: e03af41
Author: Erik Selin <erik.selin@jadedpixel.com>
Authored: Tue Jun 3 13:31:16 2014 -0700
Committer: Matei Zaharia <matei@databricks.com>
Committed: Tue Jun 3 13:31:37 2014 -0700

----------------------------------------------------------------------
 python/pyspark/rdd.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/41e7853f/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index ace8476..06a390b 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -926,7 +926,7 @@ class RDD(object):
         return python_right_outer_join(self, other, numPartitions)
 
     # TODO: add option to control map-side combining
-    def partitionBy(self, numPartitions, partitionFunc=hash):
+    def partitionBy(self, numPartitions, partitionFunc=None):
         """
         Return a copy of the RDD partitioned using the specified partitioner.
 
@@ -937,6 +937,9 @@ class RDD(object):
         """
         if numPartitions is None:
             numPartitions = self.ctx.defaultParallelism
+
+        if partitionFunc is None:
+            partitionFunc = lambda x: 0 if x is None else hash(x)
         # Transferring O(n) objects to Java is too expensive.  Instead, we'll
         # form the hash buckets in Python, transferring O(numPartitions) objects
         # to Java.  Each object is a (splitNumber, [objects]) pair.
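
For completeness, a hypothetical usage sketch (assuming a local PySpark
installation built from this branch; the app name and sample data are made up):
after the patch, calling partitionBy without an explicit partitionFunc routes
every None-keyed pair to the same partition, which downstream operations such
as reduceByKey rely on.

----------------------------------------------------------------------
from pyspark import SparkContext

sc = SparkContext("local[2]", "partitionBy-none-keys")

# Keys include None; before the fix, separate worker processes could
# hash None differently and scatter these pairs across partitions.
pairs = sc.parallelize([(None, 1), ("a", 2), (None, 3), ("b", 4)])

# With the patched default, both None-keyed pairs land in one partition.
print(pairs.partitionBy(2).glom().collect())

sc.stop()
----------------------------------------------------------------------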

