From: holden@apache.org
To: commits@spark.apache.org
Subject: spark git commit: [SPARK-20232][PYTHON] Improve combineByKey docs
Date: Thu, 13 Apr 2017 19:43:33 +0000 (UTC)

Repository: spark
Updated Branches:
  refs/heads/master fbe4216e1 -> 8ddf0d2a6


[SPARK-20232][PYTHON] Improve combineByKey docs

## What changes were proposed in this pull request?

Improve combineByKey documentation:

* Add note on memory allocation
* Change example code to use different mergeValue and mergeCombiners

## How was this patch tested?

Doctest.

## Legal

This is my original work and I license the work to the project under the project's open source license.

Author: David Gingrich

Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ddf0d2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ddf0d2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ddf0d2a

Branch: refs/heads/master
Commit: 8ddf0d2a60795a2306f94df8eac6e265b1fe5230
Parents: fbe4216
Author: David Gingrich
Authored: Thu Apr 13 12:43:28 2017 -0700
Committer: Holden Karau
Committed: Thu Apr 13 12:43:28 2017 -0700

----------------------------------------------------------------------
 python/pyspark/rdd.py | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/8ddf0d2a/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 291c1ca..6014179 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1804,17 +1804,31 @@ class RDD(object):
           a one-element list)
         - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
           a list)
-        - C{mergeCombiners}, to combine two C's into a single one.
+        - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
+          the lists)
+
+        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
+        modify and return their first argument instead of creating a new C.
 
         In addition, users can control the partitioning of the output RDD.
 
         .. note:: V and C can be different -- for example, one might group an
             RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
 
-        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
-        >>> def add(a, b): return a + str(b)
-        >>> sorted(x.combineByKey(str, add, add).collect())
-        [('a', '11'), ('b', '1')]
+        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
+        >>> def to_list(a):
+        ...     return [a]
+        ...
+        >>> def append(a, b):
+        ...     a.append(b)
+        ...     return a
+        ...
+        >>> def extend(a, b):
+        ...     a.extend(b)
+        ...     return a
+        ...
+        >>> sorted(x.combineByKey(to_list, append, extend).collect())
+        [('a', [1, 2]), ('b', [1])]
         """
         if numPartitions is None:
            numPartitions = self._defaultReducePartitions()
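
For reference, the new doctest also runs as a standalone script. The
following is a minimal sketch, assuming a local SparkContext created just
for the example (the SparkContext setup is an assumption, not part of the
commit); the three helpers mirror the diff above:

    from pyspark import SparkContext

    # Assumption: a local context for this example; in the doctest above,
    # `sc` is provided by the test harness.
    sc = SparkContext("local", "combineByKey-example")

    x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

    def to_list(a):
        # createCombiner: turn the first value seen for a key into a
        # one-element list (a new C).
        return [a]

    def append(a, b):
        # mergeValue: fold another V into an existing C; per the new note,
        # mutating and returning the first argument avoids allocating a
        # fresh list.
        a.append(b)
        return a

    def extend(a, b):
        # mergeCombiners: merge two per-partition lists, again in place.
        a.extend(b)
        return a

    print(sorted(x.combineByKey(to_list, append, extend).collect()))
    # [('a', [1, 2]), ('b', [1])]

    sc.stop()

Mutating the first argument is safe here because combineByKey owns the
intermediate combiners; this is the allocation-avoiding pattern the new
docstring note describes.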