From: holden@apache.org
To: commits@spark.apache.org
Subject: spark git commit: [SPARK-20232][PYTHON] Improve combineByKey docs
Date: Thu, 13 Apr 2017 19:43:33 +0000 (UTC)

Repository: spark
Updated Branches:
  refs/heads/master fbe4216e1 -> 8ddf0d2a6


[SPARK-20232][PYTHON] Improve combineByKey docs

## What changes were proposed in this pull request?

Improve combineByKey documentation:

* Add note on memory allocation
* Change example code to use different mergeValue and mergeCombiners

## How was this patch tested?

Doctest.

## Legal

This is my original work and I license the work to the project under the project's open source license.

Author: David Gingrich

Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ddf0d2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ddf0d2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ddf0d2a

Branch: refs/heads/master
Commit: 8ddf0d2a60795a2306f94df8eac6e265b1fe5230
Parents: fbe4216
Author: David Gingrich
Authored: Thu Apr 13 12:43:28 2017 -0700
Committer: Holden Karau
Committed: Thu Apr 13 12:43:28 2017 -0700

----------------------------------------------------------------------
 python/pyspark/rdd.py | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/8ddf0d2a/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 291c1ca..6014179 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1804,17 +1804,31 @@ class RDD(object):
           a one-element list)
         - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
           a list)
-        - C{mergeCombiners}, to combine two C's into a single one.
+        - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
+          the lists)
+
+        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
+        modify and return their first argument instead of creating a new C.
 
         In addition, users can control the partitioning of the output RDD.
 
         .. note:: V and C can be different -- for example, one might group an
             RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
 
-        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
-        >>> def add(a, b): return a + str(b)
-        >>> sorted(x.combineByKey(str, add, add).collect())
-        [('a', '11'), ('b', '1')]
+        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
+        >>> def to_list(a):
+        ...     return [a]
+        ...
+        >>> def append(a, b):
+        ...     a.append(b)
+        ...     return a
+        ...
+        >>> def extend(a, b):
+        ...     a.extend(b)
+        ...     return a
+        ...
+        >>> sorted(x.combineByKey(to_list, append, extend).collect())
+        [('a', [1, 2]), ('b', [1])]
         """
         if numPartitions is None:
            numPartitions = self._defaultReducePartitions()
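
For reference, the new doctest also runs as a standalone script. The
following is a minimal sketch, assuming a local SparkContext created just
for the example (the SparkContext setup is an assumption, not part of the
commit); the three helpers mirror the diff above:

    from pyspark import SparkContext

    # Assumption: a local context for this example; in the doctest above,
    # `sc` is provided by the test harness.
    sc = SparkContext("local", "combineByKey-example")

    x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

    def to_list(a):
        # createCombiner: turn the first value seen for a key into a
        # one-element list (a new C).
        return [a]

    def append(a, b):
        # mergeValue: fold another V into an existing C; per the new note,
        # mutating and returning the first argument avoids allocating a
        # fresh list.
        a.append(b)
        return a

    def extend(a, b):
        # mergeCombiners: merge two per-partition lists, again in place.
        a.extend(b)
        return a

    print(sorted(x.combineByKey(to_list, append, extend).collect()))
    # [('a', [1, 2]), ('b', [1])]

    sc.stop()

Mutating the first argument is safe here because combineByKey owns the
intermediate combiners; this is the allocation-avoiding pattern the new
docstring note describes.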