Subject: Re: reduce, transform, combine
From: DB Tsai
To: dev@spark.apache.org
Date: Sun, 4 May 2014 01:40:32 -0700

You could easily achieve this with mapPartitions. However, it does not seem
possible with the existing aggregate-style operations. I can see that it
would be a generally useful operation; for now, mapPartitions should work.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 4, 2014 at 1:12 AM, Manish Amde wrote:
> I am currently using the RDD aggregate operation to fold per partition
> and then combine the per-partition results:
>
> def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
>
> I need to perform a transform operation after the seqOp and before the
> combOp. The signature would look like:
>
> def foldTransformCombine[U: ClassTag](zeroReduceValue: V, zeroCombineValue: U)(seqOp: (V, T) => V, transformOp: (V) => U, combOp: (U, U) => U): U
>
> This is especially useful when the transformOp is expensive and should be
> performed only once per partition before combining. Is there a way to
> accomplish this with existing RDD operations? If so, great; if not, should
> we consider adding such a general transformation to the list of RDD
> operations?
>
> -Manish
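[Editor's note: the mapPartitions pattern suggested above can be sketched without a Spark cluster by simulating partitions with nested Scala collections. The helper name foldTransformCombine follows Manish's proposed signature; this is a hedged illustration of the pattern, not Spark API.]

```scala
object FoldTransformCombineSketch {
  // Sketch of the proposed operation, with partitions simulated as Seq[Seq[T]].
  // The equivalent Spark pattern (assuming an RDD `rdd`) would be roughly:
  //   rdd.mapPartitions(iter => Iterator(transformOp(iter.foldLeft(zeroValue)(seqOp))))
  //      .reduce(combOp)
  // so transformOp runs exactly once per partition, before the combine step.
  def foldTransformCombine[T, V, U](partitions: Seq[Seq[T]], zeroValue: V)(
      seqOp: (V, T) => V,
      transformOp: V => U,
      combOp: (U, U) => U): U =
    partitions
      .map(p => transformOp(p.foldLeft(zeroValue)(seqOp))) // fold, then transform once per partition
      .reduce(combOp)                                      // combine transformed partition results

  def main(args: Array[String]): Unit = {
    // Example: sum each "partition", square the per-partition sum once,
    // then add the squares across partitions: (1+2+3)^2 + (4+5)^2 = 117.
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5))
    val result = foldTransformCombine[Int, Int, Int](parts, 0)(
      (acc, x) => acc + x,
      s => s * s,
      (a, b) => a + b)
    println(result) // 117
  }
}
```

With plain aggregate, the squaring would have to be folded into combOp and applied to already-combined values, which changes the semantics; keeping transformOp as a separate per-partition step is what the proposed signature buys.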