Date: Wed, 30 Jul 2014 09:52:38 -0400
Subject: Use of Iterable with combine in Scrunch
From: David Whiting
To: dev@crunch.apache.org

The Scrunch version of combine accepts a function Iterable[V] => V. This causes a lot of unexpected behaviour, because the Iterable being wrapped is actually a SingleUseIterable, and many of Scala's collection method implementations access the underlying iterator multiple times when they believe that is possible. As a result, you often have to write code like this:

    ...
      .groupByKey()
      .combine { _.iterator reduce { _ + _ } }

This is a silly example, of course, because there's an Aggregator for summation, but if your reduce function is more complex you have to go through this indirection via iterator in order to get correct behaviour.

Possible fixes:

a) Change combine to accept a function TraversableOnce[V] => V or Iterator[V] => V, better reflecting the single-use nature of the underlying Iterable.

b) Given that most custom combines will in fact be folds over monoids, we could promote the notion of reduce or fold up to the PGroupedTable itself, so you can do .groupByKey().foldValues(_+_)
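To make the two proposals concrete, here is a minimal, self-contained sketch that does not depend on Crunch at all. OneShot is a hypothetical stand-in that only models the "iterator may be requested once" behaviour described above; it is not Crunch's SingleUseIterable. combineWith and foldValues are likewise illustrative helpers with assumed signatures, not the Scrunch API.

```scala
object SingleUseDemo {
  // Hypothetical stand-in for a single-use iterable (NOT Crunch's class):
  // the first call to iterator succeeds, any later call throws.
  final class OneShot[A](values: Seq[A]) extends Iterable[A] {
    private var used = false
    override def iterator: Iterator[A] = {
      if (used) throw new IllegalStateException("iterator already requested")
      used = true
      values.iterator
    }
  }

  // Sketch of fix (a): the user function receives an Iterator[V], so the
  // single-use contract is visible in the type. The helper asks for the
  // iterator exactly once.
  def combineWith[V](values: Iterable[V])(f: Iterator[V] => V): V =
    f(values.iterator)

  // Sketch of fix (b): the caller supplies only the binary operation and
  // the helper does the single pass itself. Assumes a non-empty group,
  // since reduce throws on an empty iterator.
  def foldValues[V](values: Iterable[V])(op: (V, V) => V): V =
    values.iterator.reduce(op)

  def main(args: Array[String]): Unit = {
    val summed = combineWith(new OneShot(Seq(1, 2, 3)))(_.reduce(_ + _))
    val maxed  = foldValues(new OneShot(Seq(4, 9, 2)))(_ max _)
    println(summed) // prints 6
    println(maxed)  // prints 9
  }
}
```

In both variants the user-supplied function never sees an Iterable, so no collection method can go back for a second iterator behind the caller's back.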