Return-Path: X-Original-To: apmail-incubator-crunch-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-crunch-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CAD19DA56 for ; Sun, 4 Nov 2012 12:47:17 +0000 (UTC) Received: (qmail 25372 invoked by uid 500); 4 Nov 2012 12:47:16 -0000 Delivered-To: apmail-incubator-crunch-dev-archive@incubator.apache.org Received: (qmail 25299 invoked by uid 500); 4 Nov 2012 12:47:15 -0000 Mailing-List: contact crunch-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: crunch-dev@incubator.apache.org Delivered-To: mailing list crunch-dev@incubator.apache.org Received: (qmail 25245 invoked by uid 99); 4 Nov 2012 12:47:12 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 04 Nov 2012 12:47:12 +0000 X-ASF-Spam-Status: No, hits=0.7 required=5.0 tests=RCVD_IN_DNSWL_NONE,SPF_HELO_PASS,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [212.227.126.171] (HELO moutng.kundenserver.de) (212.227.126.171) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 04 Nov 2012 12:47:03 +0000 Received: from mafr.de (krlh-4d035ff8.pool.mediaWays.net [77.3.95.248]) by mrelayeu.kundenserver.de (node=mreu3) with ESMTP (Nemesis) id 0MJ1gb-1TXTVT1xXd-002UYT; Sun, 04 Nov 2012 13:46:42 +0100 Date: Sun, 4 Nov 2012 13:46:40 +0100 From: Matthias Friedrich To: crunch-dev@incubator.apache.org Subject: Re: CombineFn vs. Aggregator Message-ID: <20121104124640.GA17542@mafr.de> Mail-Followup-To: crunch-dev@incubator.apache.org References: <20121102101316.GA16314@mafr.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121102101316.GA16314@mafr.de> User-Agent: Mutt/1.5.21 (2010-09-15) X-Provags-ID: V02:K0:nIKATFTwJE9kAbaAPLeIVZMSVPmds0Sl/fMGA5EO3il yflRJXP4TzCTDTVPnzl5CxOBqc0FTA+Gu3gMpWx2mckJ4owBn+ KxmS/+TwX1wqG1e7d1pfNT112ajbiZfXwRkw+XggBX/eDXoLvT mCoMVWtPn4QHRjRCqaKIaqlb0nZGtv7Qub4gG0nq97+ZWYihPD 62xr4Glop9bL0+V1Zxx3iMYP/eiF/g7rRkU8DEWOuVaMUMjqtU PTW1MdhlZgbdOE/7tBsI2zlu4RDAYW9xLIRPla1AHKRKzCZXaA /wjbVHCOVFBfLys0s/OIDmOBHPQJRGYJ7tP0zqRQVaK1dN6Xg= = X-Virus-Checked: Checked by ClamAV on apache.org Hi, having looked at every single CombineFn and Aggregator in Crunch's code base and in light of CRUNCH-106 [1], I wonder if we could make things even easier to use. How about changing the Aggregator interface to this: public interface Aggregator extends Serializable { void initialize(Configuration conf); Iterable aggregate(Iterable values); } This would make writing anonymous inner classes for combineValues() much easier and shorter. Implementations could still keep a buffer between calls to aggregate() if they want to optimize memory management. Is there anything I'm overlooking? Regards, Matthias [1] https://issues.apache.org/jira/browse/CRUNCH-106 On Friday, 2012-11-02, Matthias Friedrich wrote: > Hi, > > in our last discussion on refactoring we agreed (I think) to clean > up CombineFn and move implementations to .fn.CombineFns. While playing > with the API, I noticed that it's harder than necessary to get your > generics in line. Example: > > PGroupedTable table = /* ... */; > table.combineValues(CombineFn.SUM_INTS()); > > You need to specify the String type or you get a compilation error. > This leaks an implementation detail that isn't obvious to users; you > need to know how CombineFn works to know that you have to specify the > key type. And it's just plain ugly. > > How about using Aggregator more prominently by adding a > combineValues(Aggregator a) method to PGroupedTable? We could > simplify the code and make it easier to understand: > > PGroupedTable table = /* ... */; > table.combineValues(Aggregators.SUM_INTS()); > > Writing custom aggregations would be easier, too, as you don't need to > decorate your aggregator with AggregatorCombineFn yourself anymore. We > could go even further and remove combineValues(CombineFn), because it > doesn't add much value anymore. If you need maximum flexibility (that > is, your aggregation depends on the table's key somehow), you can > still use parallelDo() with CombineFn directly. I don't think that's > a common use case though. > > Here's what I propose: > > 1) Leave CombineFn and Aggregator in base > 2) Remove all static factories (SUM_INTS etc.) from CombineFn > 3) Create an Aggregators class in .fn with static factories > for everything removed from CombineFn > 4) Make the implementation classes private, they just take up > lots of space in javaodcs > 5) Bonus: Remove combineValues(CombineFn) > > One thing I don't understand: Is there really a need for > AggregatorFactory? Am I overlooking something? It's not like we have > to create Aggregator instances for each key -- that's what > Aggregator.reset() is for. > > Regards, > Matthias