From: Josh Wills
Date: Wed, 23 Oct 2013 05:10:00 -0700
Subject: Re: ability to specify a different function for combiner & reducer
To: dev@crunch.apache.org

I certainly understand the issue; do you prefer the two-function solution to
one in which we added a method in DoFn to indicate which phase of the MR job
a particular DoFn was being executed in? We might have options like MAP,
REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a solution would
work for all cases, so someone please call me out on that if there's
something I'm missing.)

J

On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit wrote:

> Hi,
>
> I encountered a situation where I need different behaviour of my CombineFn
> during the combine & reduce phases.
> Basically, I have a collection of avro records that I need to combine.
> For some of these, I have so many records with the same key that I need to
> combine them first to make my job work (memory & timing constraints).
> For others, I can't combine them, because I need all records together.
> So, basically I would want to know in my function if it's combining or
> reducing.
> The only way to solve my problem in crunch right now seems to be to first
> split my collection into 2 different collections, combine them separately &
> union them again.
> But this gives a lot of overhead for something that would be supported by
> native M/R.
>
> I looked in the code and it seems that crunch internally has a NodeContext
> object to indicate COMBINE or REDUCE, but this context is not accessible in
> the DoFn.
> As the (RT)Node object is an internal crunch object, it's also not a clean
> solution to expose the NodeContext.
> So, as a better solution, it would be possible to create a new method:
> combineValues(combineFn, reduceFn) on PGroupedTable. The existing
> combineValues(combineFn) is in that case just a convenience method for most
> use cases, where the combineFn & reduceFn are the same function.
> With this new method, I would be able to just create my combineFn twice &
> pass a boolean in the constructor to indicate if it's combine or reduce.
>
> I already made a patch to add this function, but as the procedure
> indicates to discuss the change first, I'll write this mail first to check
> what you think. (I also didn't test my patch yet, although all unit & IT
> still pass)
>
> Thanks
> Stefan
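
To make the two-function proposal concrete, here is a minimal sketch of the
"create my combineFn twice & pass a boolean in the constructor" idea. It is
plain Java and does not use the Crunch API: ValuesFn, PhaseAwareFn, the
"big:" key prefix, and the static combineValues helper are all hypothetical
stand-ins invented for illustration, and the helper only simulates a
map-side combine followed by a reduce. It assumes a sum for keys that are
safe to pre-aggregate and a median for keys that need all records together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TwoFunctionCombineSketch {

    /** Hypothetical stand-in for a per-key value function; not a Crunch type. */
    interface ValuesFn {
        List<Long> apply(String key, List<Long> values);
    }

    /** One class, instantiated twice; the flag says which phase the instance serves. */
    static final class PhaseAwareFn implements ValuesFn {
        private final boolean combinePhase;

        PhaseAwareFn(boolean combinePhase) {
            this.combinePhase = combinePhase;
        }

        @Override
        public List<Long> apply(String key, List<Long> values) {
            // Assumption: keys prefixed "big:" are safe to pre-aggregate (a sum is associative).
            if (key.startsWith("big:")) {
                long sum = 0;
                for (long v : values) {
                    sum += v;
                }
                return Collections.singletonList(sum);
            }
            if (combinePhase) {
                // Combine phase, non-combinable key: pass everything through unchanged.
                return new ArrayList<>(values);
            }
            // Reduce phase, non-combinable key: the median needs all values at once.
            List<Long> sorted = new ArrayList<>(values);
            Collections.sort(sorted);
            return Collections.singletonList(sorted.get(sorted.size() / 2));
        }
    }

    /** Simulates combineValues(combineFn, reduceFn): combine each map output, then reduce. */
    static List<Long> combineValues(String key, List<List<Long>> mapOutputs,
                                    ValuesFn combineFn, ValuesFn reduceFn) {
        List<Long> shuffled = new ArrayList<>();
        for (List<Long> partial : mapOutputs) {
            shuffled.addAll(combineFn.apply(key, partial));
        }
        return reduceFn.apply(key, shuffled);
    }

    public static void main(String[] args) {
        ValuesFn combineFn = new PhaseAwareFn(true);
        ValuesFn reduceFn = new PhaseAwareFn(false);
        List<List<Long>> mapOutputs = Arrays.asList(
                Arrays.asList(1L, 2L), Arrays.asList(3L, 4L, 5L));
        System.out.println(combineValues("big:a", mapOutputs, combineFn, reduceFn));
        System.out.println(combineValues("small:b", mapOutputs, combineFn, reduceFn));
    }
}
```

In this sketch the "big:" key is shrunk to one partial sum per map output
before the reduce step, while the other key reaches the reduce step with all
of its values intact, which is the behaviour the original mail asks for
without splitting the collection in two.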