Subject: Re: Reduce function
From: ed <hadoopnode@gmail.com>
To: common-user@hadoop.apache.org
Date: Tue, 19 Oct 2010 09:05:43 -0400

Keys are partitioned among the reducers using a partition function, which is
specified in the aptly named Partitioner class. By default, Hadoop will hash
the key (and probably mod the hash by the number of reducers) to determine
which reducer to send your key to (I say probably because I haven't looked at
the actual code).

What this means for you is that if you set a custom bit in the key field, keys
with different bits are not guaranteed to go to the same reducer, even if the
rest of the key is the same. For example:

Key1 = (DataX+BitA) --> Reducer1
Key2 = (DataX+BitB) --> Reducer2

What you want is for any key with the same Data to go to the same reducer
regardless of the bit value. To do this, you need to write your own Partitioner
class and tell your job to use it with
job.setPartitionerClass(MyCustomPartitioner.class). Your custom partitioner
will need to break apart your key and hash only on the DataX part of it.
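For reference, the stock HashPartitioner does exactly what ed guesses: it hashes the whole key (masked to non-negative) and mods by the reducer count. A minimal, Hadoop-free sketch of that logic (the class and method names here are just for illustration):

```java
// Sketch of the default partitioner's logic
// (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):
// mask the key's hashCode to a non-negative value, then mod by the
// number of reduce tasks.
public class HashPartitionerSketch {

    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The two example keys from above: because the tag bit is part of
        // the hashed key, nothing stops them landing on different reducers.
        int r1 = getPartition("DataX+BitA", 2);
        int r2 = getPartition("DataX+BitB", 2);
        System.out.println("Key1 -> partition " + r1);
        System.out.println("Key2 -> partition " + r2);
    }
}
```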
The Partitioner class is really easy to override, and it will look something
like this:

public class MyCustomPartitioner extends Partitioner<Key, Value> {
    @Override
    public int getPartition(Key key, Value value, int numPartitions) {
        // Split the key so that the bit flag is removed, then hash the
        // remaining data part and mod it by numPartitions.
        // (getData() is a placeholder for whatever accessor your key
        // class provides; the mask keeps the hash non-negative.)
        return (key.getData().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Of course, Key and Value would be whatever Key and Value classes you're using.

Hope that helps.

~Ed

On Mon, Oct 18, 2010 at 8:58 PM, Brad Tofel wrote:

> Whoops, just re-read your message, and see you may be asking about
> targeting a reduce callback function, not a reduce task.
>
> If that's the case, I'm not sure I understand what your "bit/tag" is for,
> and what you're trying to do with it. Can you provide a concrete example
> (not necessarily code) of some keys which need to group together?
>
> Is there a way to embed the "bit" within the value, so keys are always
> common?
>
> If you really need to fake out the system so different keys arrive in the
> same reduce, you might be able to do it with a combination of:
>
> org.apache.hadoop.mapreduce.Job
>     .setSortComparatorClass()
>     .setGroupingComparatorClass()
>     .setPartitionerClass()
>
> Brad
>
> On 10/18/2010 05:41 PM, Brad Tofel wrote:
>
>> The "Partitioner" implementation used with your job should define which
>> reduce target receives a given map output key.
>>
>> I don't know if an existing Partitioner implementation exists which meets
>> your needs, but it's not a very complex interface to develop, if nothing
>> existing works for you.
>>
>> Brad
>>
>> On 10/18/2010 04:43 PM, Shi Yu wrote:
>>
>>> How many tags do you have? If you have only a few tags, you'd be better
>>> off creating a Vector class to hold those tags, and defining a sum
>>> function to increment the tag values. Then the value class should be
>>> your new Vector class. That's cleaner than the TextPair approach.
>>>
>>> Shi
>>>
>>> On 2010-10-18 5:19, Matthew John wrote:
>>>
>>>> Hi all,
>>>>
>>>> I had a small doubt regarding the reduce module. What I understand is
>>>> that after the shuffle/sort phase, all the records with the same key
>>>> value go into a reduce function. If that's the case, what is the
>>>> attribute of the Writable key which ensures that all the same keys go
>>>> to the same reduce?
>>>>
>>>> I am working on a reduce-side join where I need to tag all the keys
>>>> with a bit which might vary, but I still want all those records to go
>>>> into the same reduce. In Hadoop: The Definitive Guide, pg. 235, they
>>>> are using TextPair for the key. But I don't understand how the keys
>>>> with different tag information go into the same reduce.
>>>>
>>>> Matthew
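Putting ed's and Brad's suggestions together, the tag-stripping idea can be sketched without any Hadoop dependency: a composite key is partitioned on its natural-key part only, so records that differ only in the join tag meet in the same reduce. The "#" separator, the plain-String key, and the class name below are illustrative assumptions; a real job would use a custom Writable key and wire the partitioner in with job.setPartitionerClass().

```java
// Hadoop-free sketch of partitioning a tagged composite key ("naturalKey#tag")
// on the natural key only, so the tag bit no longer influences which
// reducer a record is sent to.
public class TagAwarePartitionSketch {

    // Hypothetical key layout: natural key and tag joined by '#'.
    static int getPartition(String compositeKey, int numPartitions) {
        String naturalKey = compositeKey.split("#", 2)[0]; // drop the tag
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = getPartition("DataX#0", 4);
        int p2 = getPartition("DataX#1", 4);
        // Both tagged variants of DataX land in the same partition.
        System.out.println(p1 == p2);
    }
}
```

Note that the partitioner alone only gets both records to the same reduce *task*; to have them arrive in the same reduce *call*, you would also set a grouping comparator (Brad's setGroupingComparatorClass) that likewise compares only the natural key.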