Subject: Re: Reduce function
From: ed <hadoopnode@gmail.com>
To: common-user@hadoop.apache.org
Date: Tue, 19 Oct 2010 09:05:43 -0400

Keys are partitioned among the reducers using a partition function, which is
specified in the aptly named Partitioner class. By default, Hadoop will hash
the key (and probably mod the hash by the number of reducers) to determine
which reducer to send your key to (I say probably because I haven't looked at
the actual code).

What this means for you is that if you set a custom bit in the key field, keys
with different bits are not guaranteed to go to the same reducer, even if the
rest of the key is the same. For example:

Key1 = (DataX+BitA) --> Reducer1
Key2 = (DataX+BitB) --> Reducer2

What you want is for any key with the same Data to go to the same reducer
regardless of the bit value. To do this, you need to write your own Partitioner
class and tell your job to use it with
job.setPartitionerClass(MyCustomPartitioner.class). Your custom partitioner
will need to break apart your key and hash only on the DataX part of it.
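For reference, the stock HashPartitioner does exactly what ed guesses: it hashes the whole key (masked to non-negative) and mods by the reducer count. A minimal, Hadoop-free sketch of that logic (the class and method names here are just for illustration):

```java
// Sketch of the default partitioner's logic
// (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):
// mask the key's hashCode to a non-negative value, then mod by the
// number of reduce tasks.
public class HashPartitionerSketch {

    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The two example keys from above: because the tag bit is part of
        // the hashed key, nothing stops them landing on different reducers.
        int r1 = getPartition("DataX+BitA", 2);
        int r2 = getPartition("DataX+BitB", 2);
        System.out.println("Key1 -> partition " + r1);
        System.out.println("Key2 -> partition " + r2);
    }
}
```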
The Partitioner class is really easy to override, and it will look something
like this:

public class MyCustomPartitioner extends Partitioner<Key, Value> {
    @Override
    public int getPartition(Key key, Value value, int numPartitions) {
        // Split the key so that the bit flag is removed, then hash the
        // remaining data part and mod it by numPartitions.
        // (getData() is a placeholder for whatever accessor your key
        // class provides; the mask keeps the hash non-negative.)
        return (key.getData().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Of course, Key and Value would be whatever Key and Value classes you're using.

Hope that helps.

~Ed

On Mon, Oct 18, 2010 at 8:58 PM, Brad Tofel wrote:

> Whoops, just re-read your message, and see you may be asking about
> targeting a reduce callback function, not a reduce task.
>
> If that's the case, I'm not sure I understand what your "bit/tag" is for,
> and what you're trying to do with it. Can you provide a concrete example
> (not necessarily code) of some keys which need to group together?
>
> Is there a way to embed the "bit" within the value, so keys are always
> common?
>
> If you really need to fake out the system so different keys arrive in the
> same reduce, you might be able to do it with a combination of:
>
> org.apache.hadoop.mapreduce.Job
>     .setSortComparatorClass()
>     .setGroupingComparatorClass()
>     .setPartitionerClass()
>
> Brad
>
> On 10/18/2010 05:41 PM, Brad Tofel wrote:
>
>> The "Partitioner" implementation used with your job should define which
>> reduce target receives a given map output key.
>>
>> I don't know if an existing Partitioner implementation exists which meets
>> your needs, but it's not a very complex interface to develop, if nothing
>> existing works for you.
>>
>> Brad
>>
>> On 10/18/2010 04:43 PM, Shi Yu wrote:
>>
>>> How many tags do you have? If you have only a few tags, you'd be better
>>> off creating a Vector class to hold those tags, and defining a sum
>>> function to increment the tag values. Then the value class should be
>>> your new Vector class. That's cleaner than the TextPair approach.
>>>
>>> Shi
>>>
>>> On 2010-10-18 5:19, Matthew John wrote:
>>>
>>>> Hi all,
>>>>
>>>> I had a small doubt regarding the reduce module. What I understand is
>>>> that after the shuffle/sort phase, all the records with the same key
>>>> value go into a reduce function. If that's the case, what is the
>>>> attribute of the Writable key which ensures that all the same keys go
>>>> to the same reduce?
>>>>
>>>> I am working on a reduce-side join where I need to tag all the keys
>>>> with a bit which might vary, but I still want all those records to go
>>>> into the same reduce. In Hadoop: The Definitive Guide, pg. 235, they
>>>> are using TextPair for the key. But I don't understand how the keys
>>>> with different tag information go into the same reduce.
>>>>
>>>> Matthew
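Putting ed's and Brad's suggestions together, the tag-stripping idea can be sketched without any Hadoop dependency: a composite key is partitioned on its natural-key part only, so records that differ only in the join tag meet in the same reduce. The "#" separator, the plain-String key, and the class name below are illustrative assumptions; a real job would use a custom Writable key and wire the partitioner in with job.setPartitionerClass().

```java
// Hadoop-free sketch of partitioning a tagged composite key ("naturalKey#tag")
// on the natural key only, so the tag bit no longer influences which
// reducer a record is sent to.
public class TagAwarePartitionSketch {

    // Hypothetical key layout: natural key and tag joined by '#'.
    static int getPartition(String compositeKey, int numPartitions) {
        String naturalKey = compositeKey.split("#", 2)[0]; // drop the tag
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = getPartition("DataX#0", 4);
        int p2 = getPartition("DataX#1", 4);
        // Both tagged variants of DataX land in the same partition.
        System.out.println(p1 == p2);
    }
}
```

Note that the partitioner alone only gets both records to the same reduce *task*; to have them arrive in the same reduce *call*, you would also set a grouping comparator (Brad's setGroupingComparatorClass) that likewise compares only the natural key.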