From: Harsh J
Date: Sat, 14 Jul 2012 11:38:14 +0530
Subject: Re: Basic question on how reducer works
To: mapreduce-user@hadoop.apache.org
Cc: Grandl Robert

If you wish to impose a limit on the maximum input a reducer is
allowed to receive in a job, you may set "mapreduce.reduce.input.limit"
on your job, as the total number of bytes allowed per reducer. But this
is more of a hard limit (the job fails if a reducer's input exceeds
it), which I suspect your question wasn't about. Your question is
indeed better suited to Pig's user list.
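In case it helps, setting it in a driver would look roughly like the
sketch below; this assumes the 1.x-era API, and the class name, job
name and the 10 GB figure are only illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceInputLimitDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the total map-output bytes any single reducer may consume.
        // The value is in bytes; the default of -1 disables the check.
        conf.setLong("mapreduce.reduce.input.limit",
                     10L * 1024 * 1024 * 1024); // 10 GB per reducer
        Job job = new Job(conf, "reduce-input-limit-demo");
        // ... set mapper, reducer, input/output paths as usual, then:
        // job.waitForCompletion(true);
      }
    }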
On Tue, Jul 10, 2012 at 8:59 PM, Subir S wrote:
> Is there any property to convey the maximum amount of data each
> reducer/partition may take for processing, like the bytes_per_reducer
> of Pig, so that the number of reducers can be controlled based on the
> size of the intermediate map output?
>
> On 7/10/12, Karthik Kambatla wrote:
>> The partitioner is configurable. The default partitioner, from what I
>> remember, computes the partition as the key's hash code modulo the
>> number of reducers/partitions. For random input it is balanced, but
>> some cases can have a very skewed key distribution. Also, as you have
>> pointed out, the number of values per key can vary. Together, both of
>> these determine the "weight" of each partition, as you call it.
>>
>> Karthik
>>
>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert wrote:
>>
>>> Thanks Arun.
>>>
>>> So, just for my clarification: the map will create partitions
>>> according to the number of reducers, such that each reducer gets
>>> almost the same number of keys in its partition. However, each key
>>> can have a different number of values, so the "weight" of each
>>> partition will depend on that. Also, when a new value is added into
>>> a partition, a hash on the partition ID will be computed to find the
>>> corresponding partition?
>>>
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>
>>> Thanks a lot, guys, for the answers.
>>>
>>> Still, I am not able to find exactly the code for the following:
>>>
>>> 1. Where a reducer reads only its own partition from a map's output.
>>> I looked into ReduceTask#getMapOutput, which does the actual read in
>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>> partition (reduce ID) to read.
>>>
>>> Look at TaskTracker.MapOutputServlet.
>>>
>>> 2. I still don't understand very well in which part of the code
>>> (MapTask.java) the intermediate data is written to which partition.
>>> MapOutputBuffer is the one that actually writes the data to the
>>> buffer and spills after the buffer is full. Could you please
>>> elaborate a bit on how the data is written to its partition?
>>>
>>> Essentially, you can think of the partition-id as the 'primary key'
>>> and the actual 'key' in the map output as the 'secondary key'.
>>>
>>> hth,
>>> Arun
>>>
>>> Thanks,
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> Robert,
>>>
>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>
>>> Hi,
>>>
>>> I have some questions related to basic functionality in Hadoop.
>>>
>>> 1. When a mapper processes the intermediate output data, how does it
>>> know how many partitions to create (i.e., how many reducers there
>>> will be) and how much data goes into each partition for each
>>> reducer?
>>>
>>> 2. When the JobTracker assigns a task to a reducer, it also
>>> specifies the locations of the intermediate output data to retrieve,
>>> right? But how does a reducer know, for each remote location holding
>>> intermediate output, which portion is the one it alone has to
>>> retrieve?
>>>
>>> To add to Harsh's comment: essentially, the TT *knows* where the
>>> output of a given map-id/reduce-id pair is present via an
>>> output-file/index-file combination.
>>>
>>> Arun
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
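As an aside, since the default partitioner came up above: from what I
remember, its logic is essentially the snippet below (a sketch of
Hadoop's HashPartitioner written from memory; check the sources of
your release for the authoritative version):

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so a negative hashCode() can't yield a
        // negative index, then fold into [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

This guarantees all values of a key land in the same partition, but
nothing balances the number of values per key across partitions, hence
the skew Karthik described.

--
Harsh J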