hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Basic question on how reducer works
Date Sat, 14 Jul 2012 06:08:14 GMT
If you wish to impose a limit on the maximum reducer input allowed in a
job, you may set "mapreduce.reduce.input.limit" on your job; it is the
total number of bytes allowed per reducer.

But this is more of a hard limit, which I suspect your question wasn't
about. Your question is indeed better suited to Pig's user lists.
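
For illustration, a minimal sketch of setting the limit on a job (the
job name here is made up; as I recall the value is in bytes, with -1
meaning no limit):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    // Hard cap on per-reducer input; a job whose estimated reduce
    // input exceeds this is failed rather than run.
    conf.setLong("mapreduce.reduce.input.limit",
        10L * 1024 * 1024 * 1024); // 10 GB
    Job job = new Job(conf, "my-job");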

On Tue, Jul 10, 2012 at 8:59 PM, Subir S <subir.sasikumar@gmail.com> wrote:
> Is there any property to convey the maximum amount of data each
> reducer/partition may take for processing? Something like Pig's
> bytes_per_reducer, so that the number of reducers can be controlled
> based on the size of the intermediate map output?
>
> On 7/10/12, Karthik Kambatla <kasha@cloudera.com> wrote:
>> The partitioner is configurable. The default partitioner, from what I
>> remember, computes the partition as the key's hash code modulo the number
>> of reducers/partitions. For random input this is balanced, but some cases
>> can have a very skewed key distribution. Also, as you have pointed out,
>> the number of values per key can vary. Together, the two determine the
>> "weight" of each partition, as you call it.
>>
>> Karthik
>>
>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgrandl@yahoo.com> wrote:
>>
>>> Thanks Arun.
>>>
>>> So, just for my clarification: the map will create partitions according
>>> to the number of reducers, such that each reducer gets almost the same
>>> number of keys in its partition. However, each key can have a different
>>> number of values, so the "weight" of each partition will depend on that.
>>> Also, when a new <key, value> pair is added to a partition, is a hash
>>> computed to find the corresponding partition?
>>>
>>> Robert
>>>
>>>   ------------------------------
>>> *From:* Arun C Murthy <acm@hortonworks.com>
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>>
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>>
>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>
>>> Thanks a lot guys for answers.
>>>
>>> Still, I am not able to find the exact code for the following things:
>>>
>>> 1. Where a reducer reads only its own partition from a map's output. I
>>> looked into ReduceTask#getMapOutput, which does the actual read in
>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>> partition (reduce id) to read.
>>>
>>>
>>> Look at TaskTracker.MapOutputServlet.
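>>>
>>> From what I remember, the fetch itself is just an HTTP GET against that
>>> servlet, with the reduce's partition id passed as a query parameter,
>>> along the lines of:
>>>
>>>   http://<tasktracker>:<port>/mapOutput?job=<jobid>&map=<mapid>&reduce=<partition>
>>>
>>> so it is the servlet, not the reducer, that seeks to the right slice
>>> of the map's output file.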
>>>
>>> 2. I still don't understand very well in which part of the code
>>> (MapTask.java) the intermediate data is written to which partition. So
>>> MapOutputBuffer is the one that actually writes the data to the buffer
>>> and spills after the buffer is full. Could you please elaborate a bit
>>> on how the data is written to which partition?
>>>
>>>
>>> Essentially you can think of the partition-id as the 'primary key' and
>>> the actual 'key' in the map-output <key, value> pair as the 'secondary
>>> key'.
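>>>
>>> As a sketch of what that means for the sort (illustrative only, not
>>> the actual MapOutputBuffer code; SpillRecord is a made-up name):
>>>
>>>   class SpillRecord {
>>>     int partition;   // partition-id: the 'primary key'
>>>     String key;      // map output key: the 'secondary key'
>>>   }
>>>
>>>   int compare(SpillRecord a, SpillRecord b) {
>>>     if (a.partition != b.partition) {
>>>       return a.partition - b.partition; // group by partition first
>>>     }
>>>     return a.key.compareTo(b.key);      // then sort by key within it
>>>   }
>>>
>>> so each partition ends up as one contiguous, key-sorted run in the
>>> spill file.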
>>>
>>> hth,
>>> Arun
>>>
>>> Thanks,
>>> Robert
>>>
>>>   ------------------------------
>>> *From:* Arun C Murthy <acm@hortonworks.com>
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> Robert,
>>>
>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>
>>> Hi,
>>>
>>> I have some questions related to basic functionality in Hadoop.
>>>
>>> 1. When a Mapper processes the intermediate output data, how does it
>>> know how many partitions to create (i.e., how many reducers there will
>>> be) and how much data should go into each partition for each reducer?
>>>
>>> 2. When the JobTracker assigns a task to a reducer, it also specifies
>>> the locations of the intermediate output data it should retrieve, right?
>>> But how does a reducer know which portion of the intermediate output it
>>> has to retrieve from each remote location?
>>>
>>>
>>> To add to Harsh's comment: essentially the TT *knows* where the output
>>> of a given map-id/reduce-id pair lives via an output-file/index-file
>>> combination.
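>>>
>>> (If memory serves, the index file holds one fixed-size record per
>>> reduce partition, i.e. three longs: start offset, raw length, and
>>> compressed length. Serving partition R is then just a matter of reading
>>> record R of file.out.index and streaming that byte range out of
>>> file.out.)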
>>>
>>> Arun
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>



-- 
Harsh J
