From: Subir S <subir.sasikumar@gmail.com>
To: mapreduce-user@hadoop.apache.org
Cc: Grandl Robert
Date: Tue, 10 Jul 2012 20:59:39 +0530
Subject: Re: Basic question on how reducer works

Is there any property to convey the maximum amount of data each
reducer/partition may take for processing, like Pig's bytes_per_reducer,
so that the number of reducers can be controlled based on the size of
the intermediate map output?

On 7/10/12, Karthik Kambatla wrote:
> The partitioner is configurable. The default partitioner, from what I
> remember, computes the partition as the key's hashcode modulo the
> number of reducers/partitions. For random input that is balanced, but
> some cases can have a very skewed key distribution. Also, as you have
> pointed out, the number of values per key can also vary. Together, the
> two determine the "weight" of each partition, as you call it.
>
> Karthik
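For concreteness, the default rule Karthik describes boils down to a
one-liner. The sketch below is modeled on Hadoop's HashPartitioner (from
memory, so check the source of your version); the sign-bit mask keeps a
negative hashCode() from producing a negative partition number:

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit, then take the hash modulo the number
        // of reducers, so every key lands in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

If this skews badly for your keys, the same getPartition signature is
the hook for a custom partitioner, plugged in via
job.setPartitionerClass(...).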
> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert wrote:
>
>> Thanks Arun.
>>
>> So, just for my clarification: the map will create partitions
>> according to the number of reducers, such that each reducer gets
>> almost the same number of keys in its partition. However, each key can
>> have a different number of values, so the "weight" of each partition
>> will depend on that. Also, when a new <key, value> pair is added, is a
>> hash computed on the key to find the corresponding partition?
>>
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 4:33 PM
>> *Subject:* Re: Basic question on how reducer works
>>
>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>
>> Thanks a lot guys for the answers.
>>
>> Still, I am not able to find exactly the code for the following things:
>>
>> 1. Where a reducer reads only its own partition from a map output. I
>> looked into ReduceTask#getMapOutput, which does the actual read in
>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>> partition (reduce ID) to read.
>>
>> Look at TaskTracker.MapOutputServlet.
>>
>> 2. I still don't understand very well in which part of the code
>> (MapTask.java) the intermediate data is written to which partition. So
>> MapOutputBuffer is the one that actually writes the data to the buffer
>> and spills after the buffer is full. Could you please elaborate a bit
>> on how the data is written to which partition?
>>
>> Essentially, you can think of the partition-id as the 'primary key'
>> and the actual 'key' in the map output as the 'secondary key'.
>>
>> hth,
>> Arun
>>
>> Thanks,
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 9:24 AM
>> *Subject:* Re: Basic question on how reducer works
>>
>> Robert,
>>
>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a mapper produces the intermediate output data, how does it
>> know how many partitions to create (i.e., how many reducers there will
>> be) and how much data should go into each partition for each reducer?
>>
>> 2. When the JobTracker assigns a task to a reducer, it will also
>> specify the locations of the intermediate output data it should
>> retrieve, right? But how will a reducer know, from each remote
>> location holding intermediate output, which portion it alone has to
>> retrieve?
>>
>> To add to Harsh's comment: essentially the TT *knows* where the output
>> of a given map-id/reduce-id pair is present via an
>> output-file/index-file combination.
>>
>> Arun
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
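For anyone digging through MapTask.java after this thread: below is a
toy, self-contained sketch (not Hadoop source; the names are made up for
illustration) of the ordering Arun describes, with the partition-id as
the "primary key" and the map-output key as the "secondary key". After
such a sort, each reducer's partition is one contiguous run of records,
which is what lets the index file store a simple (offset, length) per
reducer:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class SpillOrderSketch {

      // Stand-in for one serialized map-output record.
      static class Record {
        final int partition; // as computed by the Partitioner
        final String key;    // the actual map-output key

        Record(int partition, String key) {
          this.partition = partition;
          this.key = key;
        }

        public String toString() {
          return "partition=" + partition + " key=" + key;
        }
      }

      public static void main(String[] args) {
        List<Record> buffer = new ArrayList<Record>();
        buffer.add(new Record(1, "banana"));
        buffer.add(new Record(0, "cherry"));
        buffer.add(new Record(1, "apple"));
        buffer.add(new Record(0, "apricot"));

        // Primary sort on partition-id, secondary sort on key.
        Collections.sort(buffer, new Comparator<Record>() {
          public int compare(Record a, Record b) {
            if (a.partition != b.partition) {
              return a.partition < b.partition ? -1 : 1;
            }
            return a.key.compareTo(b.key);
          }
        });

        // Prints partition 0's records first (sorted by key), then
        // partition 1's: each reducer's slice is contiguous, so one
        // index entry per partition is enough for the fetch.
        for (Record r : buffer) {
          System.out.println(r);
        }
      }
    }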