hadoop-mapreduce-user mailing list archives

From Pedro Costa <psdc1...@gmail.com>
Subject Re: Partition - Distribution of map outputs
Date Wed, 30 Jun 2010 22:07:55 GMT
- I'm running the wordcount example with 3 small txt files as input.
I assume there will be 3 mappers producing 3 map outputs, one map
output per txt file, right?

When the reduce tasks fetch the map outputs, each one has to know which map
outputs it can get. How does each reduce know which map outputs to fetch? Who
gives it this information?

- Is the distribution of map outputs to the reducers called partitioning?

- During the sort phase, each reducer already knows which map outputs it
should get, right?

On Wed, Jun 30, 2010 at 9:36 PM, Arun C Murthy <acm@yahoo-inc.com> wrote:

> On Jun 30, 2010, at 1:29 PM, Pedro Costa wrote:
>> As I understand from what I've read, the purpose of the partition is to
>> tell each reducer which map output it will get.
>> For example, if I have 3 split files and 2 reduces defined in my example,
>> the map side will produce 3 map outputs (one map per split file) and
>> the reduce side will produce 2 part-* files. part-00000 will
>> contain the results of 2 map outputs and part-00001 will contain the
>> results of 1 map output.
> Typically, each map produces output for each reduce.
> In your example, part-00000 will contain the output of reduce-0 and
> part-00001 will contain the output of reduce-1.
>> - My question is, where in Hadoop MR is it set which reduce gets
>> which map output? Is it during the creation of the reduce tasks, or in
>> another phase of the MR job?
>> - Can you point me to the class that does this distribution of map
>> outputs to the reduce tasks?
> Take a look at the Partitioner - the partitioner for the job decides which
> keys are sent to which reduce.
> Arun
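
To make the Partitioner answer above concrete, here is a minimal stand-alone sketch of hash partitioning. It reuses the arithmetic of Hadoop's default HashPartitioner (key hash, sign bit cleared, modulo the number of reduces), but it does not use the Hadoop libraries. The class PartitionSketch and its methods are hypothetical names made up for illustration, and it hashes plain Java Strings, so the partition numbers will not match what a real job computes for Text keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not depend on the Hadoop libraries.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner: clear the sign bit
    // so the value is never negative, then take it modulo the reduce count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Splits the keys emitted by ONE map task into one bucket per reduce task.
    // Bucket r stands for the part of this map's output that reduce r fetches.
    static List<List<String>> partitionMapOutput(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReduceTasks; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // as in the example from the thread
        // Made-up words standing in for the output keys of a single map task.
        List<String> mapOutputKeys = Arrays.asList("hadoop", "map", "reduce", "map");
        List<List<String>> buckets = partitionMapOutput(mapOutputKeys, numReduceTasks);
        for (int r = 0; r < numReduceTasks; r++) {
            System.out.println("reduce-" + r + " gets " + buckets.get(r));
        }
    }
}
```

The point of the sketch is that the choice is made per key, not per map output: every map applies the same function, so all occurrences of a given word end up at the same reduce regardless of which map emitted them.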

