hadoop-common-user mailing list archives

From James Seigel <ja...@tynt.com>
Subject Re: When does Reduce job start
Date Wed, 05 Jan 2011 01:18:53 GMT
As the other gentleman said, the reduce task needs to know that all
the data is available before doing its work.

By design.



On 2011-01-04, at 6:14 PM, sagar naik <snaik@attributor.com> wrote:

> Hi Jeff,
> To be clear on my end, I'm not talking about the reduce() function call but
> the spawning of the reduce process/task itself.
> To rephrase:
>   The reduce process/task is not started until 90% of the map tasks are done.
> -Sagar
> On Tue, Jan 4, 2011 at 3:14 PM, Jeff Bean <jwfbean@cloudera.com> wrote:
>> It's part of the design that reduce() does not get called until the map
>> phase is complete. You're seeing reduce report as started when map is at 90%
>> complete because Hadoop is shuffling data from the mappers that have
>> completed. As currently designed, you can't prematurely start reduce()
>> because there is no way to guarantee you have all the values for any key
>> until all the mappers are done. reduce() requires a key and all the values
>> for that key in order to execute.
>> Jeff
>> On Tue, Jan 4, 2011 at 10:53 AM, sagar naik <snaik@attributor.com> wrote:
>>> Hi All,
>>> Number of map tasks: 1000s
>>> Number of reduce tasks: single digit
>>> In such cases the reduce task won't start even when a few map tasks are
>>> completed.
>>> Example:
>>> In my observation of a sample run of bin/hadoop jar
>>> hadoop-*examples*.jar pi 10000 10, the reduce did not start until 90%
>>> of the map tasks were complete.
>>> The only reason I can think of for not starting a reduce task is to
>>> avoid the unnecessary transfer of map output data in case of
>>> failures.
>>> Is there a way to quickly start the reduce task in such a case?
>>> What is the configuration param to change this behavior?
>>> Thanks,
>>> Sagar
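
The distinction drawn in this thread (a reduce task can be launched and start copying map output early, while reduce() itself waits for every map to finish) can be illustrated with a toy model. This is only a sketch in plain Python, not Hadoop code: the function names, the slowstart_fraction parameter, and the sample data are all made up for illustration. In Hadoop releases of that era the launch threshold was, as far as I know, controlled by the job property mapred.reduce.slowstart.completed.maps (renamed mapreduce.job.reduce.slowstart.completedmaps in later releases). That name does not appear in the thread, so check it against the mapred-default.xml of your version.

```python
import math
from collections import defaultdict


def reducers_may_launch(completed_maps, total_maps, slowstart_fraction):
    """True once enough map tasks have finished for reduce tasks to be scheduled.

    slowstart_fraction is a hypothetical stand-in for the scheduler threshold.
    """
    needed = math.ceil(slowstart_fraction * total_maps)
    return completed_maps >= needed


def run_job(map_outputs, slowstart_fraction):
    """Simulate one job with a single reducer that sums values per key.

    map_outputs holds one list of (key, value) pairs per map task,
    in the order the map tasks complete.
    Returns (number of finished maps when the reducer launched, reduce results).
    """
    total = len(map_outputs)
    shuffled = defaultdict(list)  # values already copied to the reducer
    pending = []                  # finished map outputs not yet copied
    launched_at = None

    for done, output in enumerate(map_outputs, start=1):
        pending.append(output)
        if launched_at is None and reducers_may_launch(done, total, slowstart_fraction):
            launched_at = done  # reduce task starts here, shuffle only
        if launched_at is not None:
            # Shuffle: copy whatever finished map output is available.
            for finished in pending:
                for key, value in finished:
                    shuffled[key].append(value)
            pending.clear()

    # reduce() runs only now, after every map has finished, because only
    # now is each key's list of values known to be complete.
    results = {key: sum(values) for key, values in shuffled.items()}
    return launched_at, results


if __name__ == "__main__":
    outputs = [[("a", 1)], [("b", 2)], [("a", 3)], [("a", 4), ("b", 5)]]
    print(run_job(outputs, 0.5))
    print(run_job(outputs, 0.9))
```

With the four sample map outputs, a fraction of 0.5 launches the reducer after the second map and a fraction of 0.9 launches it after the fourth. The reduce results are the same in both cases, since the threshold only changes when copying begins, not when reduce() is allowed to run.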
