hadoop-mapreduce-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: Barrier between reduce and map of the next round
Date Thu, 04 Feb 2010 07:53:15 GMT

>> However, from r(i) to m(i+1) there is an unnecessary barrier. m(i+1) should not need
>> to wait for all reducers r(i) to finish, right?

Yes, but r(i+1) can't be in the same job, since that would require another sort and
shuffle phase (a barrier). So you would end up doing job(i): m(i) r(i) m(i+1), and
job(i+1): m(identity) r(i+1). Of course, this assumes you can't fold r(i+1) into
m(identity); if you can, then it needs no sort and shuffle, and your job would again
be of the form M+RM* :)
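
For concreteness, here is a minimal driver sketch of that M+RM* shape using
ChainMapper/ChainReducer from the old mapred API. MyMapper, MyReducer, and
NextMapper are hypothetical classes standing in for m(i), r(i), and m(i+1);
the paths and key/value types are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainedRoundDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainedRoundDriver.class);
    job.setJobName("m(i) r(i) m(i+1)");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // m(i): the job's map phase.
    // MyMapper, MyReducer, NextMapper are hypothetical old-API classes.
    ChainMapper.addMapper(job, MyMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // r(i): the single sort/shuffle barrier in this job.
    ChainReducer.setReducer(job, MyReducer.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // m(i+1): runs inside each reduce task as records are emitted,
    // so there is no extra barrier after r(i).
    ChainReducer.addMapper(job, NextMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    JobClient.runJob(job);
  }
}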

Amogh

On 2/4/10 10:19 AM, "Felix Halim" <felix.halim@gmail.com> wrote:

Hi Ed,

Currently my program is like this: m(1), r(1), m(2), r(2), ..., m(K), r(K). The
barrier between m(i) and r(i) is acceptable, since the reducer has to wait
for all map tasks to finish. However, from r(i) to m(i+1) there is an
unnecessary barrier. m(i+1) should not need to wait for all reducers
r(i) to finish, right?

Currently I create one Job for each m(i), r(i) pair, so I have a total of K
jobs. Is there a way to chain them all together into a single Job?
However, I don't know the value of K in advance; it has to be checked
after each r(i). So I'm thinking the job could speculatively run the
chain over and over until it discovers that some counter in r(i) is zero
(so the result of m(K+1) is discarded, and the final result of r(K) is
taken).
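
A minimal sketch of such a driver loop, assuming the reducer increments a
hypothetical Convergence.UPDATES counter whenever it emits work for another
round (the paths, classes, and counter name are placeholders, not an API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class IterativeDriver {
  // Hypothetical counter bumped by r(i) while work remains.
  public enum Convergence { UPDATES }

  public static void main(String[] args) throws Exception {
    int round = 0;
    long updates;
    do {
      JobConf job = new JobConf(IterativeDriver.class);
      job.setJobName("round-" + round);
      // Each round reads the previous round's output.
      FileInputFormat.setInputPaths(job, new Path("data/round-" + round));
      FileOutputFormat.setOutputPath(job, new Path("data/round-" + (round + 1)));
      // ... set mapper/reducer and key/value classes here ...
      RunningJob running = JobClient.runJob(job); // blocks until the job finishes
      updates = running.getCounters().getCounter(Convergence.UPDATES);
      round++;
    } while (updates > 0); // stop once r(i) reports no more updates
  }
}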

Felix Halim


On Thu, Feb 4, 2010 at 12:25 PM, Ed Mazur <mazur@cs.umass.edu> wrote:
> Felix,
>
> You can use ChainMapper and ChainReducer to create jobs of the form
> M+RM*. Is that what you're looking for? I'm not aware of anything that
> allows you to have multiple reduce functions without the job
> "barrier".
>
> Ed
>
> On Wed, Feb 3, 2010 at 9:41 PM, Felix Halim <felix.halim@gmail.com> wrote:
>> Hi all,
>>
>> As far as I know, a barrier exists between the map and reduce functions in
>> one round of MR. There is another barrier for the reducer to end the
>> job for that round. However, if we want to run several rounds using
>> the same map and reduce functions, then the barrier between reduce and
>> the map of the next round is NOT necessary, right? Since the reducer
>> only outputs a single value for each key, it might as well run a map
>> task for the next round immediately rather than waiting for all
>> reducers to finish. This way, the utilization of the machines between
>> rounds can be improved.
>>
>> Is there a setting in Hadoop to do that?
>>
>> Felix Halim
>>
>

