hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: Barrier between reduce and map of the next round
Date Tue, 09 Feb 2010 05:41:18 GMT
Hi,
>>m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
My understanding is it would be something like:
m1|(r1 m2)| m(identity) | r2, if you combine the r(i) and m(i+1), because of the hard distinction
between Rs & Ms.

Amogh


On 2/4/10 1:46 PM, "Felix Halim" <felix.halim@gmail.com> wrote:

Talking about barrier, currently there are barriers between anything:

m1 | r1 | m2 | r2 | ... | mK | rK

where | is the barrier.

I'm saying that the barrier between ri and m(i+1) is not necessary.
So it should go like this:

m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)

Here the result of m(K+1) is throwed away.
We take the result of rK only.

The shuffling is needed only between mi and ri.
There is no shuffling needed for ri and m(i+1).

Thus by removing the barrier between ri and m(i+1), the overall job
can be made faster.

Now the question is, can this be done using Chaining?
AFAIK, the chaining has to be defined before the job is started, right?
But because I don't know the value of K beforehand,
I want the chain to continue forever until some counter in reduce task is zero.

Felix Halim


On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
>
>>>However, from ri to m(i+1) there is an unnecessary barrier. m(i+1) should
>>> not need to wait for all reducers ri to finish, right?
>
> Yes, but r(i+1) cant be in the same job, since that requires another sort
> and shuffle phase ( barrier ). So you would end up doing, job(i) :
> m(i)r(i)m(i+1) . Job(i+1) : m(identity)r(i+1). Ofcourse, this is assuming
> you cant do r(i+1) in m(identity), for if you can then it doesn't need sort
> and shuffle , and hence your job would be again of the form m+rm* :)
>
> Amogh
>
> On 2/4/10 10:19 AM, "Felix Halim" <felix.halim@gmail.com> wrote:
>
> Hi Ed,
>
> Currently my program is like this:  m1,r1, m2,r2, ..., mK, rK. The
> barrier between mi and ri is acceptable since reducer has to wait for
> all map task to finish. However, from ri to m(i+1) there is an
> unnecessary barrier. m(i+1) should not need to wait for all reducers
> ri to finish, right?
>
> Currently, I created one Job for each mi,ri. So I have total of K
> jobs. Is there a way to chain them all together into a single Job?
> However, I don't know the value of K in advance. It has to be checked
> after each ri.  So I'm thinking that the job can speculatively do the
> chain over and over until it discover that some counter in ri is zero
> (so the result of m(K+1) is discarded, and the final result of rK is
> taken).
>
> Felix Halim
>
>
> On Thu, Feb 4, 2010 at 12:25 PM, Ed Mazur <mazur@cs.umass.edu> wrote:
>> Felix,
>>
>> You can use ChainMapper and ChainReducer to create jobs of the form
>> M+RM*. Is that what you're looking for? I'm not aware of anything that
>> allows you to have multiple reduce functions without the job
>> "barrier".
>>
>> Ed
>>
>> On Wed, Feb 3, 2010 at 9:41 PM, Felix Halim <felix.halim@gmail.com> wrote:
>>> Hi all,
>>>
>>> As far as I know, a barrier exists between map and reduce function in
>>> one round of MR. There is another barrier for the reducer to end the
>>> job for that round. However if we want to run in several rounds using
>>> the same map and reduce functions, then the barrier between reduce and
>>> the map of the next round is NOT necessary, right? Since the reducer
>>> only output a single value for each key. This reducer may as well run
>>> a map task for the next round immediately rather than waiting for all
>>> reducer to finish. This way, the utilization of the machines between
>>> rounds can be improved.
>>>
>>> Is there a setting in Hadoop to do that?
>>>
>>> Felix Halim
>>>
>>
>
>


Mime
View raw message