hadoop-mapreduce-user mailing list archives

From: Felix Halim <felix.ha...@gmail.com>
Subject: Re: Barrier between reduce and map of the next round
Date: Tue, 09 Feb 2010 05:49:52 GMT
Hi,

Currently the barrier between r(i) and m(i+1) is the Job barrier.
That is, m(i+1) is blocked until all r(i) tasks finish (until Job i finishes).

I'm saying this blocking is not necessary if we can concatenate them
all into a single Job as an endless chain.
Then m(i+1) can start immediately, even before every r(i) task has finished.

The termination condition is that some counter, checked after r(i) finishes, is zero.
When that happens, the result of the speculatively started m(i+1) is discarded.

I don't know how to make it clearer than this...
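
For concreteness, here is that per-round pattern as a driver loop with the
counter-based termination test. This is only a minimal sketch against the old
org.apache.hadoop.mapred API; MyMapper, MyReducer, and the UPDATES counter are
made-up names standing in for m(i), r(i), and the convergence counter:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class IterativeDriver {
      // Hypothetical counter: r(i) increments it once per record that
      // still needs another round; zero means we can stop.
      public enum Round { UPDATES }

      public static void main(String[] args) throws Exception {
        String input = args[0];
        int i = 0;
        long updates;
        do {
          JobConf conf = new JobConf(IterativeDriver.class);
          conf.setJobName("round-" + i);
          conf.setMapperClass(MyMapper.class);    // m(i), hypothetical
          conf.setReducerClass(MyReducer.class);  // r(i), hypothetical
          FileInputFormat.setInputPaths(conf, input);  // output of round i-1
          FileOutputFormat.setOutputPath(conf, new Path("round-" + i));
          // runJob() blocks until the last r(i) task is done: the Job barrier.
          RunningJob job = JobClient.runJob(conf);
          updates = job.getCounters().getCounter(Round.UPDATES);
          input = "round-" + i;
          i++;
        } while (updates > 0);  // K is only discovered here, at run time
      }
    }

The runJob() call is exactly the barrier in question: no m(i+1) task can be
scheduled until every r(i) task has finished.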

Felix Halim

On Tue, Feb 9, 2010 at 1:41 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
> Hi,
>>>m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
> My understanding is it would be something like:
> m1 | (r1 m2) | m(identity) | r2, if you combine the r(i) and m(i+1), because
> of the hard distinction between Rs & Ms.
>
> Amogh
>
>
> On 2/4/10 1:46 PM, "Felix Halim" <felix.halim@gmail.com> wrote:
>
> Talking about barriers: currently there is a barrier between every stage:
>
> m1 | r1 | m2 | r2 | ... | mK | rK
>
> where | is the barrier.
>
> I'm saying that the barrier between ri and m(i+1) is not necessary.
> So it should go like this:
>
> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>
> Here the result of m(K+1) is thrown away;
> we take the result of rK only.
>
> Shuffling is needed only between mi and ri.
> No shuffling is needed between ri and m(i+1).
>
> Thus by removing the barrier between ri and m(i+1), the overall job
> can be made faster.
>
> Now the question is, can this be done using Chaining?
> AFAIK, the chaining has to be defined before the job is started, right?
> But because I don't know the value of K beforehand,
> I want the chain to continue indefinitely until some counter in the reduce
> task is zero.
>
> Felix Halim
>
>
> On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
>>
>>>>However, from ri to m(i+1) there is an unnecessary barrier. m(i+1) should
>>>> not need to wait for all reducers ri to finish, right?
>>
>> Yes, but r(i+1) can't be in the same job, since that requires another sort
>> and shuffle phase (barrier). So you would end up doing job(i):
>> m(i)r(i)m(i+1) and job(i+1): m(identity)r(i+1). Of course, this assumes
>> you can't do r(i+1) in m(identity), for if you can then it doesn't need
>> sort and shuffle, and hence your job would again be of the form m+rm* :)
>>
>> Amogh
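
(For reference, Amogh's decomposition would be configured roughly as below
with the old-API chaining classes. A minimal sketch only: MapI, ReduceI,
MapI1 and ReduceI1 are hypothetical stand-ins for m(i), r(i), m(i+1) and
r(i+1), and Text keys/values are assumed throughout.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class TwoJobChain {
      public static void configure(JobConf jobI, JobConf jobI1) {
        // job(i): m(i) | r(i) m(i+1) -- m(i+1) is chained after r(i), so it
        // runs inside the same reduce task, with no extra sort/shuffle.
        ChainMapper.addMapper(jobI, MapI.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        ChainReducer.setReducer(jobI, ReduceI.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        ChainReducer.addMapper(jobI, MapI1.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));

        // job(i+1): m(identity) | r(i+1) -- a second job is still needed,
        // because r(i+1) needs its own sort-and-shuffle, i.e. the barrier.
        jobI1.setMapperClass(IdentityMapper.class);
        jobI1.setReducerClass(ReduceI1.class);
      }
    }

The identity map phase does no work of its own; it exists only to feed the
sort/shuffle in front of r(i+1).)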
>>
>> On 2/4/10 10:19 AM, "Felix Halim" <felix.halim@gmail.com> wrote:
>>
>> Hi Ed,
>>
>> Currently my program is like this: m1, r1, m2, r2, ..., mK, rK. The
>> barrier between mi and ri is acceptable, since a reducer has to wait for
>> all map tasks to finish. However, from ri to m(i+1) there is an
>> unnecessary barrier: m(i+1) should not need to wait for all ri reducers
>> to finish, right?
>>
>> Currently, I create one Job for each mi, ri, so I have a total of K
>> jobs. Is there a way to chain them all together into a single Job?
>> I don't know the value of K in advance; it has to be checked after
>> each ri. So I'm thinking that the job could speculatively run the
>> chain over and over until it discovers that some counter in ri is zero
>> (at which point the result of m(K+1) is discarded, and the final
>> result of rK is taken).
>>
>> Felix Halim
>>
>>
>> On Thu, Feb 4, 2010 at 12:25 PM, Ed Mazur <mazur@cs.umass.edu> wrote:
>>> Felix,
>>>
>>> You can use ChainMapper and ChainReducer to create jobs of the form
>>> M+RM*. Is that what you're looking for? I'm not aware of anything that
>>> allows you to have multiple reduce functions without the job
>>> "barrier".
>>>
>>> Ed
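
(A minimal sketch of the M+RM* shape Ed describes, using the old-API chaining
classes; AMap, BMap, CReduce and DMap are hypothetical, with Text keys/values
assumed:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainJob {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // M+ : one or more mappers chained before the reduce, each feeding
        // the next directly, with no barrier between them.
        ChainMapper.addMapper(job, AMap.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        // R : exactly one reducer, hence one sort/shuffle -- the barrier.
        ChainReducer.setReducer(job, CReduce.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        // M* : zero or more mappers after the reduce, run in the reduce task.
        ChainReducer.addMapper(job, DMap.class, Text.class, Text.class,
            Text.class, Text.class, true, new JobConf(false));
        JobClient.runJob(job);
      }
    }

Everything here is one job, so there is still exactly one reduce; a second
reduce function would force a second job.)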
>>>
>>> On Wed, Feb 3, 2010 at 9:41 PM, Felix Halim <felix.halim@gmail.com>
>>> wrote:
>>>> Hi all,
>>>>
>>>> As far as I know, a barrier exists between the map and reduce functions
>>>> in one round of MR, and there is another barrier where the reducers end
>>>> the job for that round. However, if we want to run several rounds using
>>>> the same map and reduce functions, the barrier between reduce and the
>>>> map of the next round is NOT necessary, right? Since a reducer only
>>>> outputs a single value for each key, it may as well run a map task for
>>>> the next round immediately rather than waiting for all reducers to
>>>> finish. This way, the utilization of the machines between rounds can
>>>> be improved.
>>>>
>>>> Is there a setting in Hadoop to do that?
>>>>
>>>> Felix Halim
>>>>
>>>
>>
>>
>
>
