hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: Barrier between reduce and map of the next round
Date Tue, 09 Feb 2010 19:04:39 GMT
Felix, you might want to follow https://issues.apache.org/jira/browse/MAPREDUCE-1434.
We are discussing ideas very similar to what you've just described over there.

Arun

On Feb 8, 2010, at 9:49 PM, Felix Halim wrote:

> Hi,
>
> Currently the barrier between r(i) and m(i+1) is the Job barrier.
> That is, m(i+1) is blocked until all r(i) tasks finish (until Job i
> finishes).
>
> I'm saying this blocking is not necessary if we can concatenate them
> all into a single Job as an endless chain.
> Then m(i+1) can start immediately, even while r(i) is still running.
>
> The termination condition is that some counter, checked after r(i)
> finishes, is zero.
> In that case the result of m(i+1) is discarded.
>
> I don't know how to make it clearer than this...
>
> Felix Halim
>
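The endless chain Felix describes can be simulated in plain Java, outside MapReduce entirely. This is only a sketch of the control flow: the class name, the decrement "map" logic, and the counter definition are all hypothetical stand-ins, not anything from Hadoop.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy, framework-free simulation of the endless chain
 *  m1 | r1 m2 | r2 m3 | ...  A counter produced by each reduce step
 *  decides whether another round runs; when it reaches zero, the
 *  speculatively started m(K+1) would simply have its output discarded
 *  and the result of rK is kept. */
public class EndlessChain {

    // m(i): decrement every value, flooring at zero (stand-in logic).
    static Map<String, Integer> map(Map<String, Integer> in) {
        Map<String, Integer> out = new TreeMap<>();
        in.forEach((k, v) -> out.put(k, Math.max(0, v - 1)));
        return out;
    }

    // The counter r(i) would set: number of keys still above zero.
    static long counter(Map<String, Integer> data) {
        return data.values().stream().filter(v -> v > 0).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> data = new TreeMap<>(Map.of("a", 3, "b", 1));
        int k = 0;
        long c;
        do {
            data = map(data);   // m(i) feeds r(i) with no Job barrier
            c = counter(data);  // r(i) emits its output and the counter
            k++;
        } while (c > 0);        // K is discovered only at runtime
        System.out.println("K=" + k + " result=" + data);
    }
}
```

With the toy input above, the chain converges after three rounds; the point is that K is never known before the loop starts.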
> On Tue, Feb 9, 2010 at 1:41 PM, Amogh Vasekar <amogh@yahoo-inc.com>  
> wrote:
>> Hi,
>>>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>> My understanding is it would be something like:
>> m1 | (r1 m2) | m(identity) | r2, if you combine r(i) and m(i+1),
>> because of the hard distinction between Rs and Ms.
>>
>> Amogh
>>
>>
>> On 2/4/10 1:46 PM, "Felix Halim" <felix.halim@gmail.com> wrote:
>>
>> Talking about barriers: currently there is a barrier between every stage:
>>
>> m1 | r1 | m2 | r2 | ... | mK | rK
>>
>> where | is the barrier.
>>
>> I'm saying that the barrier between ri and m(i+1) is not necessary.
>> So it should go like this:
>>
>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>>
>> Here the result of m(K+1) is thrown away.
>> We take the result of rK only.
>>
>> The shuffling is needed only between mi and ri.
>> There is no shuffling needed for ri and m(i+1).
>>
>> Thus by removing the barrier between ri and m(i+1), the overall job
>> can be made faster.
>>
>> Now the question is: can this be done using chaining?
>> AFAIK, the chain has to be defined before the job is started, right?
>> But because I don't know the value of K beforehand,
>> I want the chain to continue until some counter in the reduce task is
>> zero.
>>
>> Felix Halim
>>
>>
>> On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar <amogh@yahoo-inc.com>  
>> wrote:
>>>
>>>>> However, from ri to m(i+1) there is an unnecessary barrier.
>>>>> m(i+1) should not need to wait for all reducers ri to finish,
>>>>> right?
>>>
>>> Yes, but r(i+1) can't be in the same job, since that requires
>>> another sort and shuffle phase (barrier). So you would end up doing
>>> job(i): m(i) r(i) m(i+1), and job(i+1): m(identity) r(i+1). Of
>>> course, this is assuming you can't do r(i+1) in m(identity), for if
>>> you can then it doesn't need sort and shuffle, and hence your job
>>> would again be of the form m+rm* :)
>>>
>>> Amogh
>>>
>>> On 2/4/10 10:19 AM, "Felix Halim" <felix.halim@gmail.com> wrote:
>>>
>>> Hi Ed,
>>>
>>> Currently my program is like this: m1,r1, m2,r2, ..., mK,rK. The
>>> barrier between mi and ri is acceptable, since the reducer has to
>>> wait for all map tasks to finish. However, from ri to m(i+1) there
>>> is an unnecessary barrier. m(i+1) should not need to wait for all
>>> reducers ri to finish, right?
>>>
>>> Currently, I create one Job for each mi,ri, so I have a total of K
>>> jobs. Is there a way to chain them all together into a single Job?
>>> However, I don't know the value of K in advance. It has to be
>>> checked after each ri. So I'm thinking that the job could
>>> speculatively do the chain over and over until it discovers that
>>> some counter in ri is zero (so the result of m(K+1) is discarded,
>>> and the final result of rK is taken).
>>>
>>> Felix Halim
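The K-separate-jobs driver Felix describes today boils down to a loop that submits one round at a time and inspects a counter after each reduce. A minimal sketch of that loop, with a fake round runner standing in for a real job submission (all names hypothetical; a real driver would read the counter from the finished job's Counters):

```java
import java.util.function.IntSupplier;

/** Driver loop for iterative rounds where K is unknown in advance:
 *  run a map/reduce round, read the counter its reducer reported, and
 *  stop as soon as the counter is zero. */
public class IterativeDriver {

    static int runUntilConverged(IntSupplier runRound) {
        int k = 0;
        while (true) {
            k++;                                // launch job k: m(k), r(k)
            long counter = runRound.getAsInt(); // counter set by r(k)
            if (counter == 0) return k;         // K discovered at runtime
        }
    }

    public static void main(String[] args) {
        // Fake "rounds" whose counters go 5, 2, 0: converges after 3 jobs.
        int[] counters = {5, 2, 0};
        int[] i = {0};
        int K = runUntilConverged(() -> counters[i[0]++]);
        System.out.println("K=" + K);
    }
}
```

The Job barrier sits inside `runRound`: each round's maps cannot start until the previous round's job (all of its reducers) has completed, which is exactly the cost the thread is discussing.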
>>>
>>>
>>> On Thu, Feb 4, 2010 at 12:25 PM, Ed Mazur <mazur@cs.umass.edu>  
>>> wrote:
>>>> Felix,
>>>>
>>>> You can use ChainMapper and ChainReducer to create jobs of the form
>>>> M+RM*. Is that what you're looking for? I'm not aware of anything  
>>>> that
>>>> allows you to have multiple reduce functions without the job
>>>> "barrier".
>>>>
>>>> Ed
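The M+RM* shape Ed mentions is set up roughly like this with the chain classes of the old `org.apache.hadoop.mapred` API (a configuration sketch only; `AMap`, `ARed`, and `BMap` are hypothetical user classes, and per-stage key/value types would match your own data):

```java
// One job of the form M+RM*: mapper chain, one reducer, then more
// mappers after the reduce, all without a second shuffle.
JobConf job = new JobConf(conf, Driver.class);

ChainMapper.addMapper(job, AMap.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

ChainReducer.setReducer(job, ARed.class,
    Text.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

// The M* part: extra map steps chained after the reduce, same job.
ChainReducer.addMapper(job, BMap.class,
    Text.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

JobClient.runJob(job);
```

Note the chain is fixed when the job is configured, which is exactly why this does not by itself solve Felix's unknown-K case.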
>>>>
>>>> On Wed, Feb 3, 2010 at 9:41 PM, Felix Halim <felix.halim@gmail.com>
>>>> wrote:
>>>>> Hi all,
>>>>>
>>>>> As far as I know, a barrier exists between the map and reduce
>>>>> functions in one round of MR. There is another barrier for the
>>>>> reducer to end the job for that round. However, if we want to run
>>>>> several rounds using the same map and reduce functions, then the
>>>>> barrier between reduce and the map of the next round is NOT
>>>>> necessary, right? Since the reducer only outputs a single value
>>>>> for each key, it may as well run a map task for the next round
>>>>> immediately rather than waiting for all reducers to finish. This
>>>>> way, the utilization of the machines between rounds can be
>>>>> improved.
>>>>>
>>>>> Is there a setting in Hadoop to do that?
>>>>>
>>>>> Felix Halim
>>>>>
>>>>
>>>
>>>
>>
>>

