hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Hadoop Design Question
Date Thu, 06 Nov 2008 17:51:22 GMT
On Nov 6, 2008, at 11:29 AM, Ricky Ho wrote:

> Disk I/O overhead
> ==================
> - The output of a Map task is written to a local disk and then later
> on uploaded to the Reduce task.  While this enables a simple recovery
> strategy when the map task fails, it incurs additional disk I/O
> overhead.

That is correct. However, Linux does very well at using extra RAM for
buffer caches, so as long as you enable write-behind it won't be a
performance problem. You are right that the primary motivation is both
recoverability and not needing the reduces to run until after the maps
finish.
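
If the extra spill I/O does become a concern, one common mitigation (a
sketch of my own, not something discussed above) is to compress the map
output and enlarge the in-memory sort buffer so that fewer, smaller
spills reach the local disk. This uses the 0.18-era
org.apache.hadoop.mapred.JobConf API; the io.sort.mb value of 200 is
purely illustrative.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapOutputSpillTuning {
  public static void configure(JobConf conf) {
    // Compress the intermediate map output that is spilled to local disk
    // and later fetched by the reduces; trades some CPU for less disk
    // and network I/O.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
    // A larger in-memory sort buffer means fewer spills to disk
    // (illustrative value).
    conf.setInt("io.sort.mb", 200);
  }
}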

>   So I am wondering if there is an option to bypass the step of
> writing the map result to the local disk.

Currently no.

> - In the current setting, it sounds like no reduce task will be
> started before all map tasks have completed.  If there are a few
> slow-running map tasks, the whole job will be delayed.

The application's reduce function can't start until the last map  
finishes because the input to the reduce is sorted. Since the last map  
may generate the first keys that must be given to the reduce, the  
reduce must wait.
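
For concreteness, here is a generic sum reducer written against the
0.18-era org.apache.hadoop.mapred API (my own sketch, not code from this
thread). The framework calls reduce() once per key, with keys presented
in fully sorted order, which is why the first call cannot happen until
the last map, which might emit the smallest key, has finished.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Invoked once per key, in sorted key order, with all values for that key.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}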

> - The overall job execution can be shortened if the reduce tasks can
> start their processing as soon as some map results are available
> rather than waiting for all the map tasks to complete.

But that would violate the framework's specification that the input to
each reduce is completely sorted.

> - Therefore it is impossible for the reduce phase of Job1 to stream
> its output data to a file while the map phase of Job2 starts reading
> the same file.  Job2 can only start after ALL REDUCE TASKS of Job1
> have completed, which makes pipelining between jobs impossible.

It is currently not supported, but the framework could be extended to  
let the client add input splits after the job has started. That would  
remove the hard synchronization between jobs.

> - This means the partitioning function has to be chosen carefully to  
> make sure the workload of the reduce processes is balanced.  (maybe  
> not a big deal)

Yes, the partitioner must balance the workload between the reduces.
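
As an illustration (again my own sketch, not from this thread), the
following is essentially what the default hash partitioning does in the
0.18-era API; when hashing leaves some reduces with far more data than
others, a job can plug in its own getPartition() with the same
signature to spread hot keys more evenly.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashLikePartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf conf) { }

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit, then mod by the number of reduces so every
    // key maps to exactly one reduce.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}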

> - Are there any thoughts of running a pool of reduce tasks on the
> same key and having them combine their results later?

That is called the combiner. It is called multiple times as the data  
is merged together. See the word count example. If the reducer does  
data reduction, using combiners is very important for performance.
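
Wiring that up is one line of job setup in the 0.18-era API. The sketch
below is mine (assuming the SumReducer class sketched earlier in this
message lives in the same package); it reuses the reducer as the
combiner, which is safe because summing counts is associative and
commutative.

import org.apache.hadoop.mapred.JobConf;

public class CombinerSetup {
  public static void configure(JobConf conf) {
    // The combiner runs on the map side (and again during merges),
    // collapsing many (word, 1) pairs into a single (word, partialCount)
    // pair before the data is written to disk and shuffled to the reduces.
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);
  }
}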

-- Owen
