hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Hadoop Internal Architecture writeup
Date Fri, 28 Nov 2008 19:44:21 GMT

On Nov 28, 2008, at 9:45 AM, Ricky Ho wrote:

> [Ricky]  What exactly does job.split contain? I assume it
> contains the specification for each split (but not its data), such
> as the corresponding file and the byte range within that
> file. Correct?

Yes
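
To make that concrete, here is a minimal sketch of what one such entry might carry. The class name SplitSpec, its fields, and the fixed-size cutting rule are made up for illustration; this is not the actual job.split format, only the idea of describing a byte range of a file without holding its data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical descriptor for one split: where the data lives, not the data itself.
final class SplitSpec {
    final String path;   // file the split belongs to
    final long start;    // byte offset of the first byte in the split
    final long length;   // number of bytes in the split

    SplitSpec(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    // Cut a file of fileLength bytes into consecutive ranges of at most splitSize bytes.
    // Assumption: plain fixed-size ranges; a real input format may choose differently.
    static List<SplitSpec> splitsFor(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SplitSpec> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new SplitSpec(path, offset, Math.min(splitSize, fileLength - offset)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print each split as a half-open byte range of the (made-up) input file.
        for (SplitSpec s : splitsFor("/data/input.txt", 150, 64)) {
            System.out.println(s.path + " [" + s.start + ", " + (s.start + s.length) + ")");
        }
    }
}
```

Each map task would be handed one such descriptor and read only its own byte range from the file.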

> [Ricky]  I am curious why the reduce execution can't start
> earlier (before all the map tasks have completed).

The contract is that each reduce is given the keys in sorted order.
The reduce can't start until it is sure it has the first key, and any
map that is still running could emit a key that sorts before
everything seen so far. So that can only happen after the maps are all
finished.
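
A small sketch of why the first key is not known early. It merges already-sorted per-map outputs with a heap; the class name SortedMergeSketch and the sample keys are invented for illustration, and this is not the framework's actual merge code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedMergeSketch {
    // One heap entry: the current head key of a map output plus the rest of that output.
    private static final class Head {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }
    }

    // Merge already-sorted per-map outputs into one globally sorted key sequence.
    static List<String> merge(List<List<String>> mapOutputs) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));
        for (List<String> output : mapOutputs) {
            Iterator<String> it = output.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();          // smallest head key across all outputs
            merged.add(h.key);
            if (h.rest.hasNext()) {
                heap.add(new Head(h.rest.next(), h.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> outputs = Arrays.asList(
            Arrays.asList("m", "q"),   // map 1, finished early
            Arrays.asList("k", "z"),   // map 2, finished early
            Arrays.asList("a", "b"));  // map 3, finished last
        System.out.println(merge(outputs.subList(0, 2))); // only the first two maps
        System.out.println(merge(outputs));               // all three maps
    }
}
```

With only the first two outputs the smallest key looks like "k"; the third map, which finishes last, supplies "a". A reduce that had started on "k" would have broken the sorted-order contract.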

> [Ricky]  Do you mean that if the job has 5000 splits, it requires
> 5000 TaskTracker VMs (one for each split)?

In 0.19, you can enable the framework to reuse JVMs between tasks in
the same job. See HADOOP-249.

> [Ricky]  Is this a well-known folder within HDFS?

It is configured by the cluster admin. However, applications should  
*not* depend on the contents or even visibility of that directory. It  
will almost certainly become inaccessible to clients as part of  
increasing security.

-- Owen
