hadoop-hdfs-user mailing list archives

From Allen Wittenauer ...@apache.org>
Subject Re: how does hdfs determine what node to use?
Date Mon, 14 Mar 2011 16:28:58 GMT

On Mar 10, 2011, at 10:34 AM, Jeffrey Buell wrote:

> Rita said that she has 2 racks (not 2 nodes).  Rita, how many nodes per rack do you have?
> 
> To continue the thread, could there be a performance advantage to having greater replication
> in the shuffle or reduce phases?  That is, is hadoop smart enough that when it needs data
> that are not on the local node, it finds out which copy of that data is on the closest (in
> the network sense) node and gets it from there?

	The reduce phase doesn't read from HDFS.  It does the equivalent of an HTTP GET from the tasktracker
that holds the map's intermediate output.  The speedup here is that the reduce should get
scheduled on a node where one of the job's map tasks ran, especially a host that holds
significant map output.  This could potentially reduce network usage, but
in the end the savings are likely to be insignificant.
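
	To put rough numbers on that, here is a minimal sketch of the accounting.  It is not Hadoop code: the host names, the output sizes, and both function names are made up for illustration, and it assumes that map output fetched from the reducer's own host never crosses the network.

```python
def network_bytes(map_output_by_host, reducer_host):
    """Map output the reducer must fetch from hosts other than its own."""
    return sum(size for host, size in map_output_by_host.items()
               if host != reducer_host)


def best_reducer_host(map_output_by_host):
    """Host holding the most map output (hypothetical placement choice)."""
    return max(map_output_by_host, key=map_output_by_host.get)


# Hypothetical map output sizes per host, in MB, for a single reducer.
outputs = {"node-a": 400, "node-b": 250, "node-c": 350}

host = best_reducer_host(outputs)
total = sum(outputs.values())
remote = network_bytes(outputs, host)
print(f"reducer on {host}: {remote} of {total} MB still fetched remotely")
```

	Even with the best placement in this made-up case, most of the map output still has to come from other hosts.  The saving is capped by the share held on the single best host, and that share tends to shrink as map output is spread over more nodes.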