hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@math.unl.edu>
Subject Circumventing Hadoop's data placement policy
Date Sat, 23 May 2009 17:13:44 GMT
Hey all,

Had a problem I wanted to ask advice on.  The Caltech site I work with
currently has a few GridFTP servers which are on the same physical
machines as the Hadoop datanodes, and a few that aren't.  The GridFTP
server has a libhdfs backend which writes incoming network data into  
HDFS.

They've found that the GridFTP servers which are co-located with HDFS
datanodes have poor performance because data is incoming at a much
faster rate than the local HDD can handle.  The standalone GridFTP
servers, however, push data out to multiple nodes at once, and can
handle the incoming data just fine (>200MB/s).

Is there any way to turn off the preference for the local node?  Can  
anyone think of a good workaround to trick HDFS into thinking the  
client isn't on the same node?

Brian