hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Why single thread for HDFS?
Date Tue, 06 Jul 2010 15:30:40 GMT
Michael Segel wrote:
> Uhm...
> 
> That's not really true. It gets a bit more complicated than that.
> 
> If you're talking about M/R jobs, you don't want to do threads in your map() routine.
> While this is possible, it's going to be really hard to justify the extra parallelism,
> given the need to wait for all of the threads to complete before you can end the map() method.
>
> If you're talking about a way to copy files from one cluster to another... in Hadoop
> you can find out the block lists that make up the file. As long as the file is static, meaning
> no one is writing/splitting/compacting the file, you could copy it. Here being multi-threaded
> could work.
> You'd have one thread per block that will read from one machine, and then write directly
> to the other. Of course you'll need to figure out where to write the block, or rather tie
> in to HDFS.

There's a paper by Russ Perry on using HDFS as a filestore for raster 
processing, where he modified DfsClient to get all the locations of a 
file and let the caller decide where to read blocks from.

http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html

The advantage of this is that the caller can do the striping across 
machines and keep every server busy by asking for files from each of them. 
Of course, this ignores the trend to many-HDD servers; DfsClient can't 
currently see which physical disk a file is on, which you'd need if the 
client wanted to keep every disk on every server busy during a big read.
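
To make the striping idea concrete, here is a rough sketch of how a caller 
might pick a replica for each block once it has the full location list. 
This is not the code from the paper and not the DfsClient API: the Block 
record and plan_reads() are made-up names for illustration, and the greedy 
"least bytes assigned so far" rule is just one plausible policy. It only 
balances across hosts, not across disks, for the reason given above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for one block's location record."""
    offset: int              # byte offset of the block within the file
    length: int              # block length in bytes
    hosts: Tuple[str, ...]   # datanodes that hold a replica of this block


def plan_reads(blocks: List[Block]) -> Dict[str, List[Block]]:
    """Assign every block to the replica host with the fewest bytes so far.

    Ties are broken by host name so the plan is deterministic.
    """
    load = defaultdict(int)    # bytes already assigned to each host
    plan = defaultdict(list)   # host -> blocks the caller will read from it
    for block in sorted(blocks, key=lambda b: b.offset):
        host = min(block.hosts, key=lambda h: (load[h], h))
        load[host] += block.length
        plan[host].append(block)
    return dict(plan)


# Assumed example data: four 64 MB blocks, three replicas each, four hosts.
MB = 1024 * 1024
example_blocks = [
    Block(0 * 64 * MB, 64 * MB, ("a", "b", "c")),
    Block(1 * 64 * MB, 64 * MB, ("a", "b", "d")),
    Block(2 * 64 * MB, 64 * MB, ("a", "c", "d")),
    Block(3 * 64 * MB, 64 * MB, ("b", "c", "d")),
]
example_plan = plan_reads(example_blocks)
```

With the example data each of the four hosts ends up serving exactly one 
block, so a reader with one thread per host would keep every server busy. 
A real client would also have to cope with a host being slow or dead and 
fall back to another replica.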
