hadoop-general mailing list archives

From elton sky <eltonsky9...@gmail.com>
Subject Re: Why single thread for HDFS?
Date Wed, 07 Jul 2010 02:10:10 GMT

It seems HP has done block-based parallel reading from different datanodes.
Though not at the disk level, they achieve a 4 Gb/s rate with 9 readers (500 Mb/s
per reader). I didn't see anywhere to download their code to play around with, a pity~

BTW, can we specify which disk to read from with Java?

On Wed, Jul 7, 2010 at 1:30 AM, Steve Loughran <stevel@apache.org> wrote:

> Michael Segel wrote:
>> Uhm...
>> That's not really true. It gets a bit more complicated than that.
>> If you're talking about M/R jobs, you don't want to use threads in your
>> map() routine. While this is possible, it's going to be really hard to
>> justify the extra parallelism along with the need to wait for all of the
>> threads to complete before you can end the map() method.
>> If you're talking about a way to copy files from one cluster to another
>> in Hadoop, you can find out the block list that makes up the file. As long
>> as the file is static, meaning no one is writing/splitting/compacting the
>> file, you could copy it. Here being multithreaded could work: you'd have
>> one thread per block that reads from one machine and then writes
>> directly to the other. Of course, you'll need to figure out where to write
>> the block, or rather tie in to HDFS.
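The one-thread-per-block copy idea above can be sketched in plain Java. This is only an illustration of the concurrency pattern, not HDFS code: a local file stands in for the remote blocks, `BLOCK_SIZE` is a made-up stand-in for the much larger HDFS block size, and the class name is mine.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class ParallelBlockRead {
    // stand-in for the (much larger) HDFS block size
    static final int BLOCK_SIZE = 4096;

    /** Reads the whole file with one thread per block-sized region. */
    public static byte[] readInParallel(Path file) throws Exception {
        long size = Files.size(file);
        byte[] out = new byte[(int) size];
        List<Thread> threads = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            for (long off = 0; off < size; off += BLOCK_SIZE) {
                int len = (int) Math.min(BLOCK_SIZE, size - off);
                // the buffer is a window onto out[]; its position doubles
                // as the file offset, since the array mirrors the file
                ByteBuffer buf = ByteBuffer.wrap(out, (int) off, len);
                Thread t = new Thread(() -> {
                    try {
                        // positional FileChannel reads are safe from
                        // multiple threads on one channel
                        while (buf.hasRemaining()) {
                            if (ch.read(buf, buf.position()) < 0) break;
                        }
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                threads.add(t);
                t.start();
            }
            // as noted above, you must wait for every block thread to finish
            for (Thread t : threads) t.join();
        }
        return out;
    }
}
```

In a real cluster copy, each thread would read its block from one datanode and write it to the destination; the hard part Michael mentions, deciding where the destination block lands, isn't shown here.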
> There's a paper by Russ Perry using HDFS as a filestore for raster
> processing, where he modified DfsClient to get all the locations of a file
> and let the caller decide where to read blocks from.
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html
> The advantage of this is that the caller can do the striping across
> machines, keeping every server busy by asking for files from each of them. Of
> course, this ignores the trend toward many-HDD servers; DfsClient can't
> currently see which physical disk a file is on, which you'd need if the
> client wanted to keep every disk on every server busy during a big read
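The caller-side striping Steve describes boils down to a scheduling decision: given each block's replica hosts, assign blocks to hosts so the load spreads and every server stays busy. A minimal greedy sketch, with made-up class and host names (the real DfsClient modification in the paper is not public):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockStriper {
    /**
     * Assigns each block (by index) to one of its replica hosts,
     * greedily preferring the host with the fewest blocks assigned
     * so far, so reads stripe across the machines.
     */
    public static List<String> assignHosts(List<List<String>> replicaHosts) {
        Map<String, Integer> load = new HashMap<>();
        List<String> assignment = new ArrayList<>();
        for (List<String> replicas : replicaHosts) {
            String best = replicas.get(0);
            for (String h : replicas) {
                if (load.getOrDefault(h, 0) < load.getOrDefault(best, 0)) {
                    best = h;
                }
            }
            load.merge(best, 1, Integer::sum);
            assignment.add(best);
        }
        return assignment;
    }
}
```

With four blocks each replicated on hosts "a" and "b", this alternates a, b, a, b instead of hammering one datanode, which is the whole point of letting the caller see the block locations. As Steve notes, the same trick can't reach individual disks, because the client never learns which disk holds a block.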
