Subject: Re: Why single thread for HDFS?
From: elton sky <eltonsky9404@gmail.com>
To: general@hadoop.apache.org
Date: Wed, 7 Jul 2010 12:10:10 +1000
In-Reply-To: <4C334C20.9010105@apache.org>

Steve,

It seems HP has already done block-based parallel reading from different
datanodes. Though they don't go down to the disk level, they achieve a
4 Gb/s rate with 9 readers (500 Mb/s each). I didn't see anywhere I can
download their code to play around with, a pity.

BTW, can we specify which disk to read from with Java?
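As far as I can tell, from plain Java the only handle you get is a path;
which physical disk that hits depends on where the path is mounted. So if
every disk has its own mount point (the way dfs.data.dir is usually laid
out on a datanode), picking the directory is as close as you can get. A toy
sketch, with made-up mount points and file names:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ReadFromOneMount {
        public static void main(String[] args) throws IOException {
            // Made-up layout: each physical disk mounted at its own
            // directory, e.g. /data/disk0 .. /data/disk11, the way
            // dfs.data.dir is often configured on a datanode.
            File f = new File("/data/disk3/scratch/somefile");  // made-up path

            // All Java sees is the path; the OS maps it to whatever device
            // backs that mount point. There is no "read from disk 3" API.
            FileInputStream in = new FileInputStream(f);
            byte[] buf = new byte[64 * 1024];
            int n;
            long total = 0;
            while ((n = in.read(buf)) != -1) {
                total += n;               // just count bytes in this toy
            }
            in.close();
            System.out.println("read " + total + " bytes from " + f);
        }
    }

Or is there something lower level that I'm missing?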

On Wed, Jul 7, 2010 at 1:30 AM, Steve Loughran wrote:

> Michael Segel wrote:
>
>> Uhm...
>>
>> That's not really true. It gets a bit more complicated than that.
>>
>> If you're talking about M/R jobs, you don't want to do threads in your
>> map() routine. While it is possible, it's going to be really hard to
>> justify the extra parallelism along with the need to wait for all of the
>> threads to complete before you can end the map() method.
>>
>> If you're talking about a way to copy files from one cluster to another
>> in Hadoop, you can find out the list of blocks that make up the file. As
>> long as the file is static, meaning no one is writing/splitting/compacting
>> the file, you could copy it. Here being multi-threaded could work: you'd
>> have one thread per block that reads from one machine and then writes
>> directly to the other. Of course you'll need to figure out where to write
>> the block, or rather tie into HDFS.
>
> There's a paper by Russ Perry on using HDFS as a filestore for raster
> processing, where he modified DfsClient to get all the locations of a
> file and let the caller decide where to read blocks from.
>
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html
>
> The advantage of this is that the caller can do the striping across
> machines, keeping every server busy by asking for files from each of
> them. Of course, this ignores the trend to many-HDD servers; DfsClient
> can't currently see which physical disk a file is on, which you'd need
> if the client wanted to keep every disk on every server busy during a
> big read.
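
Block locations are at least visible from the public API already, so a
client can do its own striping at block granularity. It isn't the
per-replica control Perry added (the stock client still decides which
replica each read goes to), but it gives block-level parallelism. A rough,
untested sketch of what I mean, one thread per block, with a made-up path;
reading whole blocks into byte arrays is just to keep the example short:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelBlockRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            final Path path = new Path("/user/elton/bigfile");  // made-up path

            FileStatus status = fs.getFileStatus(path);
            // One BlockLocation per block, including the datanodes holding it.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // One thread per block -- fine for a toy example.
            ExecutorService pool = Executors.newFixedThreadPool(blocks.length);
            for (final BlockLocation block : blocks) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Separate stream per thread; positional reads
                            // don't share a seek pointer.
                            FSDataInputStream in = fs.open(path);
                            byte[] buf = new byte[(int) block.getLength()];
                            in.readFully(block.getOffset(), buf, 0, buf.length);
                            in.close();
                            System.out.println("read block at " + block.getOffset()
                                + " hosted on " + Arrays.toString(block.getHosts()));
                            // ... hand buf to whoever consumes it, or write it
                            // out to the other cluster ...
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }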