hadoop-general mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Why single thread for HDFS?
Date Tue, 06 Jul 2010 14:10:46 GMT


If all you want is a faster -cp option, then if you know your initial block list
and locations, you need to generate the target block list and then create one thread
per block, so that each block is processed in its own thread.
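
A minimal sketch of the thread-per-block idea, using ordinary local files as a stand-in for HDFS. It assumes the target can be pre-sized and written at arbitrary offsets, which a local file allows; it does not call any HDFS API, and the names block_ranges, copy_block and parallel_copy are made up for illustration.

```python
import os
import threading


def block_ranges(file_size, block_size):
    """Return (offset, length) pairs covering the file, one pair per block."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def copy_block(src_path, dst_path, offset, length):
    """Copy one block. Each thread opens its own handles, so no seek
    position is shared between threads."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.write(src.read(length))


def parallel_copy(src_path, dst_path, block_size):
    """Copy src to dst with one thread per block (illustrative only)."""
    file_size = os.path.getsize(src_path)
    # Pre-size the target so every block has a place to land.
    # Assumption: the target supports writes at arbitrary offsets.
    with open(dst_path, "wb") as dst:
        dst.truncate(file_size)
    threads = [
        threading.Thread(target=copy_block,
                         args=(src_path, dst_path, off, length))
        for off, length in block_ranges(file_size, block_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Note that the block list can only be computed because the file size is known up front, and that a real version would need to collect errors from the worker threads rather than ignore them.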

You don't need to use the local disk; just read/write each block in 'paged' increments
(pages as in 4/16/32/64K page sizes).
This removes the I/O argument raised by another poster.
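
A rough sketch of the paged read/write for a single block, so that only one page is held in memory at a time and nothing is staged on local disk. It works on any seekable file-like objects; the function name copy_range_paged and the 64K default are assumptions for illustration, not anything taken from Hadoop.

```python
PAGE_SIZE = 64 * 1024  # hypothetical default; 4K/16K/32K work the same way


def copy_range_paged(src, dst, offset, length, page_size=PAGE_SIZE):
    """Copy `length` bytes starting at `offset` from src to dst,
    one page at a time. Returns the number of bytes actually copied."""
    src.seek(offset)
    dst.seek(offset)
    remaining = length
    copied = 0
    while remaining > 0:
        # Never read past the end of the block.
        page = src.read(min(page_size, remaining))
        if not page:
            # Source ended before the block was complete.
            break
        dst.write(page)
        copied += len(page)
        remaining -= len(page)
    return copied
```

The memory held per thread is bounded by the page size rather than the block size, which is what makes one thread per block affordable.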

This may be faster than the current process.

HTH

-Mike

> Date: Tue, 6 Jul 2010 13:46:34 +1000
> Subject: Re: Why single thread for HDFS?
> From: eltonsky9404@gmail.com
> To: general@hadoop.apache.org
> 
> >Basically, your point is that hadoop dfs -cp is relatively slow and could
> >be made faster.  If HDFS had a more multi-threaded design, it would make cp
> >operations faster.
> What I mean is, if we know the size of a file we can parallelize by calculating
> blocks. Otherwise we couldn't.
> 
> 
> On Tue, Jul 6, 2010 at 10:47 AM, Allen Wittenauer
> <awittenauer@linkedin.com> wrote:
> 
> >
> > On Jul 5, 2010, at 5:01 PM, elton sky wrote:
> > > Well, this sounds good when you have many small files and you concat() them
> > > into a big one. I am talking about splitting a big file into blocks and
> > > copying a few blocks in parallel.
> >
> > Basically, your point is that hadoop dfs -cp is relatively slow and could
> > be made faster.  If HDFS had a more multi-threaded design, it would make cp
> > operations faster.
> >
> > This sounds like a particularly high cost for an operation that is rarely
> > utilized.  [This is much more interesting in a distcp context, but even then
> > not that great.  distcp in my experience is usually used to push a bunch of
> > files, so you get your parallelism at the file level.  Typically these are
> > part files that are usually about the same size.]
> >
> >
> >
 		 	   		  