hadoop-general mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Why single thread for HDFS?
Date Tue, 06 Jul 2010 14:06:07 GMT


That's not really true. It gets a bit more complicated than that.

If you're talking about M/R jobs, you don't want to use threads in your map() routine. While
this is possible, it's going to be really hard to justify the extra parallelism given
the need to wait for all of the threads to complete before you can end the map() method.

If you're talking about a way to copy files from one cluster to another in Hadoop, you
can find out the list of blocks that make up the file. As long as the file is static, meaning
no one is writing to, splitting, or compacting the file, you could copy it. Here being
multi-threaded could work.
You'd have one thread per block that reads from one machine and then writes directly to
the other. Of course you'll need to figure out where to write the block, or rather tie in
to HDFS.

This is more complex than an M/R job but will work.
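
A minimal sketch of the one-thread-per-block idea, written in Python against ordinary local
files rather than HDFS. It only illustrates splitting a static file into block ranges from
its size and copying those ranges concurrently; it does not look up real HDFS block
locations or decide where blocks should land on the target cluster. The names block_ranges,
copy_block and parallel_copy are hypothetical, and the 64 MB block size and 4 workers are
assumed defaults, not values taken from this thread. It relies on os.pread and os.pwrite,
so it assumes a Unix-like system.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def block_ranges(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]


def copy_block(src_fd, dst_fd, offset, length):
    """Copy one block. pread/pwrite take explicit offsets, so the threads
    never share (or fight over) a file position."""
    pos = offset
    remaining = length
    while remaining > 0:
        chunk = os.pread(src_fd, min(remaining, 1 << 20), pos)
        if not chunk:
            # The file is assumed static; a short read means it changed.
            raise IOError("source shrank while copying")
        written = 0
        while written < len(chunk):
            written += os.pwrite(dst_fd, chunk[written:], pos + written)
        pos += len(chunk)
        remaining -= len(chunk)


def parallel_copy(src, dst, block_size=64 * 1024 * 1024, workers=4):
    """Copy src to dst, one task per block, with a bounded thread pool."""
    size = os.path.getsize(src)
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(dst_fd, size)  # pre-size so blocks can land in any order
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(copy_block, src_fd, dst_fd, off, n)
                       for off, n in block_ranges(size, block_size)]
            for f in futures:
                f.result()  # wait for every block and surface any error
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

The pool is bounded instead of spawning one thread per block all at once, since too many
concurrent readers on one disk can hurt throughput.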

If you're reading from the cloud and then writing to the UNIX file system, you want to write
the blocks in serial order. (KISS).



> Date: Tue, 6 Jul 2010 00:30:06 -0700
> Subject: Re: Why single thread for HDFS?
> From: gautam.singaraju@gmail.com
> To: general@hadoop.apache.org
> To add to Jay Booth's points, adding multi-threaded capability to HDFS
> will bring down the performance. Consider a production server where
> 4-5 jobs are running on a low-end commodity server. Currently, that is
> 4 threads reading and writing from the hard disk. Making it a
> multi-threaded read and write will create many threads (Number of Jobs
> * Default HDFS Block size * 1024 KB/ file system block sizes). For a
> low-end hard disk with limited RPM cycles, a higher number of threads
> will decrease the performance. As the number of disk accesses increases
> from 1, the throughput will increase. But after 3-4 parallel disk
> accesses, the performance will start to decrease. You can use
> performance analytics tools (like IOMeter) to identify the *ideal*
> number of parallel disk accesses for a specified hard-disk.
> ---
> Gautam
> On Mon, Jul 5, 2010 at 8:46 PM, elton sky <eltonsky9404@gmail.com> wrote:
> >>Basically, your point is that hadoop dfs -cp is relatively slow and could
> > be made faster.  If HDFS had a more multi-threaded design, it would make cp
> > operations faster.
> > What I mean is, if we have the size of a file we can parallelize by calculating
> > blocks. Otherwise we couldn't.
> >
> >
> > On Tue, Jul 6, 2010 at 10:47 AM, Allen Wittenauer
> > <awittenauer@linkedin.com> wrote:
> >
> >>
> >> On Jul 5, 2010, at 5:01 PM, elton sky wrote:
> >> > Well, this sounds good when you have many small files, you concat() them
> >> > into a big one. I am talking about split a big file into blocks and copy all
> >> > a few blocks in parallel.
> >>
> >> Basically, your point is that hadoop dfs -cp is relatively slow and could
> >> be made faster.  If HDFS had a more multi-threaded design, it would make cp
> >> operations faster.
> >>
> >> This sounds like a particularly high cost for an operation that is rarely
> >> utilized.  [This is much more interesting in a distcp context, but even then
> >> not that great.  distcp in my experience is usually used to push a bunch of
> >> files, so you get your parallelism at the file level.  Typically these are
> >> part files that are usually about the same size.]
> >>
> >>
> >>
> >