hadoop-general mailing list archives

From Gautam Singaraju <gautam.singar...@gmail.com>
Subject Re: Why single thread for HDFS?
Date Tue, 06 Jul 2010 07:30:06 GMT
To add to Jay Booth's points, adding multi-threaded capability to HDFS
can actually bring performance down. Consider a low-end commodity
production server running 4-5 jobs. Currently, that means 4-5 threads
reading from and writing to the hard disk. Making each read and write
multi-threaded could create many more threads (up to Number of Jobs *
Default HDFS Block size in MB * 1024 KB / file system block size in KB).
On a low-end hard disk with a limited RPM, more threads mean more head
seeks, which decreases performance. As the number of parallel disk
accesses increases from 1, throughput increases at first, but after 3-4
parallel accesses it starts to decrease. You can use performance
analysis tools (like IOMeter) to identify the *ideal* number of parallel
disk accesses for a given hard disk.
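
A minimal Python sketch of both ideas, for anyone who wants a rough
number before reaching for a real benchmarking tool. It is not taken
from HDFS: the function names are hypothetical, and the 64 MB HDFS
block size and 4 KB file system block size are assumed defaults. Note
that a small test file will mostly be served from the OS page cache, so
meaningful disk numbers need a file larger than RAM or a dedicated tool.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def worst_case_threads(jobs, hdfs_block_mb=64, fs_block_kb=4):
    """Thread count from the formula: jobs * HDFS block (MB) * 1024 KB / FS block (KB)."""
    return jobs * hdfs_block_mb * 1024 // fs_block_kb


def read_slice(path, offset, length, chunk=1 << 20):
    """Read `length` bytes starting at `offset`; return the number of bytes read."""
    done = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while done < length:
            data = f.read(min(chunk, length - done))
            if not data:
                break
            done += len(data)
    return done


def throughput_by_readers(path, reader_counts=(1, 2, 4, 8)):
    """Return {readers: MB/s} with the file split evenly among that many reader threads."""
    size = os.path.getsize(path)
    results = {}
    for n in reader_counts:
        part = size // n
        # The last reader also takes the remainder of the file.
        slices = [(i * part, part if i < n - 1 else size - i * part) for i in range(n)]
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            total = sum(pool.map(lambda s: read_slice(path, *s), slices))
        elapsed = time.perf_counter() - start
        results[n] = total / (1 << 20) / max(elapsed, 1e-9)
    return results


def make_test_file(size_mb=8):
    """Create a temporary file of random bytes and return its path (caller deletes it)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for _ in range(size_mb):
            f.write(os.urandom(1 << 20))
        return f.name


if __name__ == "__main__":
    print("worst-case threads for 4 jobs:", worst_case_threads(4))
    path = make_test_file()
    try:
        for n, mbps in throughput_by_readers(path).items():
            print(f"{n} readers: {mbps:.1f} MB/s")
    finally:
        os.remove(path)
```

With the assumed defaults, 4 jobs give 4 * 64 * 1024 / 4 = 65536
threads in the worst case, which is the scale the formula is warning
about. The reader counts to try are a parameter, so the 3-4 access
range mentioned above can be probed directly on the disk in question.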

---
Gautam



On Mon, Jul 5, 2010 at 8:46 PM, elton sky <eltonsky9404@gmail.com> wrote:
> >> Basically, your point is that hadoop dfs -cp is relatively slow and could
> >> be made faster.  If HDFS had a more multi-threaded design, it would make cp
> >> operations faster.
> What I mean is, if we know the size of a file, we can parallelize by calculating
> blocks. Otherwise we couldn't.
>
>
> On Tue, Jul 6, 2010 at 10:47 AM, Allen Wittenauer
> <awittenauer@linkedin.com> wrote:
>
>>
>> On Jul 5, 2010, at 5:01 PM, elton sky wrote:
>> > Well, this sounds good when you have many small files, you concat() them
>> > into a big one. I am talking about split a big file into blocks and copy all
>> > a few blocks in parallel.
>>
>> Basically, your point is that hadoop dfs -cp is relatively slow and could
>> be made faster.  If HDFS had a more multi-threaded design, it would make cp
>> operations faster.
>>
>> This sounds like a particularly high cost for an operation that is rarely
>> utilized.  [This is much more interesting in a distcp context, but even then
>> not that great.  distcp in my experience is usually used to push a bunch of
>> files, so you get your parallelism at the file level.  Typically these are
>> part files that are usually about the same size.]
>>
>>
>>
>
