hadoop-hdfs-user mailing list archives

From Hao Gong <haog...@huawei.com>
Subject re: why did I achieve such poor performance of HDFS
Date Wed, 05 Aug 2009 01:42:31 GMT
Hi Konstantin and Raghu,

  1. There may be a misunderstanding. We didn't trigger 5000 simultaneous
reads; we only started 10 clients, and each client triggered 5000 random reads
sequentially (one by one) in a single thread, so any given file is read by at
most 10 clients simultaneously.
  Our network speed is 1000 Mb/s, and we measured the throughput between any
two nodes at approximately 90 MB/s. We also tested the disk I/O with "iostat",
which showed approximately 80 MB/s; our disks are Seagate 7200.11 series,
1.5 TB.
  2. I will upload our test to JIRA as soon as possible.
  Thanks.
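Each client's read loop is roughly like the following sketch. Here
java.io.RandomAccessFile on a local file stands in for the positioned read of
HDFS's FSDataInputStream; the file size, read size, and seeds are illustrative
assumptions, not our actual test values.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.util.Random;

// One simulated client: 5000 sequential random positioned reads,
// single-threaded, as in our test. RandomAccessFile is a hypothetical
// local stand-in for FSDataInputStream.read(pos, buf, off, len).
public class RandomReadClient {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("hdfs-sim", ".dat");
        f.deleteOnExit();
        byte[] data = new byte[1 << 20];           // 1 MB test file (assumed size)
        new Random(42).nextBytes(data);
        try (RandomAccessFile w = new RandomAccessFile(f, "rw")) {
            w.write(data);
        }

        int reads = 5000;                          // reads per client, as in the test
        int chunk = 4096;                          // bytes per read (assumed)
        long total = 0;
        Random rnd = new Random(7);
        long start = System.nanoTime();
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            byte[] buf = new byte[chunk];
            for (int i = 0; i < reads; i++) {
                long pos = rnd.nextInt(data.length - chunk);
                in.seek(pos);                      // one read at a time, in order
                in.readFully(buf, 0, chunk);
                total += chunk;
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.println("read " + total + " bytes in " + secs + " s");
    }
}
```

Dividing the printed byte total by the elapsed time gives the per-client
throughput figure we reported.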

Best regards,
Hao Gong
Huawei Technologies Co., Ltd
***********************************************
This e-mail and its attachments contain confidential information from
HUAWEI, which is intended only for the person or entity whose address is
listed above. Any use of the information contained herein in any way
(including, but not limited to, total or partial disclosure, reproduction,
or dissemination) by persons other than the intended recipient(s) is
prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!
***********************************************
-----Original Message-----
From: Konstantin Boudnik [mailto:cos@yahoo-inc.com] 
Sent: August 5, 2009 0:07
To: hdfs-user@hadoop.apache.org
Subject: Re: why did I achieve such poor performance of HDFS

And here's some reading you may find useful:
http://www.facebook.com/note.php?note_id=53035052002&ref=mf

On 8/4/09 8:52 AM, Konstantin Boudnik wrote:
> Hi Hao.
> 
> 
> One more question for you - I should've asked it in my first email, though...
> What is your network speed/throughput on such massive reads WITHOUT HDFS in
> place? While I agree that ~14,000 KBps isn't that much at all, I was
> wondering what the speed of 5000 simultaneous reads from a native file
> system over the same network would be?
> 
> Could such a test be conducted in your setup?
> 
> One more issue here is that in your first test the size of a file is smaller
> than the default HDFS block size (64MB, I think), which is likely to create
> significant overhead and affect the performance.
> 
> 1) To share your current test you can simply create a new JIRA under
> https://issues.apache.org/jira/browse/ under 'test', or simply send it to me
> as an attachment and I'll take care of the JIRA stuff. But I'd love to see
> the result of the other test I've mentioned above if possible.
> 
> 2) DFSClient does provide an API for random reads from a file, and this API
> is thread safe. However, my uneducated guess would be that it is likely the
> responsibility of the client (you) to 'rebuild' the file from randomly read
> blocks in the correct order. It is pretty much like any other filesystem out
> there: YOU have to know the sequence of the pieces of your file in order to
> reconstruct it from many concurrent reads.
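That reassembly contract can be sketched as follows. RandomAccessFile on a
local file again stands in for FSDataInputStream (whose positioned read has
the same contract), and the block size and block count are assumed purely
for illustration.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

// Sketch: positioned reads are thread safe, but the caller must know each
// block's offset and write it into the right place to rebuild the file.
public class ParallelBlockRead {
    public static void main(String[] args) throws Exception {
        final int blockSize = 256 * 1024;          // assumed "block" size
        final int blocks = 4;
        byte[] original = new byte[blockSize * blocks];
        new Random(1).nextBytes(original);

        File f = File.createTempFile("blocks", ".dat");
        f.deleteOnExit();
        try (RandomAccessFile w = new RandomAccessFile(f, "rw")) {
            w.write(original);
        }

        final byte[] rebuilt = new byte[original.length];
        Thread[] readers = new Thread[blocks];
        for (int b = 0; b < blocks; b++) {
            final int block = b;
            readers[b] = new Thread(() -> {
                // each reader opens its own stream and fetches one block,
                // placing it at its known offset in the destination buffer
                try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
                    in.seek((long) block * blockSize);
                    in.readFully(rebuilt, block * blockSize, blockSize);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            readers[b].start();
        }
        for (Thread t : readers) t.join();

        System.out.println("match=" + Arrays.equals(original, rebuilt));
    }
}
```

The concurrency is safe because each thread writes a disjoint region of the
destination buffer; ordering is recovered from the offsets, not from read
completion order.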
> 
> Hope it helps,
>    Konstantin
> 
> On 8/3/09 6:49 PM, Hao Gong wrote:
>> Hi Konstantin,
>>
>>     Thank you for your response.
>>     1. Yes. It is automated and can be reused easily by anyone, I think,
>> because I didn't change the HDFS code or parameters except for
>> "hadoop.tmp.dir" and "fs.default.name".
>>     2. Yes. I can share our test with the community. How to do it now?
>>
>>     By the way, I have a little question about HDFS.
>>     1. Is the HDFS client single-threaded or multi-threaded when it
>> transmits the blocks of a file? For example, if file A is 256MB in size, it
>> is divided into 4 blocks on 4 datanodes. When a client PUTs or GETs this
>> file, is the operation sequential (one block at a time) or simultaneous
>> (the client GETs the 4 blocks from the 4 datanodes at the same time)?
>>     In the client source, I used "FSDataInputStream.read(long position,
>> byte[] buffer, int offset, int length)" to GET the file.
>>
>>     Thanks very much.
>>
>> Best regards,
>> Hao Gong
>> Huawei Technologies Co., Ltd
>> -----Original Message-----
>> From: Konstantin Boudnik [mailto:cos@yahoo-inc.com]
>> Sent: August 4, 2009 1:02
>> To: hdfs-user@hadoop.apache.org
>> Subject: Re: why did I achieve such poor performance of HDFS
>>
>> Hi Hao.
>>
>> Thanks for the observation. While I'll leave the chance to comment on the
>> particular situation to someone who knows more about HDFS than I do, I
>> would like to ask you a couple of questions:
>>      - do you have that particular test in a completely separable form?
>> I.e. is it automated and can it be reused easily by someone else?
>>      - could you share this test with the rest of the community through a
>> JIRA or elsewhere?
>>
>> Thanks,
>>      Konstantin (aka Cos)
>>
>> On 8/3/09 12:59 AM, Hao Gong wrote:
>>> Hi all,
>>>
>>> I have used HDFS as a distributed storage system in an experiment. But in
>>> my tests, I found that the performance of HDFS is very poor.
>>>
>>> I set up two scenarios. 1) Middle-size file test: I PUT 200,000
>>> middle-size files (20KB~20MB, randomly) into HDFS, and triggered 10
>>> clients to GET 5000 random files simultaneously. But the average GET
>>> throughput of a client was very poor (approximately less than 14,000
>>> KBps). 2) Large-size file test: I PUT 20,000 large files (250MB~750MB,
>>> randomly) into HDFS, and triggered 10 clients to GET 100 random files
>>> simultaneously. But the average GET throughput was also very poor
>>> (approximately less than 12,500 KBps).
>>>
>>> So I'm puzzled by these experiments: why is the performance of HDFS so
>>> poor? The available throughput of a client is far less than the limit of
>>> the network bandwidth. Is there any parameter I need to change for high
>>> performance in HDFS (I used the default parameter values)?
>>>
>>> My environment is listed as follows:
>>>
>>> 1) 30 common PC as HDFS slaves (core2 E7200, 4G ram, 1.5T hdd)
>>>
>>> 2) 10 common PC as HDFS clients (core2 E7200, 4G ram, 1.5T hdd)
>>>
>>> 3) A common PC as HDFS master (core2 E7200, 4G ram, 1.5T hdd)
>>>
>>> 4) 1000M switch and links in a star network architecture
>>>
>>> 5) The hadoop version is 0.20.0, JRE version is 1.6.0_11
>>>
>>> If anybody is researching the performance of HDFS, please contact me.
>>> Thank you very much.
>>>
>>> Best regards,
>>>
>>> Hao Gong
>>>
>>> Huawei Technologies Co., Ltd
>>>
> 

-- 
With best regards,
	Konstantin Boudnik (aka Cos)

        Yahoo! Grid Computing
        +1 (408) 349-4049

2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622
Attention! Streams of consciousness are disallowed

