hadoop-hdfs-user mailing list archives

From Konstantin Boudnik <...@yahoo-inc.com>
Subject Re: why did I achieve such poor performance of HDFS
Date Tue, 04 Aug 2009 16:06:33 GMT
And here's some reading you can find useful:
http://www.facebook.com/note.php?note_id=53035052002&ref=mf

On 8/4/09 8:52 AM, Konstantin Boudnik wrote:
> Hi Hao.
> 
> 
> One more question for you - I should've asked it in my first email, though...
> What is your network speed/throughput on such massive reads WITHOUT HDFS in
> place? While I agree that ~14,000 KBps isn't much at all, I was wondering what
> the speed of 5000 simultaneous reads from a native file system over the same
> network would be.
> 
> Could such a test be arranged in your setup?
> 
> One more issue here is that in your first test the size of a file is smaller
> than the default HDFS block size (64MB, I think), which is likely to create
> significant overhead and hurt performance.
> 
> 1) To share your current test, you can simply create a new JIRA under
> https://issues.apache.org/jira/browse/ under 'test', or just send it to me as
> an attachment and I'll take care of the JIRA side. But I'd love to see the
> result of the other test I mentioned above, if possible.
> 
> 2) DFSClient does provide an API for random reads from a file, and this API is
> thread safe. However, my uneducated guess would be that it is the client's
> (your) responsibility to 'rebuild' the file in the correct order from the
> randomly read blocks. It is pretty much like any other filesystem out there:
> YOU have to know the sequence of the pieces of your file in order to
> reconstruct it from many concurrent reads.
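To illustrate the reassembly point, here is a sketch using plain java.nio against a local file rather than the DFSClient API (an assumption on my part that this maps over: Hadoop's positional read has the same shape). Each thread reads at a known offset, and those offsets are exactly what lets you put the pieces back in order after concurrent reads:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class ParallelPread {
    public static void main(String[] args) throws Exception {
        // Scratch file standing in for an HDFS file.
        Path file = Files.createTempFile("pread-demo", ".bin");
        byte[] original = new byte[1 << 20];            // 1MB of test data
        for (int i = 0; i < original.length; i++) original[i] = (byte) i;
        Files.write(file, original);

        final int chunk = 256 * 1024;                   // stand-in "block" size
        int nChunks = original.length / chunk;
        byte[] rebuilt = new byte[original.length];

        // FileChannel.read(ByteBuffer, position) is a positional read and is
        // safe to call from multiple threads concurrently.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            Thread[] workers = new Thread[nChunks];
            for (int i = 0; i < nChunks; i++) {
                final long offset = (long) i * chunk;
                workers[i] = new Thread(() -> {
                    try {
                        ByteBuffer buf = ByteBuffer.allocate(chunk);
                        long pos = offset;
                        while (buf.hasRemaining()) {
                            int n = ch.read(buf, pos);  // read at explicit offset
                            if (n < 0) break;
                            pos += n;
                        }
                        // The offset each thread was handed restores the order.
                        System.arraycopy(buf.array(), 0, rebuilt, (int) offset,
                                         buf.position());
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) t.join();
        }

        System.out.println(Arrays.equals(original, rebuilt)
                ? "reassembled OK" : "MISMATCH");
        Files.delete(file);
    }
}
```

The key design point: no coordination between readers is needed at read time, only the offset bookkeeping done up front.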
> 
> Hope it helps,
>    Konstantin
> 
> On 8/3/09 6:49 PM, Hao Gong wrote:
>> Hi Konstantin,
>>
>>     Thank you for your response.
>>     1. Yes. It is automated and can be reused easily by anyone, I think,
>> because I didn't change the HDFS code or parameters except for
>> "hadoop.tmp.dir" and "fs.default.name".
>>     2. Yes. I can share our test with the community. How should I do that?
>>
>>     By the way, I have a small question about HDFS.
>>     1. Is the HDFS client single-threaded or multi-threaded when it transmits
>> the blocks of a file? For example, if file A is 256MB in size, it is divided
>> into 4 blocks on 4 datanodes. When a client PUTs or GETs this file, is the
>> operation sequential (one block at a time) or simultaneous (the client GETs
>> the 4 blocks from the 4 datanodes at the same time)?
>>     In the client source, I used "FSDataInputStream.read(long position, byte[]
>> buffer, int offset, int length)" to GET the file.
>>
>>     Thanks very much.
>>
>> Best regards,
>> Hao Gong
>> Huawei Technologies Co., Ltd
>> ***********************************************
>> This e-mail and its attachments contain confidential information from
>> HUAWEI, which is intended only for the person or entity whose address is
>> listed above. Any use of the information contained herein in any way
>> (including, but not limited to, total or partial disclosure, reproduction,
>> or dissemination) by persons other than the intended recipient(s) is
>> prohibited. If you receive this e-mail in error, please notify the sender by
>> phone or email immediately and delete it!
>> ***********************************************
>> -----Original Message-----
>> From: Konstantin Boudnik [mailto:cos@yahoo-inc.com]
>> Sent: August 4, 2009 1:02
>> To: hdfs-user@hadoop.apache.org
>> Subject: Re: why did I achieve such poor performance of HDFS
>>
>> Hi Hao.
>>
>> Thanks for the observation. While I'll leave commenting on this particular
>> situation to someone who knows more about HDFS than I do, I would like to ask
>> you a couple of questions:
>>      - do you have that particular test in a completely separable form? I.e.
>> is it automated and can it be reused easily by someone else?
>>      - could you share this test with the rest of the community, through a
>> JIRA or otherwise?
>>
>> Thanks,
>>      Konstantin (aka Cos)
>>
>> On 8/3/09 12:59 AM, Hao Gong wrote:
>>> Hi all,
>>>
>>> I have used HDFS as a distributed storage system for an experiment. But in
>>> my tests, I found the performance of HDFS to be very poor.
>>>
>>> I ran two scenarios. 1) Middle-size file test: I PUT 200,000 middle-size
>>> files (20KB~20MB, chosen randomly) into HDFS, and triggered 10 clients to
>>> GET 5000 random files simultaneously. The average GET throughput per
>>> client was very poor (approximately less than 14,000 KBps). 2) Large-size
>>> file test: I PUT 20,000 large files (250MB~750MB, chosen randomly) into
>>> HDFS, and triggered 10 clients to GET 100 random files simultaneously. The
>>> average GET throughput per client was also very poor (approximately less
>>> than 12,500 KBps).
>>>
>>> So I'm puzzled by these experiments: why is the performance of HDFS so
>>> poor? The available throughput per client is far less than the limit of
>>> the network bandwidth. Is there any parameter I need to change to get
>>> better performance out of HDFS (I used the default parameter values)?
>>>
>>> My environment is listed as follows:
>>>
>>> 1) 30 common PC as HDFS slaves (core2 E7200, 4G ram, 1.5T hdd)
>>>
>>> 2) 10 common PC as HDFS clients (core2 E7200, 4G ram, 1.5T hdd)
>>>
>>> 3) A common PC as HDFS master (core2 E7200, 4G ram, 1.5T hdd)
>>>
>>> 4) A 1000M switch with links in a star network topology
>>>
>>> 5) The hadoop version is 0.20.0, JRE version is 1.6.0_11
>>>
>>> If anybody has researched the performance of HDFS, please contact me.
>>> Thank you very much.
>>>
>>> Best regards,
>>>
>>> Hao Gong
>>>
>>> Huawei Technologies Co., Ltd
>>>
> 

-- 
With best regards,
	Konstantin Boudnik (aka Cos)

        Yahoo! Grid Computing
        +1 (408) 349-4049

2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622
Attention! Streams of consciousness are disallowed

