hadoop-common-user mailing list archives

From André Martin <m...@andremartin.de>
Subject Re: Performance / cluster scaling question
Date Fri, 21 Mar 2008 21:48:08 GMT
Right, I totally forgot about the replication factor... However, sometimes 
I have even noticed ratios of 5:1 for blocks to files...
Is the delay in block deletion/reclaiming intended behavior?
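
For reference, the arithmetic under discussion as a minimal standalone sketch.
The figures are the ones from the namenode WebUI quoted below, and it assumes,
as the replies below do, that the reported block count includes replicas:

    public class BlockCountCheck {
        public static void main(String[] args) {
            long filesAndDirs   = 423763;    // namenode WebUI after 24 hours
            long reportedBlocks = 1480735;   // namenode WebUI after 24 hours
            int  replication    = 3;         // dfs.replication (the default)

            long expectedWithReplicas = filesAndDirs * replication;   // 1271289
            long surplus = reportedBlocks - expectedWithReplicas;

            System.out.println("expected (files * replication): " + expectedWithReplicas);
            System.out.println("reported by the namenode:       " + reportedBlocks);
            // Any surplus would be invalidated blocks not yet reclaimed.
            System.out.println("surplus (pending deletion?):    " + surplus);
        }
    }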

Jeff Eastman wrote:
> That makes the math come out a lot closer (3*423763=1271289). I've also
> noticed there is some delay in reclaiming unused blocks, so what you are
> seeing in terms of block allocations does not surprise me.
>
>   
>> -----Original Message-----
>> From: André Martin [mailto:mail@andremartin.de]
>> Sent: Friday, March 21, 2008 2:36 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: Performance / cluster scaling question
>>
>> 3 - the default one...
>>
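
A quick way to confirm the replication factor a writing client actually uses,
sketched against the 0.16-era Configuration API; the property name
dfs.replication and its default of 3 come from the stock configuration, and
this is only an illustrative snippet:

    import org.apache.hadoop.conf.Configuration;

    public class ReplicationCheck {
        public static void main(String[] args) {
            // Loads hadoop-default.xml / hadoop-site.xml from the classpath,
            // so this reflects whatever the writing clients are configured with.
            Configuration conf = new Configuration();
            int replication = conf.getInt("dfs.replication", 3);
            System.out.println("dfs.replication = " + replication);
        }
    }
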
>> Jeff Eastman wrote:
>>> What's your replication factor?
>>> Jeff
>>>
>>>> -----Original Message-----
>>>> From: André Martin [mailto:mail@andremartin.de]
>>>> Sent: Friday, March 21, 2008 2:25 PM
>>>> To: core-user@hadoop.apache.org
>>>> Subject: Performance / cluster scaling question
>>>>
>>>> Hi everyone,
>>>> I ran a distributed system that consists of 50 spiders/crawlers and 8
>>>> server nodes, with a Hadoop DFS cluster of 8 datanodes and a namenode...
>>>> Each spider has 5 job-processing / data-crawling threads and puts the
>>>> crawled data as one complete file onto the DFS - additionally, splits
>>>> are created for each server node and put onto the DFS as files as well.
>>>> So basically there are 50*5*9 = ~2250 concurrent writes across 8
>>>> datanodes.
>>>> The splits are read by the server nodes and deleted afterwards, so
>>>> those (split) files exist for only a few seconds to minutes...
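
Roughly the write/delete pattern described above, as a minimal sketch against
the 0.16-era FileSystem API; the paths, payloads, and the single-threaded loop
are made-up placeholders rather than the poster's actual code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);   // the DFS from hadoop-site.xml

            // One complete file per crawled item (placeholder path and payload)...
            Path complete = new Path("/crawl/item-0001");
            FSDataOutputStream out = fs.create(complete);
            out.write("crawled content".getBytes());
            out.close();

            // ...plus one short-lived split file per server node.
            for (int node = 0; node < 8; node++) {
                Path split = new Path("/splits/item-0001.part" + node);
                FSDataOutputStream s = fs.create(split);
                s.write("split payload".getBytes());
                s.close();

                // In the real system a server node reads the split and then
                // removes it, so each split only lives for seconds to minutes.
                fs.delete(split);
            }
        }
    }
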
>>>> Since 99% of the files are smaller than 64 MB (the default block size),
>>>> I expected the number of files to be roughly equal to the number of
>>>> blocks. After running the system for 24 hours, the namenode WebUI shows
>>>> 423763 files and directories and 1480735 blocks. It looks like the
>>>> system does not catch up with deleting all the invalidated blocks - my
>>>> assumption?!?
>>>> Also, I noticed that the overall performance of the cluster goes down
>>>> (see attached image).
>>>> There are a bunch of "Could not get block locations. Aborting..."
>>>> exceptions, and those exceptions seem to appear more frequently towards
>>>> the end of the experiment.
>>>>
>>>>> java.io.IOException: Could not get block locations. Aborting...
>>>>>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824)
>>>>>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
>>>>>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
>>>> So, is the cluster simply saturated with such frequent creation and
>>>> deletion of files, or is the network the actual bottleneck? The
>>>> workload does not change at all during the whole experiment.
>>>> On the cluster side I see lots of the following exceptions:
>>>>
>>>>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
>>>>> PacketResponder 1 for block blk_6757062148746339382 terminating
>>>>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
>>>>> writeBlock blk_6757062148746339382 received exception java.io.EOFException
>>>>> 2008-03-21 20:28:05,411 ERROR org.apache.hadoop.dfs.DataNode:
>>>>> 141.xxx.xxx.xxx:50010:DataXceiver: java.io.EOFException
>>>>>     at java.io.DataInputStream.readInt(Unknown Source)
>>>>>     at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
>>>>>     at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
>>>>>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
>>>>>     at java.lang.Thread.run(Unknown Source)
>>>>> 2008-03-21 19:26:46,535 INFO org.apache.hadoop.dfs.DataNode:
>>>>> writeBlock blk_-7369396710977076579 received exception
>>>>> java.net.SocketException: Connection reset
>>>>> 2008-03-21 19:26:46,535 ERROR org.apache.hadoop.dfs.DataNode:
>>>>> 141.xxx.xxx.xxx:50010:DataXceiver: java.net.SocketException:
>>>>> Connection reset
>>>>>     at java.net.SocketInputStream.read(Unknown Source)
>>>>>     at java.io.BufferedInputStream.fill(Unknown Source)
>>>>>     at java.io.BufferedInputStream.read(Unknown Source)
>>>>>     at java.io.DataInputStream.readInt(Unknown Source)
>>>>>     at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
>>>>>     at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
>>>>>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
>>>>>     at java.lang.Thread.run(Unknown Source)
>>>>
>>>> I'm running Hadoop 0.16.1 - has anyone had the same or a similar
>>>> experience?
>>>> How can the performance degradation be avoided? More datanodes? Why
>>>> does the block deletion not seem to catch up with the deletion of the
>>>> files?
>>>> Thanks in advance for your insights, ideas & suggestions :-)
>>>>
>>>> Cu on the 'net,
>>>>                         Bye - bye,
>>>>
>>>>                                    <<<<< André <<<< >>>> èrbnA >>>>>
>
>
>   


