hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-2144) Data node process consumes 180% cpu
Date Fri, 02 Nov 2007 20:46:51 GMT

[ https://issues.apache.org/jira/browse/HADOOP-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539697 ]

rangadi edited comment on HADOOP-2144 at 11/2/07 1:46 PM:
---------------------------------------------------------------

> With 128MB blocks, it should be possible to see this with 'iostat -x' or 'sar', reporting
once per second. If they're sync'd, then you'd expect to see one drive at 100% busy and the
others at 0% busy, with the busy drive switching every three seconds. If they're out of sync,
then you'd expect the drives to mostly be 100% busy, but some to occasionally be idle.

In my test, I ran 'iostat -x 3' with 4 clients; most of the time only 2 disks were busy (50
- 70% util). This is also because the 4-client test was CPU bound.
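To see which disks are busy at a glance, the '%util' column of 'iostat -x' can be filtered with awk. The sample report below is fabricated for illustration (column layout varies across sysstat versions; here '%util' is the last field):

```shell
# Hypothetical 'iostat -x 3' report; numbers are invented for illustration.
cat > /tmp/iostat_sample.txt <<'EOF'
Device:  rrqm/s wrqm/s   r/s  w/s  rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda        0.00   0.00 220.0  0.0 56320.0    0.0   256.00     1.90  8.60  4.10 62.00
sdb        0.00   0.00 190.0  0.0 48640.0    0.0   256.00     1.50  7.90  3.80 54.00
sdc        0.00   0.00   1.0  0.0   256.0    0.0   256.00     0.01  1.00  1.00  0.40
sdd        0.00   0.00   0.0  0.0     0.0    0.0     0.00     0.00  0.00  0.00  0.00
EOF

# Flag disks over 50% utilization; $NF is the %util column in this layout.
awk '/^sd/ { if ($NF + 0 > 50) printf "%s busy (%.0f%%)\n", $1, $NF }' /tmp/iostat_sample.txt
```

With the sample above, only sda and sdb are reported busy, matching the "only 2 disks busy" observation.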

In the ideal case, where there are no bottlenecks other than the disks, a stable equilibrium
would be reading from all the disks. Assume the clients all start synchronized. Then all of
them read the same kernel buffer from one disk, at the best speed for one disk. As soon as
there is some disturbance and client A goes ahead of client B, the larger the separation,
the larger the difference in throughput becomes (after some initial threshold). So the gap
keeps getting worse until the clients end up reading from different disks, at which point
both catch up in speed.
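The runaway-divergence argument above can be illustrated with a toy model (all numbers are invented for illustration, not measured): the trailing client streams at the full rate while it stays within the shared cache window, and drops to a slower disk-bound rate once it falls outside it, so past the threshold the gap widens every step:

```shell
# Toy model of two clients reading the same block. 'fast' and 'slow' rates
# and the cache 'window' are assumed numbers, purely illustrative.
awk 'BEGIN {
  fast = 100; slow = 60; window = 128      # MB/s, MB/s, MB
  gap  = 130                               # a small disturbance already past the window
  for (t = 1; t <= 6; t++) {
    brate = (gap <= window) ? fast : slow  # trailing client loses the cached data
    gap  += fast - brate                   # leader pulls further ahead each step
    printf "t=%d gap=%d MB\n", t, gap
  }
}'
```

Once the separation exceeds the cache window the gap grows monotonically (170, 210, ... MB in this sketch), until a disk/block boundary lets the trailing client catch up.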

> Data node process consumes 180% cpu 
> ------------------------------------
>
>                 Key: HADOOP-2144
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2144
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Runping Qi
>
> I did a test on DFS read throughput and found that the data node
> process consumes up to 180% cpu when it is under heavy load. Here are the details:
> The cluster has 380+ machines, each with 3GB mem and 4 cpus and 4 disks.
> I copied a 10GB file to dfs from one machine with a data node running there.
> Based on the dfs block placement policy, that machine has one replica of each block of the file.
> Then I ran 4 of the following commands in parallel:
> hadoop dfs -cat thefile > /dev/null &
> Since all the blocks have a local replica, all the read requests went to the local data node.
> I observed that:
>     The data node process's cpu usage was around 180% for most of the time.
>     The clients' cpu usage was moderate (as it should be).
>     All four disks were working concurrently with comparable read throughput.
>     The total read throughput maxed out at 90MB/sec, about 60% of the expected
>     aggregate max read throughput of the 4 disks (160MB/sec). Thus the disks were not a bottleneck
>     in this case.
> The data node's cpu usage seems unreasonably high.
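The reported test could be reproduced with something like the sketch below. The HDFS path and cluster setup are assumptions from the description; the second half is a self-contained local analogue of "4 parallel readers" using plain `cat`, useful only for checking the job-control pattern, not DFS throughput:

```shell
# On a cluster (hypothetical path, assumes the 10GB file was copied in):
#   for i in 1 2 3 4; do hadoop dfs -cat /thefile > /dev/null & done; wait
#
# Self-contained local analogue with a small scratch file:
dd if=/dev/zero of=/tmp/readtest.bin bs=1M count=8 2>/dev/null
for i in 1 2 3 4; do
  cat /tmp/readtest.bin > /dev/null &   # four concurrent readers
done
wait                                    # block until all readers finish
echo "all readers done"
```

While this runs, per-disk utilization and datanode CPU can be watched with 'iostat -x 3' and 'top' as described in the comment above.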

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

