hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-572) Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
Date Wed, 04 Oct 2006 18:07:22 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-572?page=comments#action_12439925 ] 
            
Doug Cutting commented on HADOOP-572:
-------------------------------------

Devaraj: I think your proposal is part of the long-term solution.  A short-term fix will simply
be to control the rate of replication and increase the heartbeat interval.  Longer-term we
should consider the sort of inverted control you propose.

> Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-572
>                 URL: http://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
>
> I've observed a cluster crash caused by simultaneous failure of only 3 data-nodes.
> The crash is reproducable. In order to reproduce it you need a rather large cluster.
> To simplify calculations I'll consider a 600 node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> started replicating all missing blocks belonging to the nodes.
> The name-node can replicate 50 blocks per second on average based on experimental data.
> Meaning, it will take more than 10 minutes, which is the heartbeat expiration interval,
> to replicates all 30,000+ blocks.
> With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the name-node
every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10 minute interval, and every time the
probability
> of the heartbeat being unserved is 3/4 or less.
> So the probability of failing of all 200 heartbeats is (3/4) ** 200 = 0 from the practical
standpoint.
> IN FACT since current implementation sets the rpc timeout to 1 minute, a failed heartbeat
takes
> 1 minute and 8 seconds to complete, and under this circumstances each data-node can send
only
> 9 heartbeats during the 10 minute interval. Thus, the probability of failing of all 9
of them is 0.075,
> which means that we will loose 45 nodes out of 600 at the end of the 10 minute interval.
> From this point the name-node will be constantly replicating blocks and loosing more
nodes, and
> becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even faster, because
failing
> tasks and jobs are trying to remove files and re-create them again increasing the overall
load on
> the name-node.
> I see at least 2 problems that contribute to the chain reaction described above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message