hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-572) Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
Date Wed, 18 Oct 2006 18:03:37 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-572?page=all ]

Sameer Paranjpye updated HADOOP-572:

    Component/s: dfs

> Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
> ----------------------------------------------------------------------------------------
>                 Key: HADOOP-572
>                 URL: http://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
> I've observed a cluster crash caused by simultaneous failure of only 3 data-nodes.
> The crash is reproducable. In order to reproduce it you need a rather large cluster.
> To simplify calculations I'll consider a 600 node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> started replicating all missing blocks belonging to the nodes.
> The name-node can replicate 50 blocks per second on average based on experimental data.
> Meaning, it will take more than 10 minutes, which is the heartbeat expiration interval,
> to replicates all 30,000+ blocks.
> With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the name-node
every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10 minute interval, and every time the
> of the heartbeat being unserved is 3/4 or less.
> So the probability of failing of all 200 heartbeats is (3/4) ** 200 = 0 from the practical
> IN FACT since current implementation sets the rpc timeout to 1 minute, a failed heartbeat
> 1 minute and 8 seconds to complete, and under this circumstances each data-node can send
> 9 heartbeats during the 10 minute interval. Thus, the probability of failing of all 9
of them is 0.075,
> which means that we will loose 45 nodes out of 600 at the end of the 10 minute interval.
> From this point the name-node will be constantly replicating blocks and loosing more
nodes, and
> becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even faster, because
> tasks and jobs are trying to remove files and re-create them again increasing the overall
load on
> the name-node.
> I see at least 2 problems that contribute to the chain reaction described above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message