hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-572) Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
Date Wed, 04 Oct 2006 17:59:20 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-572?page=comments#action_12439923 ] 
Devaraj Das commented on HADOOP-572:

How about this approach:
* The DataNodes register once with the NameNode after the latter comes up (and registers just
once - no heartbeats).
* Whenever the NameNode requires the services of a DataNode (to store/delete blocks), it pings
the chosen DataNode to see whether it is indeed alive. As a response to the "ping", the DataNode
sends its latest status (block report, free disk space, etc.). The response can also be controlled
- if the DataNode sent its status once in the last 30 secs, it doesn't send it again.
* For each DataNode, the NameNode maintains a list of the blocks it hosts.
* A separate thread in the NameNode pings the set of known DataNodes and whenever it is not
able to ping a particular DataNode, it issues the replication requests of the blocks that
were there in that datanode (for this it takes the help of the inverse mapping from a "block"
to the set of DataNodes containing that block).

A directory walk may also be a thing to look at - the NameNode walks the file system hierarchy
and for each file it pings the set of all DataNodes containing the blocks of that file but
in this case, we will end up pinging the same DataNode multiple times. To avoid this, the
(DataNode->{blocks}) mapping can be used.

> Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
> ----------------------------------------------------------------------------------------
>                 Key: HADOOP-572
>                 URL: http://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
> I've observed a cluster crash caused by simultaneous failure of only 3 data-nodes.
> The crash is reproducable. In order to reproduce it you need a rather large cluster.
> To simplify calculations I'll consider a 600 node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> started replicating all missing blocks belonging to the nodes.
> The name-node can replicate 50 blocks per second on average based on experimental data.
> Meaning, it will take more than 10 minutes, which is the heartbeat expiration interval,
> to replicates all 30,000+ blocks.
> With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the name-node
every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10 minute interval, and every time the
> of the heartbeat being unserved is 3/4 or less.
> So the probability of failing of all 200 heartbeats is (3/4) ** 200 = 0 from the practical
> IN FACT since current implementation sets the rpc timeout to 1 minute, a failed heartbeat
> 1 minute and 8 seconds to complete, and under this circumstances each data-node can send
> 9 heartbeats during the 10 minute interval. Thus, the probability of failing of all 9
of them is 0.075,
> which means that we will loose 45 nodes out of 600 at the end of the 10 minute interval.
> From this point the name-node will be constantly replicating blocks and loosing more
nodes, and
> becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even faster, because
> tasks and jobs are trying to remove files and re-create them again increasing the overall
load on
> the name-node.
> I see at least 2 problems that contribute to the chain reaction described above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message