hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-572) Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
Date Tue, 03 Oct 2006 17:35:21 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-572?page=comments#action_12439572 ] 
            
Konstantin Shvachko commented on HADOOP-572:
--------------------------------------------

I think the main problem with the heartbeats right now is the 1 minute timeout before they fail.
Reducing the timeout to, say, 3 seconds (or maybe 0 seconds; we will need to experiment with that)
could be an easy short-term solution.
First of all, this will randomize the data-nodes' access to the name-node and give them equal
chances to acknowledge their existence within the 10 minute interval.
Secondly, we will let other requests, the most frequent of which are lease extensions, go
through and succeed, which will eventually reduce the failure rate of map-reduce tasks.
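
As a rough illustration only (NameNodeClient and sendHeartbeat below are hypothetical
placeholders, not the actual data-node code), the idea is simply to give the heartbeat RPC a
much shorter per-call deadline than other requests and to treat a timeout as a soft failure
that is retried on the next cycle:

    // Hypothetical sketch of a data-node heartbeat loop with a short per-call timeout.
    // None of these names come from the Hadoop source; they only illustrate the idea.
    interface NameNodeClient {
      void sendHeartbeat(long timeoutMs) throws java.util.concurrent.TimeoutException;
    }

    public class HeartbeatLoop implements Runnable {
      private static final long HEARTBEAT_INTERVAL_MS = 3000; // send every 3 seconds
      private static final long HEARTBEAT_TIMEOUT_MS  = 3000; // give up after 3 seconds, not 60

      private final NameNodeClient nameNode;

      public HeartbeatLoop(NameNodeClient nameNode) {
        this.nameNode = nameNode;
      }

      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            // A timed-out heartbeat is simply dropped; the next one goes out on schedule,
            // so a slow name-node costs one missed beat instead of a 68 second stall.
            nameNode.sendHeartbeat(HEARTBEAT_TIMEOUT_MS);
          } catch (java.util.concurrent.TimeoutException e) {
            // acceptable: other data-nodes get their turn at the name-node
          }
          try {
            Thread.sleep(HEARTBEAT_INTERVAL_MS);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }
    }

The only point is that the deadline is attached to each heartbeat call and is much shorter
than the current 1 minute RPC timeout, so an unserved heartbeat does not block the next one.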

IMO, this will not increase the load on the name-node, because the name-node does not do any
extra work to reject timed-out requests. I expect the name-node replication rate to drop
automatically, since it will be processing other requests between individual block replications,
which is desirable. The data-nodes will have to live with a higher rate of TimeoutExceptions,
which is acceptable.

I like the idea of self-adjusting heartbeats. Why not: if a data-node observes a consistent 30%
rate of timed-out heartbeats, it should increase the heartbeat interval by 30%. And making the
adjusted parameters persistent is a good alternative to configuring them by hand.
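
As a minimal sketch of that adjustment rule (my own illustration; the window size and the 30%
threshold below are assumptions, not anything in the current code), the data-node could track
the timeout rate over a window of recent heartbeats and stretch its interval proportionally:

    // Hypothetical sketch of a self-adjusting heartbeat interval.
    // The window size and the 30% threshold are assumptions for illustration only.
    public class AdaptiveHeartbeatInterval {
      private static final int    WINDOW           = 100;  // heartbeats per observation window
      private static final double ADJUST_THRESHOLD = 0.30; // adjust once 30% of them time out

      private long intervalMs;
      private int  sent;
      private int  timedOut;

      public AdaptiveHeartbeatInterval(long initialIntervalMs) {
        this.intervalMs = initialIntervalMs;
      }

      /** Record the outcome of one heartbeat and return the interval to use next. */
      public synchronized long record(boolean wasTimeout) {
        sent++;
        if (wasTimeout) {
          timedOut++;
        }
        if (sent >= WINDOW) {
          double rate = (double) timedOut / sent;
          if (rate >= ADJUST_THRESHOLD) {
            // a consistent 30% timeout rate stretches the interval by roughly 30%
            intervalMs = Math.round(intervalMs * (1.0 + rate));
          }
          sent = 0;
          timedOut = 0;
        }
        return intervalMs;
      }
    }

The returned intervalMs is also what would be written to disk if we wanted the persistence
mentioned above, so a restarted data-node picks up its adjusted interval instead of the
configured default.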

I agree the request rates should not increase linearly with the cluster size. This makes self-adjustments
even more important, since optimizing configurations for different cluster sizes becomes a
non-trivial task.


> Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-572
>                 URL: http://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
>
> I've observed a cluster crash caused by simultaneous failure of only 3 data-nodes.
> The crash is reproducible. In order to reproduce it you need a rather large cluster.
> To simplify calculations I'll consider a 600 node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> starts replicating all the missing blocks that belonged to those nodes.
> The name-node can replicate 50 blocks per second on average based on experimental data.
> Meaning, it will take more than 10 minutes, which is the heartbeat expiration interval,
> to replicate all 30,000+ blocks.
> With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the
> name-node every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10 minute interval, and every time the
> probability of the heartbeat being unserved is 3/4 or less.
> So the probability of all 200 heartbeats failing is (3/4) ** 200 = 0 from a practical
> standpoint.
> IN FACT, since the current implementation sets the rpc timeout to 1 minute, a failed
> heartbeat takes 1 minute and 8 seconds to complete, and under these circumstances each
> data-node can send only 9 heartbeats during the 10 minute interval. Thus, the probability
> of all 9 of them failing is (3/4) ** 9 = 0.075, which means that we will lose 45 nodes
> out of 600 at the end of the 10 minute interval.
> From this point on the name-node will be constantly replicating blocks and losing more
> nodes, and becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even faster, because
> failing tasks and jobs keep trying to remove files and re-create them, increasing the
> overall load on the name-node.
> I see at least 2 problems that contribute to the chain reaction described above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.
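
To make the arithmetic in the description above easy to check, here is a small stand-alone
computation (it only recomputes the figures quoted there; nothing comes from the Hadoop code):

    // Reproduces the back-of-the-envelope numbers from the issue description.
    public class ChainReactionMath {
      public static void main(String[] args) {
        int nodes = 600;
        double heartbeatIntervalSec = 3.0;
        double heartbeatsPerSec = nodes / heartbeatIntervalSec;   // 200 hitting the name-node
        double servedPerSec = 50.0;                               // accepted under replication load
        double pUnserved = 1.0 - servedPerSec / heartbeatsPerSec; // 3/4 of heartbeats unserved

        // With 3 second heartbeats a node gets 200 tries in the 10 minute interval:
        double pAll200Fail = Math.pow(pUnserved, 200);            // ~1e-25, effectively 0

        // With a 68 second failed-heartbeat round trip only ~9 tries fit in 10 minutes:
        double pAll9Fail = Math.pow(pUnserved, 9);                // ~0.075
        double expectedLostNodes = nodes * pAll9Fail;             // ~45 of 600

        System.out.printf("unserved fraction   = %.2f%n", pUnserved);
        System.out.printf("P(all 200 fail)     = %.3e%n", pAll200Fail);
        System.out.printf("P(all 9 fail)       = %.3f%n", pAll9Fail);
        System.out.printf("expected nodes lost = %.1f of %d%n", expectedLostNodes, nodes);
      }
    }

Run as-is it prints roughly 0.75, 1e-25, 0.075, and 45, matching the figures in the report.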

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
