From hadoop-dev-return-4092-apmail-lucene-hadoop-dev-archive=lucene.apache.org@lucene.apache.org Tue Oct 03 17:57:38 2006 Return-Path: Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org Received: (qmail 82712 invoked from network); 3 Oct 2006 17:57:38 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 3 Oct 2006 17:57:38 -0000 Received: (qmail 90615 invoked by uid 500); 3 Oct 2006 17:57:37 -0000 Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org Received: (qmail 90579 invoked by uid 500); 3 Oct 2006 17:57:36 -0000 Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-dev@lucene.apache.org Delivered-To: mailing list hadoop-dev@lucene.apache.org Received: (qmail 90529 invoked by uid 99); 3 Oct 2006 17:57:36 -0000 Received: from idunn.apache.osuosl.org (HELO idunn.apache.osuosl.org) (140.211.166.84) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 03 Oct 2006 10:57:36 -0700 X-ASF-Spam-Status: No, hits=0.0 required=5.0 tests= Received: from [209.237.227.198] ([209.237.227.198:44482] helo=brutus.apache.org) by idunn.apache.osuosl.org (ecelerity 2.1.1.8 r(12930)) with ESMTP id AC/F6-08153-D84A2254 for ; Tue, 03 Oct 2006 10:57:33 -0700 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 397517142DA for ; Tue, 3 Oct 2006 10:57:24 -0700 (PDT) Message-ID: <10860973.1159898244232.JavaMail.root@brutus> Date: Tue, 3 Oct 2006 10:57:24 -0700 (PDT) From: "Doug Cutting (JIRA)" To: hadoop-dev@lucene.apache.org Subject: [jira] Commented: (HADOOP-572) Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes. In-Reply-To: <27006103.1159840519548.JavaMail.root@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N [ http://issues.apache.org/jira/browse/HADOOP-572?page=comments#action_12439593 ] Doug Cutting commented on HADOOP-572: ------------------------------------- > IMO, this will not increase the load on the name-node ... It depends on where the bottlenecks are in the namenode. For example, if heartbeats are already using 75% of its capacity, and we want replications to use the last 25%, then vastly increasing the heartbeat rate will starve the replications. To my thinking, we should design things so that we don't see timeouts in normal operation (except when trying to contact nodes that are malfunctioning). In particular, we shouldn't use timeouts as a primary control mechanism. Also note that there are different kinds of timeouts. There are connect timeouts, which mean that the server never saw the request. Response timeouts, however, usually mean that the server has recieved the request and will eventually respond to it, but just not in the time you're willing to wait. In the latter case, the server won't notice that the client has timed out until *after* the response has been computed. So, if you're going to retry sooner, you should only do so after a connect timeout, and even then I'd argue that this is a poor solution. > Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes. > ---------------------------------------------------------------------------------------- > > Key: HADOOP-572 > URL: http://issues.apache.org/jira/browse/HADOOP-572 > Project: Hadoop > Issue Type: Bug > Affects Versions: 0.6.2 > Environment: Large dfs cluster > Reporter: Konstantin Shvachko > > I've observed a cluster crash caused by simultaneous failure of only 3 data-nodes. > The crash is reproducable. In order to reproduce it you need a rather large cluster. > To simplify calculations I'll consider a 600 node cluster as an example. > The cluster should also contain a substantial amount of data. > We will need at least 3 data-nodes containing 10,000+ blocks each. > Now suppose that these 3 data-nodes fail at the same time, and the name-node > started replicating all missing blocks belonging to the nodes. > The name-node can replicate 50 blocks per second on average based on experimental data. > Meaning, it will take more than 10 minutes, which is the heartbeat expiration interval, > to replicates all 30,000+ blocks. > With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the name-node every second. > Under heavy replication load the name-node accepts about 50 heartbeats per second. > So at most 3/4 of all heartbeats remain unserved. > Each node SHOULD send 200 heartbeats during the 10 minute interval, and every time the probability > of the heartbeat being unserved is 3/4 or less. > So the probability of failing of all 200 heartbeats is (3/4) ** 200 = 0 from the practical standpoint. > IN FACT since current implementation sets the rpc timeout to 1 minute, a failed heartbeat takes > 1 minute and 8 seconds to complete, and under this circumstances each data-node can send only > 9 heartbeats during the 10 minute interval. Thus, the probability of failing of all 9 of them is 0.075, > which means that we will loose 45 nodes out of 600 at the end of the 10 minute interval. > From this point the name-node will be constantly replicating blocks and loosing more nodes, and > becomes effectively dysfunctional. > A map-reduce framework running on top of it makes things deteriorate even faster, because failing > tasks and jobs are trying to remove files and re-create them again increasing the overall load on > the name-node. > I see at least 2 problems that contribute to the chain reaction described above. > 1. A heartbeat failure takes too long (1'8"). > 2. Name-node synchronized operations should be fine-grained. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira