From: "dhruba borthakur (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Mon, 13 Sep 2010 01:59:35 -0400 (EDT)
Subject: [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Message-ID: <5475489.147781284357575257.JavaMail.jira@thor>
In-Reply-To: <492528515.1258656579663.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908630#action_12908630 ]

dhruba borthakur commented on HDFS-779:
---------------------------------------

I forgot to mention the basic premise for *delaying* the replication event. When a catastrophic event happens, the replicas (and datanodes) are actually alive and in good health; they are just unable to communicate with the namenode. So immediate replication does not make sense. It is better to wait for some time to see whether the existing replicas come back to life.

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both the primary and the (optional) secondary KDC go offline, causing a replication storm as the DataNodes' service tickets time out and they lose the ability to connect to the NameNode. However, this is a specific case of a more general problem of losing too many nodes too quickly. I think we should have an option to go into safe mode if the cluster size goes down by more than N% in terms of DataNodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
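
As a rough illustration of the proposal quoted above, a check along the following lines could decide when to enter safe mode. This is a minimal sketch, not the HDFS implementation and not any patch attached to HDFS-779: the class ClusterSizeMonitor, its method recordAndCheck, and both constructor parameters are hypothetical names, and comparing the current count against the peak seen in a sliding time window is only one assumed reading of "more than N%" and "too quickly".

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch: flags a sudden drop in the number of live DataNodes.
 * Nothing here is taken from the HDFS code base.
 */
final class ClusterSizeMonitor {

    /** One observation of the live-DataNode count at a point in time. */
    private static final class Sample {
        final long timeMillis;
        final int liveNodes;

        Sample(long timeMillis, int liveNodes) {
            this.timeMillis = timeMillis;
            this.liveNodes = liveNodes;
        }
    }

    private final double maxDropFraction; // the "N%" threshold, as a fraction (0.20 = 20%)
    private final long windowMillis;      // how far back "too quickly" looks (assumed)
    private final Deque<Sample> samples = new ArrayDeque<>();

    ClusterSizeMonitor(double maxDropFraction, long windowMillis) {
        this.maxDropFraction = maxDropFraction;
        this.windowMillis = windowMillis;
    }

    /**
     * Records the current live-DataNode count and reports whether the cluster
     * has shrunk by more than the threshold relative to the largest count
     * observed inside the window. A true result would be the cue to enter
     * safe mode instead of scheduling re-replication.
     */
    boolean recordAndCheck(long nowMillis, int liveNodes) {
        // Drop observations that have fallen out of the window.
        while (!samples.isEmpty()
                && nowMillis - samples.peekFirst().timeMillis > windowMillis) {
            samples.removeFirst();
        }
        samples.addLast(new Sample(nowMillis, liveNodes));

        // Largest cluster size seen within the window (includes the new sample).
        int peak = 0;
        for (Sample s : samples) {
            peak = Math.max(peak, s.liveNodes);
        }
        if (peak == 0) {
            return false; // no nodes were ever seen, so nothing has dropped
        }

        double drop = (peak - liveNodes) / (double) peak;
        return drop > maxDropFraction;
    }
}
```

With a 20% threshold and a one-minute window, going from 100 to 70 live DataNodes within 20 seconds trips the check, while a slow decline spread over several windows does not. That matches the premise in the comment: a sudden mass disappearance more likely means the nodes cannot reach the NameNode than that their replicas are gone.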