Date: Sat, 8 Oct 2011 13:50:29 +0000 (UTC)
From: "Ron Bodkin (Created) (JIRA)"
To: hdfs-dev@hadoop.apache.org
Subject: [jira] [Created] (HDFS-2420) improve handling of datanode timeouts

improve handling of datanode timeouts
-------------------------------------

                 Key: HDFS-2420
                 URL: https://issues.apache.org/jira/browse/HDFS-2420
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ron Bodkin

If a datanode ever times out on a heartbeat, it gets marked dead permanently. I am finding that on AWS this is a periodic occurrence, i.e., datanodes time out although the datanode process is still alive. The current solution is to kill and restart each such process individually.

It would be good if there were more retry logic (e.g., blacklisting the nodes but continuing to try heartbeats for a longer period before determining they are dead). It would also be good if refreshNodes would check for and attempt to recover timed-out datanodes.
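The retry behavior requested above could look roughly like the following. This is only a sketch of the idea, not HDFS code: the class HeartbeatTracker, its State enum, and both thresholds are hypothetical names invented for illustration. A node that has been silent for a short while is treated as suspect (blacklisted), it is declared dead only after a longer silence, and any later heartbeat brings it back to live, so no node is marked dead permanently.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration only; not part of HDFS.
 * Node state is derived from the time since the last heartbeat,
 * so a node that resumes heartbeating recovers automatically.
 */
class HeartbeatTracker {
    enum State { LIVE, SUSPECT, DEAD, UNKNOWN }

    private final long suspectAfterMs; // silence before a node is blacklisted
    private final long deadAfterMs;    // longer silence before a node is declared dead
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    HeartbeatTracker(long suspectAfterMs, long deadAfterMs) {
        if (suspectAfterMs <= 0 || deadAfterMs <= suspectAfterMs) {
            throw new IllegalArgumentException(
                "require 0 < suspectAfterMs < deadAfterMs");
        }
        this.suspectAfterMs = suspectAfterMs;
        this.deadAfterMs = deadAfterMs;
    }

    /** Record a heartbeat; this also revives a SUSPECT or DEAD node. */
    void recordHeartbeat(String nodeId, long nowMs) {
        lastHeartbeatMs.put(nodeId, nowMs);
    }

    /** Classify a node by how long it has been silent at time nowMs. */
    State stateOf(String nodeId, long nowMs) {
        Long last = lastHeartbeatMs.get(nodeId);
        if (last == null) {
            return State.UNKNOWN; // never heard from this node
        }
        long silenceMs = nowMs - last;
        if (silenceMs >= deadAfterMs) {
            return State.DEAD;
        }
        if (silenceMs >= suspectAfterMs) {
            return State.SUSPECT; // blacklisted, but still given a chance to recover
        }
        return State.LIVE;
    }
}
```

Timestamps are passed in explicitly rather than read from the system clock so the thresholds can be exercised deterministically. A refreshNodes-style recovery pass could, under the same assumptions, simply re-evaluate stateOf for every known node.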