Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Thu, 3 Apr 2014 02:32:16 +0000 (UTC)
From: "Tsz Wo Nicholas Sze (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12526325.1318081783232.55493.1396492336787@arcas>
In-Reply-To: <JIRA.12526325.1318081783232@arcas>
References: <JIRA.12526325.1318081783232@arcas>
Subject: [jira] [Resolved] (HDFS-2420) improve handling of datanode timeouts
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HDFS-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze resolved HDFS-2420.
---------------------------------------

    Resolution: Not a Problem

I guess that this is not a problem anymore. Please feel free to reopen this if I am wrong. Resolving ...

> improve handling of datanode timeouts
> -------------------------------------
>
>                 Key: HDFS-2420
>                 URL: https://issues.apache.org/jira/browse/HDFS-2420
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ron Bodkin
>
> If a datanode ever times out on a heartbeat, it gets marked dead permanently. I am finding that on AWS this is a periodic occurrence, i.e., datanodes time out although the datanode process is still alive. The current workaround is to kill and restart each such process individually.
> It would be good if there were more retry logic (e.g., blacklisting the nodes but continuing to try heartbeats for a longer period before declaring them dead). It would also be good if refreshNodes would check for and attempt to recover timed-out datanodes.
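
For context on the timeout discussed in the description above, here is a minimal sketch of how the dead-node window is commonly understood to be derived from two HDFS settings, dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval. The class HeartbeatExpiry and its method names are hypothetical and not part of Hadoop, and the default values shown are assumptions taken from hdfs-default.xml; check them against the release in use.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hadoop. It mirrors the arithmetic the
// NameNode is understood to use when deciding that a datanode is dead.
final class HeartbeatExpiry {
    // Assumed defaults from hdfs-default.xml; verify for your release.
    static final long DEFAULT_RECHECK_INTERVAL_MS = 5 * 60 * 1000L; // dfs.namenode.heartbeat.recheck-interval
    static final long DEFAULT_HEARTBEAT_INTERVAL_SEC = 3L;          // dfs.heartbeat.interval

    // Window without heartbeats after which a datanode is treated as dead:
    // 2 * recheck interval + 10 * heartbeat interval.
    static long expireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * TimeUnit.SECONDS.toMillis(heartbeatIntervalSec);
    }

    // True when the last heartbeat is older than the expiry window.
    static boolean isDead(long lastHeartbeatMs, long nowMs, long expireIntervalMs) {
        return nowMs - lastHeartbeatMs > expireIntervalMs;
    }

    public static void main(String[] args) {
        long expiry = expireIntervalMs(DEFAULT_RECHECK_INTERVAL_MS, DEFAULT_HEARTBEAT_INTERVAL_SEC);
        System.out.println("Datanode considered dead after " + expiry + " ms without a heartbeat");
    }
}
```

Under these assumed defaults the window is 10 minutes 30 seconds. Raising either setting lengthens the window, which is one way to tolerate longer pauses on a flaky network, at the cost of slower detection of nodes that really are down.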

