hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2420) improve handling of datanode timeouts
Date Sun, 09 Oct 2011 22:15:29 GMT

https://issues.apache.org/jira/browse/HDFS-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123793#comment-13123793

Suresh Srinivas commented on HDFS-2420:

Can you please add the Affects Versions field to the jira?

The default keep-alive timeout is 600 seconds. That is a long time, given that the default heartbeat
interval is 3s. Also, the datanode does keep retrying to connect to the namenode, so I am not sure about
the issue you have reported here. Can you upload logs for the scenario in this bug?
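For context, a minimal sketch of where a figure in that range comes from: the NameNode declares a datanode dead only after an interval derived from two configuration properties, dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval. The formula and the default values below are assumptions based on stock hdfs-site.xml defaults, not taken from this thread.

```python
# Hedged sketch: how the NameNode derives the interval after which a
# datanode that stops heartbeating is declared dead. The defaults below
# are assumed stock values, not values confirmed in this JIRA.
heartbeat_interval_s = 3              # dfs.heartbeat.interval (seconds)
recheck_interval_ms = 5 * 60 * 1000   # dfs.namenode.heartbeat.recheck-interval (ms)

# Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
dead_node_interval_ms = 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000
print(dead_node_interval_ms / 1000)   # 630.0 seconds (~10.5 minutes)
```

Under these assumptions a single missed heartbeat does not mark a node dead; roughly ten minutes of silence does, which is why the scenario described below (transient AWS stalls) would have to be fairly long-lived to trigger it.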
> improve handling of datanode timeouts
> -------------------------------------
>                 Key: HDFS-2420
>                 URL: https://issues.apache.org/jira/browse/HDFS-2420
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ron Bodkin
> If a datanode ever times out on a heartbeat, it gets marked dead permanently. I am finding
> that on AWS this is a periodic occurrence, i.e., datanodes time out although the datanode
> process is still alive. The current solution to this is to kill and restart each such process.
> It would be good if there were more retry logic (e.g., blacklisting the nodes but trying
> heartbeats for a longer period before determining they are apparently dead). It would also
> be good if refreshNodes would check and attempt to recover timed-out datanodes.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

