hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Thu, 23 Sep 2010 20:29:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914213#action_12914213

dhruba borthakur commented on HDFS-779:

> 1. the load-induced-catastrophe can be solved by prioritizing heartbeats;

I agree. I have already started on it via HADOOP-6952.

> 2. postponing death declaration may make even more harm to the system in case of a real

There are two ways to fix this.  One way is to stop all existing workloads, declare the nodes
dead, and devote all bandwidth to re-replicating blocks. Typically, an alert will fire somewhere
and prompt the cluster administrator to fix the cause of the problem. This manual fix
could take a few minutes (e.g. replacing a faulty port on the switch) or it could take

If it takes a few minutes to repair the hardware, it is better for HDFS to continue servicing
existing workloads and not enter safemode. Within minutes (once the hardware fault is repaired)
the unreachable datanodes may well come back to life and things return to normal. If it
takes hours to repair the fault, then there is no real downside to starting the re-replication
process slightly later than usual.

Here is a scenario that explains the same situation from another angle. Suppose the NN suddenly
stops getting heartbeats from more than 80% of the datanodes on a balanced cluster. What,
in your opinion, should the NN do? Even if it enters safemode immediately and starts replicating,
it is unlikely that it will ever exit safemode, because there will be plenty of missing blocks
even after all possible replications are done. The only way it can exit safemode is when a
majority of those lost datanodes rejoin the cluster! So, my point is that it would be nice
if we could build some *heuristics* into the NN that can distinguish between a network partition
event and datanode process deaths and behave more intelligently.
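
To make the scenario above concrete, here is a rough sketch of what such a heuristic could
look like. It is only an illustration: the class, the method names and the thresholds are all
hypothetical, and nothing here is existing NN code. It assumes that a partition shows up as a
large fraction of datanodes going silent within one short time window, while process deaths
are fewer or more spread out in time. The second method estimates the fraction of blocks left
with no reachable replica, assuming replicas are placed uniformly at random (rack-aware
placement will change the number). With 80% of the datanodes lost and a replication factor of
3, that estimate is 0.8^3, or about 51% of the blocks.

```java
import java.util.Arrays;

/** Hypothetical sketch only; none of these names exist in HDFS. */
final class LossHeuristicSketch {

    enum Verdict { NORMAL, LIKELY_NODE_DEATHS, LIKELY_PARTITION }

    /**
     * Classifies a loss event from the last-heartbeat times of the silent datanodes.
     *
     * @param lastHeartbeatMillis last heartbeat time of each datanode that is currently silent
     * @param totalNodes          number of datanodes registered with the NN
     * @param fractionThreshold   silent fraction at or above which a partition is suspected
     * @param windowMillis        maximum spread of last-heartbeat times for a "sudden" loss
     */
    static Verdict classify(long[] lastHeartbeatMillis, int totalNodes,
                            double fractionThreshold, long windowMillis) {
        if (lastHeartbeatMillis.length == 0 || totalNodes <= 0) {
            return Verdict.NORMAL;
        }
        double silentFraction = (double) lastHeartbeatMillis.length / totalNodes;
        long earliest = Arrays.stream(lastHeartbeatMillis).min().getAsLong();
        long latest = Arrays.stream(lastHeartbeatMillis).max().getAsLong();
        boolean wentSilentTogether = (latest - earliest) <= windowMillis;

        // Assumption: many nodes vanishing at nearly the same time points to the network.
        if (silentFraction >= fractionThreshold && wentSilentTogether) {
            return Verdict.LIKELY_PARTITION;
        }
        return Verdict.LIKELY_NODE_DEATHS;
    }

    /**
     * Estimated fraction of blocks with every replica on a lost node, assuming
     * replicas are placed uniformly at random and independently of the failure.
     */
    static double expectedMissingFraction(double lostFraction, int replication) {
        return Math.pow(lostFraction, replication);
    }
}
```

On a LIKELY_PARTITION verdict the NN could, for example, hold off on declaring the nodes dead
for a while instead of starting re-replication right away; what it should actually do is
exactly the open question here.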

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
> As part of looking at using Kerberos, we want to avoid the case where both the primary
(and optional secondary) KDC go offline causing a replication storm as the DataNodes' service
tickets time out and they lose the ability to connect to the NameNode. However, this is a
specific case of a more general problem of losing too many nodes too quickly. I think we
should have an option to go into safe mode if the cluster size goes down more than N% in terms
of DataNodes.
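
The option described above could be sketched roughly as follows. This is a minimal
illustration, not a proposed patch: the class and method names are hypothetical, and it
assumes the drop is measured against the highest live DataNode count seen since safe mode
was last left, which the description does not specify.

```java
/** Hypothetical sketch only; none of these names exist in HDFS. */
final class ClusterSizeMonitorSketch {

    private final double maxDropFraction; // N% expressed as a fraction
    private int baselineLive;             // highest live count seen since the last reset
    private boolean safeMode;

    ClusterSizeMonitorSketch(int initialLive, double maxDropPercent) {
        this.baselineLive = initialLive;
        this.maxDropFraction = maxDropPercent / 100.0;
    }

    /**
     * Feed in the current number of live DataNodes.
     *
     * @return true if safe mode is on after taking this count into account
     */
    boolean onLiveCount(int live) {
        if (live > baselineLive) {
            baselineLive = live; // cluster growth raises the baseline
        }
        double drop = baselineLive == 0
                ? 0.0
                : (baselineLive - live) / (double) baselineLive;
        if (drop > maxDropFraction) {
            safeMode = true; // stays on until an administrator leaves safe mode
        }
        return safeMode;
    }

    /** Manual exit from safe mode; the current size becomes the new baseline. */
    void leaveSafeMode(int live) {
        safeMode = false;
        baselineLive = live;
    }
}
```

For example, with 100 live DataNodes and N = 20, a drop to 85 leaves the cluster in normal
operation, while a drop to 79 turns safe mode on.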

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
