hadoop-hdfs-issues mailing list archives

From "Robert Chansler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Wed, 25 Aug 2010 04:16:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902289#action_12902289 ]

Robert Chansler commented on HDFS-779:

I, too, think it should be all about the number of replicas that are missing. Not only are the
observed problems associated with the difficulty of urgently recreating many replicas, but the
number of missing replicas is a satisfactory surrogate for the number of missing nodes, without
having to consider the complexity of counting (or not counting) decommissioned nodes and nodes
missing for a long time, or worrying about cluster topology.

What should the retreat threshold be? Consider a large cluster with 100 racks of 40 nodes,
each node holding 50,000 block replicas. About once a day a node will go bad, so a routine
loss of replicas is 50,000 at a time. It's unlikely a second node will fail in the few minutes
it takes to time out the missing node and recreate the missing replicas (remember, the other
3,999 nodes work at this task). If three nodes (150,000 replicas) are missing, almost certainly
something worse than the uncorrelated loss of nodes has occurred. So the first threshold to
consider is more like 0.075% rather than 10%.
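The arithmetic above can be sketched out as follows. This is illustration only, not HDFS code; the class and method names are made up for the example cluster described here:

```java
// Back-of-the-envelope numbers for the example cluster in this comment:
// 100 racks x 40 nodes, 50,000 block replicas per node. Illustrative only;
// these are not actual HDFS classes or configuration values.
public class RetreatThreshold {
    static final long RACKS = 100;
    static final long NODES_PER_RACK = 40;
    static final long REPLICAS_PER_NODE = 50_000;

    // Total replicas: 100 * 40 * 50,000 = 200 million.
    static long totalReplicas() {
        return RACKS * NODES_PER_RACK * REPLICAS_PER_NODE;
    }

    // Fraction of all replicas missing if `nodes` uncorrelated nodes fail.
    static double missingFraction(long nodes) {
        return (double) (nodes * REPLICAS_PER_NODE) / totalReplicas();
    }

    public static void main(String[] args) {
        // Three simultaneous node failures: 150,000 replicas, 0.075% of the system.
        System.out.printf("3 nodes -> %.3f%%%n", 100 * missingFraction(3));
        // One lost rack (40 nodes): 2 million replicas, 1% of the system.
        System.out.printf("1 rack  -> %.3f%%%n", 100 * missingFraction(NODES_PER_RACK));
    }
}
```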

If a rack switch is lost, 2 million replicas are instantly missing, 1% of the system. Left
alone, the system will recover in about an hour. How does that compare to the time to alert
an administrator and repair the switch? It may be close, and if the system retreats to safe
mode, an hour or so of service to users will be lost. (Service would be degraded during replication.)
So a policy of no retreat if _only_ a rack is lost can make sense.

But if a failed PDU takes out multiple racks (or a slice of the cluster) the cluster is hosed.
Blocks are certainly lost, and recovery is impossible. It makes good sense to retreat to safe
mode if more than a rack's worth of replicas are lost.
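One way to read the tiered policy sketched in the last two paragraphs, as a hypothetical decision function (the threshold constant and method names are invented for illustration, not actual HDFS safe-mode logic):

```java
// Hypothetical policy sketch; HDFS's real safe-mode logic is different.
// Encodes the argument above: losses up to one rack's worth of replicas
// (about 1% of the example cluster) are survivable without retreating,
// anything beyond that likely means correlated failure (e.g. a failed PDU).
public class RetreatPolicy {
    // One rack of 40 nodes out of 4,000 is ~1% of the system's replicas.
    static final double RACK_FRACTION = 0.01;

    /** Retreat to safe mode only when more than a rack's worth is missing. */
    static boolean shouldRetreat(double missingReplicaFraction) {
        return missingReplicaFraction > RACK_FRACTION;
    }

    public static void main(String[] args) {
        System.out.println("3 nodes (0.075%): " + shouldRetreat(0.00075)); // routine loss
        System.out.println("1 rack  (1%):     " + shouldRetreat(0.01));    // borderline, no retreat
        System.out.println("2 racks (2%):     " + shouldRetreat(0.02));    // retreat
    }
}
```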

Having retreated, the usual rules for leaving safe mode seem just right, after checking that
the system won't immediately retreat again!

But what if an administrator, having diagnosed the hardware situation, wants to command the
system to leave safe mode? There must be some way to evade an immediate retreat. Supposing
the retreat parameter is a startup configuration, a simple solution might be a new
administration command that changes the run-time value of the retreat parameter.

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
> As part of looking at using Kerberos, we want to avoid the case where both the primary
> (and optional secondary) KDC go offline causing a replication storm as the DataNodes' service
> tickets time out and they lose the ability to connect to the NameNode. However, this is a
> specific case of a more general problem of losing too many nodes too quickly. I think we
> should have an option to go into safe mode if the cluster size goes down more than N% in terms
> of DataNodes.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
