hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Fri, 23 Jul 2010 23:55:52 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891874#action_12891874 ]

Konstantin Shvachko commented on HDFS-779:
------------------------------------------

I think we should target the ratio of under-replicated blocks to total blocks directly,
rather than relying on the percentage of dead data-nodes.
Suppose I add 20% new data-nodes and half of them die within an hour. That is, 10% of
data-nodes failed, but the new nodes are still mostly empty, so the failures won't trigger much
replication on the cluster, and there is no need to enter safe mode.

NN keeps a count of under-replicated blocks and a count of the total number of blocks in the
system. 
So it can automatically enter safe mode if the ratio of under-replicated to total blocks reaches
1/10 (high mark).
It can then go back out of safe mode if the ratio drops back to 1/100 (low mark).
This solution is more in the spirit of safe-mode, which has always been about block replicas
rather than nodes.
In the typical case this should be equivalent to the percentage of dead DNs.
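
A minimal sketch of the high/low mark logic, to make the hysteresis concrete. This is not
NameNode code: the class name ReplicationSafeModeMonitor, the update() method, and its
parameters are hypothetical, and a real implementation would presumably read the two block
counts from the NN's own bookkeeping.

```java
// Hypothetical illustration only; none of these names exist in HDFS.
final class ReplicationSafeModeMonitor {
    private final double highMark; // enter safe mode when the ratio reaches this
    private final double lowMark;  // leave safe mode when the ratio drops back to this
    private boolean inSafeMode = false;

    ReplicationSafeModeMonitor(double highMark, double lowMark) {
        if (!(lowMark >= 0.0 && lowMark < highMark && highMark <= 1.0)) {
            throw new IllegalArgumentException("require 0 <= lowMark < highMark <= 1");
        }
        this.highMark = highMark;
        this.lowMark = lowMark;
    }

    /**
     * Feed the current block counts and return whether safe mode should be on.
     * Between the two marks the previous state is kept, so the cluster does
     * not flap in and out of safe mode around a single threshold.
     */
    boolean update(long underReplicatedBlocks, long totalBlocks) {
        if (totalBlocks <= 0) {
            return inSafeMode; // no blocks to judge by (assumption: keep current state)
        }
        double ratio = (double) underReplicatedBlocks / totalBlocks;
        if (!inSafeMode && ratio >= highMark) {
            inSafeMode = true;
        } else if (inSafeMode && ratio <= lowMark) {
            inSafeMode = false;
        }
        return inSafeMode;
    }
}
```

With marks of 0.10 and 0.01, a cluster at 5% under-replicated blocks stays out of safe mode,
enters it at 15%, remains in it when the ratio falls back to 5%, and leaves only once the
ratio is at or below 1%.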


> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>
> As part of looking at using Kerberos, we want to avoid the case where both the primary
(and optional secondary) KDC go offline causing a replication storm as the DataNodes' service
tickets time out and they lose the ability to connect to the NameNode. However, this is a
specific case of a more general problem of losing too many nodes too quickly. I think we
should have an option to go into safe mode if the cluster size goes down more than N% in terms
of DataNodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

