hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Fri, 24 Sep 2010 17:25:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914547#action_12914547 ]

dhruba borthakur commented on HDFS-779:
---------------------------------------

I think all three of us (Eli, Rob and I) are saying that if a network partition occurs, it
is better if the system does not immediately start replicating and instead drops into a mode
that stops or delays replication, at least for a while. On this point, we all agree.

The related issue, on which there is still no consensus, is whether to fail existing jobs (via
safemode) when such a weird network partition occurs. Maybe we can defer that discussion to
another JIRA and let this JIRA focus on how to detect these weird network partitioning scenarios?

So, what should we do in the face of a catastrophe? Here is what Konstantin summarised earlier:

1. In the original proposal, a catastrophe is declared when
   num-failed-nodes / total-nodes > x%.
2. In Dhruba's proposal, a catastrophe is declared when the rate of failed nodes in the last
   time period is much higher than the rate in an earlier period.
3. In Konstantin's proposal, a catastrophe is declared when
   num-under-replicated-blocks / total-blocks > r.
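
To make the three criteria easier to compare, here is a minimal sketch of each as a
predicate over a snapshot of cluster counters. This is not NameNode code: the class
ClusterSnapshot, its field names, the function names and the spike_factor parameter are all
hypothetical. In particular, the thread does not say how much higher "much higher" is, so
spike_factor is an assumed multiplier.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical counters; none of these names come from HDFS itself."""
    total_nodes: int
    failed_nodes: int
    failures_recent_period: int   # node failures in the most recent time period
    failures_earlier_period: int  # node failures in an earlier period of equal length
    total_blocks: int
    under_replicated_blocks: int


def failed_fraction_exceeded(s: ClusterSnapshot, x_percent: float) -> bool:
    """Proposal 1: num-failed-nodes / total-nodes > x%."""
    if s.total_nodes == 0:
        return False
    return 100.0 * s.failed_nodes / s.total_nodes > x_percent


def failure_rate_spiked(s: ClusterSnapshot, spike_factor: float) -> bool:
    """Proposal 2: recent failure rate much higher than the earlier rate.

    spike_factor is an assumption standing in for "much higher".
    """
    # Treat a quiet earlier period as one failure so the comparison stays meaningful.
    baseline = max(s.failures_earlier_period, 1)
    return s.failures_recent_period > spike_factor * baseline


def under_replicated_fraction_exceeded(s: ClusterSnapshot, r: float) -> bool:
    """Proposal 3: num-under-replicated-blocks / total-blocks > r."""
    if s.total_blocks == 0:
        return False
    return s.under_replicated_blocks / s.total_blocks > r
```

Note that criteria 1 and 3 look at the current state of the cluster, while criterion 2
compares two time windows, so it can tell a sudden loss apart from slow attrition that
reaches the same failed-node count.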



> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both the primary
(and optional secondary) KDC go offline causing a replication storm as the DataNodes' service
tickets time out and they lose the ability to connect to the NameNode. However, this is a
specific case of a more general problem of losing too many nodes too quickly. I think we
should have an option to go into safe mode if the cluster size goes down more than N% in terms
of DataNodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

