hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Fri, 24 Sep 2010 07:57:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914375#action_12914375 ]

Eli Collins commented on HDFS-779:

IIUC the use case for "catastrophic mode" is that in the case of a network partition some
subset of active jobs could continue successfully because their blocks happen to be available.
Would love to hear data from real clusters, but my gut says this set of jobs is unlikely to
be interesting. Even if all the blocks a job accesses are in the partition that contains the
NN, most clusters have dependencies between jobs, so the probability that all jobs for a given
activity won't fail also seems low. This is about availability: trying to be more available
in doubly weird states doesn't necessarily make you more available (i.e., bugs in handling doubly
weird states make you less available than had you not entered them).

I think an accrual failure detector (based on # under-replicated blocks / total blocks) that
results in a drop to safe mode makes sense (i.e., drop into safe mode as soon as the system is
singly weird, based on a continuous data availability metric).
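
To make the proposal above concrete, here is a minimal sketch of a detector driven by the under-replicated / total blocks ratio. It is not HDFS code: the class name, the method names, the 10% threshold, and the choice to keep safe mode sticky until an operator intervenes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityMonitor:
    """Hypothetical monitor; names and defaults are illustrative, not HDFS APIs."""

    threshold: float = 0.10  # assumed cutoff on the suspicion level
    safe_mode: bool = False

    def suspicion(self, under_replicated: int, total_blocks: int) -> float:
        """Continuous data availability metric: under-replicated / total blocks."""
        if total_blocks <= 0:
            return 0.0  # nothing to measure on an empty namespace
        return under_replicated / total_blocks

    def observe(self, under_replicated: int, total_blocks: int) -> bool:
        """Record one sample and return whether the cluster is in safe mode."""
        if self.suspicion(under_replicated, total_blocks) >= self.threshold:
            # Assumption: once entered, safe mode stays on until reset manually.
            self.safe_mode = True
        return self.safe_mode
```

With these assumed defaults, a sample of 5 under-replicated blocks out of 1000 leaves the monitor out of safe mode, while 150 out of 1000 trips it.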

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
> As part of looking at using Kerberos, we want to avoid the case where both the primary
> (and optional secondary) KDC go offline, causing a replication storm as the DataNodes' service
> tickets time out and they lose the ability to connect to the NameNode. However, this is a
> specific case of a more general problem of losing too many nodes too quickly. I think we
> should have an option to go into safe mode if the cluster size goes down more than N% in terms
> of DataNodes.
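
The N% rule in the issue description can be sketched as a simple comparison between two samples of the live DataNode count. This is only an illustration: the function name and parameters are hypothetical, and the issue does not say over what time window the drop should be measured, so the caller is assumed to pick the sampling interval.

```python
def should_enter_safe_mode(previous_live: int, current_live: int,
                           max_drop_percent: float) -> bool:
    """Return True if the live DataNode count fell by more than max_drop_percent.

    Hypothetical helper: previous_live and current_live are two samples of the
    live DataNode count taken some assumed interval apart.
    """
    if previous_live <= 0:
        return False  # no baseline to compare against
    drop_percent = 100.0 * (previous_live - current_live) / previous_live
    # "More than N%" is read as a strict inequality.
    return drop_percent > max_drop_percent
```

For example, with N = 10, a drop from 100 to 85 live DataNodes would trigger safe mode, while a drop from 100 to 95 would not.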

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
