hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Fri, 24 Sep 2010 04:27:37 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914325#action_12914325 ]

dhruba borthakur commented on HDFS-779:

Thanks Robert for asking these questions. These are the same questions that came to my mind
when I wrote up my previous comment.

Just to clarify: 
a) safemode == "stopping replication" *AND* "making existing workloads fail". 
    If HDFS goes into safemode, then all existing jobs fail almost instantaneously.
b) catastrophic mode == "stopping/delaying replication" *BUT* allowing existing jobs to continue
to run.

> there is a danger of driving the system to a more weird state:

Exactly. My motivation for this JIRA is that if the system can detect a network partition,
then instead of starting to replicate, it should just wait and watch for some time. Otherwise,
the entire system will cause itself to thrash: the namenode becomes unresponsive (because the
under-replicated queue grows very big) and may exceed its allocated heap and crash. These are
bad "weird states". Instead, the namenode should delay/stop replication for a while.
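
To make the wait-and-watch idea concrete, here is a minimal sketch in Java. It is not NameNode code: the class ReplicationHoldOff, its methods, and the hold-off period are all hypothetical names chosen for illustration. It assumes that something else has already decided a partition is suspected.

```java
/**
 * Hypothetical sketch: defer re-replication for a hold-off period
 * after a suspected network partition. Not part of HDFS.
 */
final class ReplicationHoldOff {
    private final long holdOffMillis;
    // Replication is allowed once the current time reaches this value.
    private long resumeAtMillis = 0L;

    ReplicationHoldOff(long holdOffMillis) {
        this.holdOffMillis = holdOffMillis;
    }

    /** Called when many DataNodes vanish at once; starts or extends the wait. */
    void onSuspectedPartition(long nowMillis) {
        resumeAtMillis = Math.max(resumeAtMillis, nowMillis + holdOffMillis);
    }

    /** Called when the missing nodes have reported back; ends the wait early. */
    void onClusterRecovered() {
        resumeAtMillis = 0L;
    }

    /** True if under-replicated blocks may be scheduled for replication now. */
    boolean mayScheduleReplication(long nowMillis) {
        return nowMillis >= resumeAtMillis;
    }
}
```

Existing jobs are untouched by this gate, so it corresponds to "catastrophic mode" as defined above rather than to safemode.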

So, in a nutshell, Robert and I concur vehemently. The only debatable question is whether
to make the system retreat to safe mode or to catastrophic mode (as defined above). Rob: any

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
> As part of looking at using Kerberos, we want to avoid the case where both the primary
(and optional secondary) KDC go offline causing a replication storm as the DataNodes' service
tickets time out and they lose the ability to connect to the NameNode. However, this is a
specific case of a more general problem of losing too many nodes too quickly. I think we
should have an option to go into safe mode if the cluster size goes down more than N% in terms
of DataNodes.
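
The N% check described above could look roughly like the following Java sketch. The class ClusterSizeGuard and its parameters are hypothetical and do not come from the HDFS code base. The sketch compares the live DataNode count against a fixed baseline and ignores the "too quickly" part, which would need a time window.

```java
/**
 * Hypothetical sketch of the "N% drop" trigger. Not part of HDFS.
 */
final class ClusterSizeGuard {
    private final int baselineLiveNodes;
    private final double maxDropPercent; // the "N" in the description

    ClusterSizeGuard(int baselineLiveNodes, double maxDropPercent) {
        this.baselineLiveNodes = baselineLiveNodes;
        this.maxDropPercent = maxDropPercent;
    }

    /** Percentage of the baseline that is currently missing (never negative). */
    static double dropPercent(int baseline, int live) {
        if (baseline <= 0) {
            return 0.0; // no baseline yet, so nothing can have dropped
        }
        return 100.0 * Math.max(0, baseline - live) / baseline;
    }

    /** True if the cluster has shrunk by more than N% relative to the baseline. */
    boolean shouldEnterSafeMode(int liveNodes) {
        return dropPercent(baselineLiveNodes, liveNodes) > maxDropPercent;
    }
}
```

For example, with a baseline of 100 DataNodes and N = 20, a drop to 80 live nodes does not trip the guard, while a drop to 79 does.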

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
