hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Mon, 13 Sep 2010 05:55:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908629#action_12908629 ]

dhruba borthakur commented on HDFS-779:

I would like to work on this one. A similar issue hit our cluster when a bad job started creating
millions of files within a few minutes, causing the NN to get very busy with create transactions
and unable to process heartbeats from datanodes. Out of the 3000 datanodes in our
cluster, about 1000 lost heartbeats with the namenode, thus causing a replication storm.

1. From my perspective, the key is to detect the difference between a network-partition event
and datanode processes actually dying due to normal wear-and-tear. A few datanodes die
every day, so there is a pretty well-known rate of failure of datanodes. For these cases,
it is best if the NN starts replicating blocks as soon as it detects that a datanode is dead.
On the other hand, once in a while a catastrophic event occurs (e.g. a network partition, or a
bad job causing denial-of-service for datanode heartbeats) that causes a bunch of datanodes to
disappear all at once. For these catastrophic scenarios, it is better if the NN does not start
replication right away, because that causes a replication storm; instead it should wait for a
slightly extended time E to see if the catastrophic situation rectifies itself, otherwise the
replication storm actually increases the duration of the downtime. It should not fall into
safe mode either, otherwise all existing jobs will fail; instead it should delay replication
for the time period determined by E.

2. The failure mode here is a *datanode* (not a block or a set of blocks), so it makes sense
to base this heuristic on the number of datanodes rather than the number of blocks.
The number of failed blocks is not a proxy for the number of failed datanodes, because the
datanodes in the cluster need not be in a balanced state all the time.

3. The distinguishing factor between a catastrophic event and a regular one seems to be highly
correlated with the rate of failure of datanodes. In a typical cluster of 3000 nodes, we see
maybe 2 datanodes dying every day. On the other hand, a catastrophic event triggers the loss
of many hundreds of datanodes within a span of a few minutes.

My proposal is as follows:

Let the namenode remember the count of "lost heartbeat from datanodes" for the previous day as
well as for the current day:
  N1 = number of datanodes that died yesterday
  N2 = number of datanodes that have died so far today
  If (N2/N1 > m) then we declare it a catastrophic event.

When a catastrophic event is detected, the namenode delays replication for a slightly extended
(configurable) period of time.
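A minimal sketch of this heuristic in Java. The class name, fields, and methods here are hypothetical (none of this is existing NameNode code); m and the delay E are assumed to be configurable values supplied by the operator:

```java
// Hypothetical sketch of the proposed heuristic: compare today's datanode
// deaths (N2) against yesterday's (N1); if the ratio exceeds a configurable
// multiplier m, delay re-replication by a configurable extended period E.
public class DeadDatanodeMonitor {
    private final int multiplierM;    // m: e.g. 10 means 10x yesterday's rate
    private final long delayMillisE;  // E: extra replication delay in millis
    private int diedYesterdayN1;      // N1: datanodes that died yesterday
    private int diedTodayN2;          // N2: datanodes that died so far today

    public DeadDatanodeMonitor(int multiplierM, long delayMillisE) {
        this.multiplierM = multiplierM;
        this.delayMillisE = delayMillisE;
    }

    // Called whenever the NN declares a datanode dead (lost heartbeat).
    public synchronized void recordDeadDatanode() {
        diedTodayN2++;
    }

    // Called once a day to roll today's count into yesterday's slot.
    public synchronized void rollDay() {
        diedYesterdayN1 = diedTodayN2;
        diedTodayN2 = 0;
    }

    public synchronized boolean isCatastrophic() {
        // Guard against division by zero on a day with no prior failures.
        int baseline = Math.max(diedYesterdayN1, 1);
        return diedTodayN2 / (double) baseline > multiplierM;
    }

    // Extra delay to apply before scheduling re-replication of the blocks
    // that were hosted on a newly dead datanode; zero in the normal case.
    public synchronized long replicationDelayMillis() {
        return isCatastrophic() ? delayMillisE : 0L;
    }
}
```

With m = 10 and yesterday's count at 2 (the typical daily rate mentioned above), losing 50 datanodes today trips the catastrophic path and replication is held back by E, while losing 3 does not.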

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
> As part of looking at using Kerberos, we want to avoid the case where both the primary
> (and optional secondary) KDC go offline, causing a replication storm as the DataNodes'
> service tickets time out and they lose the ability to connect to the NameNode. However,
> this is a specific case of a more general problem of losing too many nodes too quickly.
> I think we should have an option to go into safe mode if the cluster size goes down more
> than N% in terms of DataNodes.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
