hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HDFS-779) Automatic move to safe-mode when cluster size drops
Date Mon, 13 Sep 2010 07:48:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908629#action_12908629 ]

dhruba borthakur edited comment on HDFS-779 at 9/13/10 3:47 AM:
----------------------------------------------------------------

I would like to work on this one. A similar issue hit our cluster when a bad job started creating
millions of files within a few minutes, causing the NN to become so sluggish handling create
transactions that it could not process heartbeats from datanodes. Of the 3000 datanodes in our
cluster, about 1000 lost heartbeats with the namenode, thus causing a replication storm.

1. The key is to detect the difference between a network-partition event and datanode processes
actually dying due to normal wear-and-tear. A few datanodes die every day, so there is a pretty
well-known rate of failure of datanodes. For these cases, it is best if the NN starts replicating
blocks as soon as it detects that a datanode is dead. On the other hand, once in a while a
catastrophic event occurs (e.g. a network partition, or a bad job causing denial-of-service for
datanode heartbeats) that makes a bunch of datanodes disappear all at once. For these catastrophic
scenarios, it is better if the NN does not start replication right away, otherwise it causes a
replication storm that actually increases the duration of the downtime; instead it should wait for
a slightly extended time E to see if the catastrophic situation rectifies itself. It should not
fall into safe mode either, otherwise all existing jobs will fail; instead it should delay
replication for the time period determined by E (a sketch of this delay policy follows this list).

2. The failure mode here is a *datanode* (not a block or a set of blocks), so it makes sense
to base this heuristic on the number of datanodes rather than the number of blocks. The number
of missing blocks is not a proxy for the number of failed datanodes, because the datanodes in the
cluster need not be in a balanced state when the catastrophic event occurred: some datanodes could
be very full while other newly provisioned datanodes could be relatively empty.

3. The distinguishing factor between a catastrophic event and a regular one is tightly correlated
with the rate of failure of datanodes. In a typical cluster of 3000 nodes, we see maybe 2 datanodes
dying every day. A catastrophic event, on the other hand, triggers the loss of many hundreds of
datanodes within a span of a few minutes.
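
Here is a minimal sketch of the delayed-replication idea from point 1. It is purely illustrative:
ReplicationDelayPolicy, onDatanodeDead and mayScheduleReplication are names I made up, not existing
NameNode or BlockManager members; it only captures the intended behaviour of the delay E.

// Illustrative sketch only; class and method names are hypothetical, not part of the NN code.
class ReplicationDelayPolicy {
  private final long replicationDelayMs;   // the slightly-extended time E (configurable)
  private long deferredUntil = 0;          // time before which re-replication is held back

  ReplicationDelayPolicy(long replicationDelayMs) {
    this.replicationDelayMs = replicationDelayMs;
  }

  // Called when a datanode is declared dead.
  synchronized void onDatanodeDead(boolean catastrophicEventDetected) {
    if (catastrophicEventDetected) {
      // Do not enter safe mode (existing jobs keep running); just push out the
      // earliest time at which re-replication of the lost replicas may start.
      deferredUntil = System.currentTimeMillis() + replicationDelayMs;
    }
  }

  // The replication monitor consults this before scheduling re-replication work.
  synchronized boolean mayScheduleReplication() {
    return System.currentTimeMillis() >= deferredUntil;
  }
}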

My proposal is as follows:

Let the namenode remember the count of datanodes whose heartbeats were lost during the previous
day as well as during the current day:
  N1 = number of datanodes that died yesterday
  N2 = number of datanodes that have died so far today
  If (N2/N1 > m), then we declare it a catastrophic event.

When a catastrophic event is detected, the namenode delays replication for a slightly extended
(configurable) period of time.
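
A minimal sketch of this detection heuristic follows. Again, it is illustrative only: the counter
roll-over and the multiplier m would have to be maintained by the namenode, and none of these
names exist in the current code.

// Illustrative sketch only; names are hypothetical.
class DeadDatanodeRateTracker {
  private final double m;       // configurable multiplier, e.g. 10
  private int diedYesterday;    // N1: datanodes that died yesterday
  private int diedToday;        // N2: datanodes that have died so far today

  DeadDatanodeRateTracker(double m) {
    this.m = m;
  }

  // Called whenever the namenode declares a datanode dead.
  synchronized void recordDeadDatanode() {
    diedToday++;
  }

  // Called once a day to roll the counters over.
  synchronized void rollDay() {
    diedYesterday = diedToday;
    diedToday = 0;
  }

  // Catastrophic if today's death count exceeds m times yesterday's baseline
  // (using a baseline of at least 1 to avoid dividing by zero).
  synchronized boolean isCatastrophicEvent() {
    int baseline = Math.max(diedYesterday, 1);
    return (double) diedToday / baseline > m;
  }
}

When isCatastrophicEvent() returns true, the namenode would hold back replication for the
configurable extended period (the delay E from point 1) instead of entering safe mode.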


> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both the primary
> (and optional secondary) KDC go offline, causing a replication storm as the DataNodes' service
> tickets time out and they lose the ability to connect to the NameNode. However, this is a
> specific case of a more general problem of losing too many nodes too quickly. I think we
> should have an option to go into safe mode if the cluster size goes down more than N% in terms
> of DataNodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

