Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Fri, 4 May 2012 07:43:18 +0000 (UTC)
From: "Konstantin Shvachko (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: 
 <1911477551.26100.1336117398309.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <2086549363.26065.1336115568729.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HDFS-3368) Missing blocks due to bad DataNodes
 coming up and down.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-3368:
--------------------------------------

         Description: All replicas of a block can be removed if bad DataNodes come up and down during cluster restart resulting in data loss.  (was: All replicas of a block can be removed if bad DataNodes come up and down during cluter restart resulting in data loss.)
    Target Version/s: 0.22.1, 2.0.0, 3.0.0  (was: 3.0.0, 2.0.0, 0.22.1)

- A block b has 3 replicas initially located on DNs do1, do2, do3.
- At different times all three nodes malfunctioned and died, causing the replicas to be migrated to dn1, dn2, dn3.
- do1, do2, do3 were not added to the exclude list.
- When the cluster restarts, do1, do2, do3 are brought up along with dn1, dn2, dn3.
- NN sees 6 replicas for block b and correctly decides to remove 3 of them.
{{BlockPlacementPolicyDefault.chooseReplicaToDelete()}} selects three targets for deletion based on the free space remaining on the DNs deemed to possess replicas.
dn1, dn2, dn3 are the most likely targets for replica deletion because they have been on the cluster longer than do1, do2, do3 and are therefore likely to have less free space.
- As expected, do1, do2, do3 malfunction again and go down shortly after reporting their blocks to NN.
- It will take 10 minutes for NN to recognize that do1, do2, do3 are dead. By that time the replicas will have been removed from the good nodes, resulting in data loss.
This is a real story seen in production.
I verified that all major versions are affected.
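
The sequence above can be illustrated with a minimal sketch. This is not the actual {{BlockPlacementPolicyDefault}} code: the class ExcessReplicaSketch, its methods, and the free-space figures are all hypothetical, and the selection rule is reduced to "least free space first", ignoring rack placement and anything else the real policy may check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcessReplicaSketch {

    /** Hypothetical stand-in for a DataNode that reported a replica of block b. */
    static final class Node {
        final String name;
        final long freeGb; // made-up free space, for illustration only

        Node(String name, long freeGb) {
            this.name = name;
            this.freeGb = freeGb;
        }
    }

    /** Simplified rule (an assumption): delete from the nodes with the least free space. */
    static List<String> chooseReplicasToDelete(List<Node> holders, int excess) {
        List<Node> sorted = new ArrayList<>(holders);
        sorted.sort(Comparator.comparingLong((Node n) -> n.freeGb));
        List<String> victims = new ArrayList<>();
        for (Node n : sorted.subList(0, Math.min(excess, sorted.size()))) {
            victims.add(n.name);
        }
        return victims;
    }

    /** Counts replicas that are neither deleted nor sitting on a dead node. */
    static int liveReplicas(List<Node> holders, List<String> deleted, Set<String> dead) {
        int live = 0;
        for (Node n : holders) {
            if (!deleted.contains(n.name) && !dead.contains(n.name)) {
                live++;
            }
        }
        return live;
    }

    /** After restart: the bad nodes come back nearly empty, the good nodes are fuller. */
    static List<Node> scenario() {
        return Arrays.asList(
            new Node("do1", 800), new Node("do2", 850), new Node("do3", 900),
            new Node("dn1", 100), new Node("dn2", 120), new Node("dn3", 150));
    }

    public static void main(String[] args) {
        List<Node> holders = scenario();
        // Six replicas reported, three wanted: three are excess.
        List<String> deleted = chooseReplicasToDelete(holders, holders.size() - 3);
        // The bad nodes go down again shortly after their block reports.
        Set<String> dead = new HashSet<>(Arrays.asList("do1", "do2", "do3"));
        System.out.println("deleted from: " + deleted);   // [dn1, dn2, dn3]
        System.out.println("live replicas left: " + liveReplicas(holders, deleted, dead)); // 0
    }
}
```

With these assumed numbers the three good nodes are chosen for deletion, so once do1, do2, do3 die again no live replica of b remains.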
                
> Missing blocks due to bad DataNodes coming up and down.
> -------------------------------------------------------
>
>                 Key: HDFS-3368
>                 URL: https://issues.apache.org/jira/browse/HDFS-3368
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down during cluster restart resulting in data loss.
