hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-3368) Missing blocks due to bad DataNodes coming up and down.
Date Fri, 04 May 2012 07:43:18 GMT

     [ https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-3368:
--------------------------------------

         Description: All replicas of a block can be removed if bad DataNodes come up and
down during cluster restart resulting in data loss.  (was: All replicas of a block can be
removed if bad DataNodes come up and down during cluter restart resulting in data loss.)
    Target Version/s: 0.22.1, 2.0.0, 3.0.0  (was: 3.0.0, 2.0.0, 0.22.1)

- A block b has 3 replicas, initially located on DNs do1, do2, do3.
- At different times all three nodes malfunctioned and died, causing the replicas to be
migrated to dn1, dn2, dn3.
- do1, do2, do3 were not added to the exclude list, so when the cluster restarts they are
brought up along with dn1, dn2, dn3.
- NN sees 6 replicas of block b and correctly decides to remove 3 of them.
{{BlockPlacementPolicyDefault.chooseReplicaToDelete()}} selects three targets for deletion
based on the free space remaining on the DNs deemed to possess replicas.
dn1, dn2, dn3 are the most likely targets for replica deletion because they have been
on the cluster longer than do1, do2, do3 and are therefore likely to have less free space.
- As expected, do1, do2, do3 malfunction again and go down shortly after reporting their
blocks to NN.
- It takes about 10 minutes for NN to recognize that do1, do2, do3 are dead. By that
time the replicas have already been removed from the good nodes, resulting in data loss.
This is a real story seen in production.
I verified that all major versions are affected.
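
The sequence above can be reproduced with a toy model. The sketch below is not HDFS
code: the class ReplicaLossSketch, the Node type, the free-space numbers and the
method names are all hypothetical. It assumes excess replicas are chosen purely by
least free space, a simplification of {{chooseReplicaToDelete()}}. It also assumes the
dead-node timeout is 2 * recheck interval + 10 * heartbeat interval, which with the
commonly cited defaults (5 minutes and 3 seconds) gives roughly the 10 minutes
mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the scenario described above; not the actual NameNode logic. */
final class ReplicaLossSketch {

    /** Hypothetical stand-in for a DataNode that reported a replica of block b. */
    static final class Node {
        final String name;
        final long freeBytes;
        final boolean flaky; // true for do1..do3, which die again after reporting

        Node(String name, long freeBytes, boolean flaky) {
            this.name = name;
            this.freeBytes = freeBytes;
            this.flaky = flaky;
        }
    }

    /** Assumed policy: delete excess replicas from the nodes with the least free space. */
    static List<Node> chooseExcess(List<Node> holders, int replication) {
        List<Node> byFreeSpace = new ArrayList<>(holders);
        byFreeSpace.sort(Comparator.comparingLong((Node n) -> n.freeBytes));
        int excess = Math.max(0, holders.size() - replication);
        return new ArrayList<>(byFreeSpace.subList(0, excess));
    }

    /** Returns the names of the nodes still holding block b after the flaky nodes die. */
    static Set<String> simulate() {
        List<Node> holders = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            holders.add(new Node("do" + i, 900L, true));  // freshly returned, mostly empty
            holders.add(new Node("dn" + i, 100L, false)); // long-running, fuller
        }
        List<Node> deleted = chooseExcess(holders, 3);
        Set<String> survivors = new TreeSet<>();
        for (Node n : holders) {
            if (!n.flaky && !deleted.contains(n)) {
                survivors.add(n.name);
            }
        }
        return survivors;
    }

    /** Assumed dead-node timeout: 2 * recheck interval + 10 * heartbeat interval. */
    static long heartbeatExpireMillis(long recheckMillis, long heartbeatSeconds) {
        return 2 * recheckMillis + 10 * 1000 * heartbeatSeconds;
    }

    public static void main(String[] args) {
        System.out.println("replicas left: " + simulate());
        System.out.println("expiry ms: " + heartbeatExpireMillis(300_000L, 3L));
    }
}
```

In this model every replica selected for deletion sits on a healthy node, so once the
flaky nodes disappear no copy of block b remains.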
                
> Missing blocks due to bad DataNodes coming up and down.
> -------------------------------------------------------
>
>                 Key: HDFS-3368
>                 URL: https://issues.apache.org/jira/browse/HDFS-3368
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down during cluster
> restart, resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
