hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4935) Manual leaving of safe mode may lead to data lost
Date Wed, 24 Dec 2008 00:00:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658999#action_12658999 ]

Konstantin Shvachko commented on HADOOP-4935:
---------------------------------------------

By "pending deletion queue" you mean {{excessReplicateMap}} I believe.
This happens when data-nodes are flaky or busy and name-node looses them and replicates blocks
to other nodes, but the old data-nodes come back and report extra (old) replicas for blocks
that have already been re-replicated.
Thus the block becomes over-replicated and therefore is placed into {{excessReplicateMap}}.
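
To make the bookkeeping concrete, here is a simplified, hypothetical sketch (not the actual
{{FSNamesystem}} code; {{blockMap}}, {{reportReplica}}, and {{REPLICATION}} are made-up names)
of how a stale replica reported by a returning data-node ends up in {{excessReplicateMap}}:

{code:java}
import java.util.*;

public class ExcessReplicaSketch {
    static final int REPLICATION = 3;
    // blockId -> data-nodes believed to hold a replica
    static final Map<String, Set<String>> blockMap = new HashMap<>();
    // data-node -> blocks whose replica on that node is excess
    static final Map<String, Set<String>> excessReplicateMap = new HashMap<>();

    // A (possibly returning) data-node reports a replica of blockId.
    static void reportReplica(String blockId, String dataNode) {
        Set<String> holders =
            blockMap.computeIfAbsent(blockId, k -> new HashSet<>());
        holders.add(dataNode);
        if (holders.size() > REPLICATION) {
            // Over-replicated: record the extra replica as excess so it is
            // scheduled for deletion rather than counted as live.
            excessReplicateMap
                .computeIfAbsent(dataNode, k -> new HashSet<>())
                .add(blockId);
        }
    }
}
{code}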

The data-nodes delete the excess replicas, but before the deletions are reported back to the
name-node, the latter calls {{processMisReplicatedBlocks()}}, which clears {{excessReplicateMap}}
in an attempt to rebuild it from scratch.
The error is that once {{excessReplicateMap}} is cleared, all replicas become valid from the
name-node's point of view, even though some of them have already been deleted by the data-nodes.
So if there were 6 replicas of a block and the first three were deleted before
{{processMisReplicatedBlocks()}} ran, then after it clears {{excessReplicateMap}} the other
three can also be removed, leaving no replicas of the block at all.
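
The failure is easy to simulate in isolation. This is a hypothetical, self-contained demo (none
of the names are real Hadoop classes), modeling the worst case where the re-selection happens
to pick exactly the replicas that still exist on disk:

{code:java}
import java.util.*;

public class SafeModeRaceDemo {
    public static void main(String[] args) {
        final int replication = 3;
        // The name-node still believes all six nodes hold a replica.
        Set<String> believed = new LinkedHashSet<>(
            Arrays.asList("d1", "d2", "d3", "d4", "d5", "d6"));
        // d1..d3 were marked excess earlier and have already deleted their
        // copies, but the deletion reports are still in flight.
        Set<String> onDisk = new HashSet<>(Arrays.asList("d4", "d5", "d6"));

        // processMisReplicatedBlocks() clears the excess bookkeeping, so
        // all six believed replicas look equally valid and three of them
        // must be removed to reach the target replication factor.
        int toRemove = believed.size() - replication; // = 3
        // Worst case: the three chosen are exactly the three live replicas.
        List<String> chosen = Arrays.asList("d4", "d5", "d6").subList(0, toRemove);

        onDisk.removeAll(chosen); // the scheduled deletions execute
        System.out.println("Replicas left on disk: " + onDisk); // prints []
    }
}
{code}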


> Manual leaving of safe mode may lead to data lost
> -------------------------------------------------
>
>                 Key: HADOOP-4935
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4935
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.3
>            Reporter: Hairong Kuang
>            Assignee: Konstantin Shvachko
>             Fix For: 0.18.3
>
>
> Due to HADOOP-4610, the NameNode recalculates mis-replicated blocks when leaving safe mode
> manually, and it clears the pending deletion queue before doing the calculation. This works
> fine when the NameNode has just started, but introduces a bug once the NameNode has been
> running for a while. Clearing the pending deletion queue makes the NameNode unable to
> distinguish valid replicas from invalid ones, i.e., those that have been scheduled or
> dispatched for deletion. Therefore, the NameNode may mistakenly decide that a block is
> over-replicated and choose all of its valid replicas for deletion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

