hadoop-common-dev mailing list archives

From "Hairong Kuang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1
Date Mon, 01 Dec 2008 19:13:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652102#action_12652102 ]

Hairong Kuang commented on HADOOP-4742:
---------------------------------------

Yes, I think this is indeed a problem. The proposed solution should fix it.

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6,
> 16 datanodes and 1 namenode, each with 8 GB of memory and a 4-core CPU, connected by Gigabit Ethernet
>            Reporter: Wang Xu
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.3
>
>
> We recently deployed a 0.18.1 cluster and ran some tests. We found that
> if we corrupt a block, the namenode detects it and re-replicates it as soon as
> a client reads that block. However, at the same time the namenode deletes a
> healthy replica (the source of the re-replication). (I think this
> issue may affect the whole 0.18 branch.)
> After some tracing, I found that FSNamesystem.addStoredBlock()
> checks the number of replicas after adding the block to blocksMap:
>  |  NumberReplicas num = countNodes(storedBlock);
>  |  int numLiveReplicas = num.liveReplicas();
>  |  int numCurrentReplica = numLiveReplicas
>  |    + pendingReplications.getNumReplicas(block);
> which means both the live replicas and the pending replications are
> counted. But at the end of FSNamesystem.blockReceived(), which
> calls addStoredBlock(), addStoredBlock() is invoked first and only
> afterwards is the pendingReplications count reduced:
>  |    //
>  |    // Modify the blocks->datanode map and node's map.
>  |    //
>  |    addStoredBlock(block, node, delHintNode );
>  |    pendingReplications.remove(block);
> Hence, the newly replicated replica is counted twice, the block appears
> over-replicated, and a healthy replica is marked as excess and mistakenly deleted.
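> To make the double count concrete, here is a minimal, self-contained sketch
> (plain Java, not the Hadoop code; the class name, the datanode names and the
> replication factor of 3 are assumptions for illustration). It replays the
> situation in the logs at the end of this report: two healthy replicas, one
> re-replication in flight, and the newly replicated block just reported:
>  |  import java.util.HashSet;
>  |  import java.util.Set;
>  |
>  |  // Minimal model of the counting order in blockReceived() as of 0.18.1.
>  |  public class ReplicaCountSketch {
>  |    public static void main(String[] args) {
>  |      final int replication = 3;                        // assumed dfs.replication
>  |      Set<String> liveReplicas = new HashSet<String>(); // stand-in for blocksMap
>  |      liveReplicas.add("dn1");                          // two healthy replicas left;
>  |      liveReplicas.add("dn2");                          // the corrupt one is not counted
>  |      int pendingReplications = 1;                      // re-replication still in flight
>  |
>  |      // blockReceived() in 0.18.1: addStoredBlock() runs first ...
>  |      liveReplicas.add("dn3");                          // new replica recorded as live
>  |      int numCurrentReplica = liveReplicas.size() + pendingReplications;
>  |      // ... and only afterwards is the pending entry removed.
>  |      pendingReplications--;
>  |
>  |      // The new replica has been counted as live *and* as pending: 3 + 1 = 4 > 3.
>  |      // In the real namenode this surplus triggers an excess-replica deletion.
>  |      System.out.println(numCurrentReplica + " replicas counted, excess = "
>  |          + (numCurrentReplica > replication));
>  |    }
>  |  }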
> I think reordering these two lines in blockReceived() may solve this
> issue:
> --- FSNamesystem.java-orig      2008-11-28 13:34:40.000000000 +0800
> +++ FSNamesystem.java   2008-11-28 13:54:12.000000000 +0800
> @@ -3152,8 +3152,8 @@
>     //
>     // Modify the blocks->datanode map and node's map.
>     //
> -    addStoredBlock(block, node, delHintNode );
>     pendingReplications.remove(block);
> +    addStoredBlock(block, node, delHintNode );
>   }
>   long[] getStats() throws IOException {
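> With the two lines swapped as in the patch above, the pending entry for the
> block is removed before addStoredBlock() does its counting, so the newly
> received replica is counted only once: in the sketch above this gives 3 live
> replicas and 0 pendings, which matches the replication factor, and no healthy
> replica is marked as excess.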
> The following are the logs of the mistaken deletion, with additional
> logging inserted by me.
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: *DIR*
> NameNode.reportBadBlocks
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
> NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
> corrupt on 192.168.33.51:50010 by /192.168.33.51
> 2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
> datanode(s) 192.168.33.45:50010
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
> added to blk_3828935579548953768_1184 size 67108864
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
> Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
> pendings
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
> NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
> 192.168.33.51:50010
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
> 192.168.33.51:50010
> 2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.44:50010 to delete  blk_3828935579548953768_1184
> 2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.51:50010 to delete  blk_3828935579548953768_1184

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

