hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6094) The same block can be counted twice towards safe mode threshold
Date Fri, 14 Mar 2014 01:02:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934384#comment-13934384 ]

Jing Zhao commented on HDFS-6094:
---------------------------------

I can also reproduce the issue on my local machine. The sequence appears to be:
1. After the standby NN (SBN) restarts, DN1 sends first an incremental block report and then a
full block report to the SBN.
2. DN2 sends an incremental block report to the SBN. This report does not change the replica
count in the SBN because the corresponding storage ID has not been added there yet (a storage
ID is only added during full block report processing). However, the SBN still checks
the current live replica count (which is 1, because the SBN has already received the full block
report from DN1) and uses that number to update the safe block count, so the block is counted
a second time.

So a simple fix might be:
{code}
@@ -2277,7 +2277,7 @@ private Block addStoredBlock(final BlockInfo block,
     if(storedBlock.getBlockUCState() == BlockUCState.COMMITTED &&
         numLiveReplicas >= minReplication) {
       storedBlock = completeBlock(bc, storedBlock, false);
-    } else if (storedBlock.isComplete()) {
+    } else if (storedBlock.isComplete() && added) {
       // check whether safe replication is reached for the block
       // only complete blocks are counted towards that
       // Is no-op if not in safe mode.
{code}
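
To make the sequence above concrete, here is a minimal, self-contained sketch of the double count and of the effect of the extra "added" check. It is a toy model, not the real BlockManager: the class SafeBlockCountSketch, its fields, and the replay() helper are all hypothetical, and it assumes the safe block count is incremented whenever the live replica count equals a safe replication of 1.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the standby NN's bookkeeping (hypothetical, not HDFS code).
class SafeBlockCountSketch {
    static final int SAFE_REPLICATION = 1; // assumed value for this sketch

    // Storages become known only when a full block report is processed.
    private final Set<String> knownStorages = new HashSet<>();
    // Block id -> storages holding a replica, as recorded by the NN.
    private final Map<Long, Set<String>> replicas = new HashMap<>();
    private final boolean requireAdded; // true = apply the proposed check
    private int safeBlockCount = 0;

    SafeBlockCountSketch(boolean requireAdded) {
        this.requireAdded = requireAdded;
    }

    void fullBlockReport(String storage, long... blockIds) {
        knownStorages.add(storage);
        for (long id : blockIds) {
            addStoredBlock(storage, id);
        }
    }

    void incrementalBlockReport(String storage, long blockId) {
        addStoredBlock(storage, blockId);
    }

    private void addStoredBlock(String storage, long blockId) {
        boolean added = false;
        if (knownStorages.contains(storage)) {
            added = replicas
                .computeIfAbsent(blockId, k -> new HashSet<>())
                .add(storage);
        }
        int numLiveReplicas =
            replicas.getOrDefault(blockId, Collections.<String>emptySet()).size();
        // Without the check, an ignored report still re-reads the live count.
        if ((!requireAdded || added) && numLiveReplicas == SAFE_REPLICATION) {
            safeBlockCount++;
        }
    }

    // Replays the two steps described above for a single block.
    static int replay(boolean requireAdded) {
        SafeBlockCountSketch sbn = new SafeBlockCountSketch(requireAdded);
        sbn.incrementalBlockReport("DN1-storage", 1L); // storage unknown, ignored
        sbn.fullBlockReport("DN1-storage", 1L);        // replica added, counted
        sbn.incrementalBlockReport("DN2-storage", 1L); // storage unknown
        return sbn.safeBlockCount;
    }

    public static void main(String[] args) {
        System.out.println("without check: " + replay(false)); // prints 2
        System.out.println("with check:    " + replay(true));  // prints 1
    }
}
```

In this model, DN2's incremental report adds no replica, yet without the check it still finds a live replica count of 1 and bumps the counter for a block that was already counted.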

> The same block can be counted twice towards safe mode threshold
> ---------------------------------------------------------------
>
>                 Key: HDFS-6094
>                 URL: https://issues.apache.org/jira/browse/HDFS-6094
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>
> {{BlockManager#addStoredBlock}} can cause the same block to be counted twice towards the safe mode threshold. We see this manifest via {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More details to follow in a comment.
> Exception details:
> {code}
>   Time elapsed: 12.874 sec  <<< FAILURE!
> java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported blocks
7 has reached the threshold 0.9990 of total blocks 6. The number of live datanodes 3 has reached
the minimum number 0. Safe mode will be turned off automatically in 28 seconds.'
>         at org.junit.Assert.fail(Assert.java:93)
>         at org.junit.Assert.assertTrue(Assert.java:43)
>         at org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
>         at org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
