hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3493) Replication is not happened for the block (which is recovered and in finalized) to the Datanode which has got the same block with old generation timestamp in RBW
Date Mon, 04 Jun 2012 05:52:23 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288357#comment-13288357 ]

Uma Maheswara Rao G commented on HDFS-3493:
-------------------------------------------

Looking at HDFS-2290, this is the existing behaviour: until a new replica of the
block is created, the corrupt replica will not be deleted.
This is already asserted in some test cases added by HDFS-2290.

See the test from HDFS-2290:

{code}
  /**
   * The corrupt block has to be removed when the number of valid replicas
   * matches replication factor for the file. In this test, the above
   * condition is achieved by increasing the number of good replicas by
   * replicating on a new Datanode.
   * The test strategy :
   *   Bring up Cluster with 3 DataNodes
   *   Create a file of replication factor 3
   *   Corrupt one replica of a block of the file
   *   Verify that there are still 2 good replicas and 1 corrupt replica
   *     (corrupt replica should not be removed since number of good replicas
   *      (2) is less than replication factor (3))
   *   Start a new data node
   *   Verify that the a new replica is created and corrupt replica is
   *   removed.
   *
   */
  @Test
  public void testByAddingAnExtraDataNode() throws IOException {
{code}
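
For reference, a rough sketch (illustration only, not the actual HDFS-2290 test body) of how such a test could be written against MiniDFSCluster; the replica-corruption step is only described in a comment:

{code}
// Sketch only, following the strategy described in the javadoc above.
Configuration conf = new HdfsConfiguration();
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
try {
  FileSystem fs = cluster.getFileSystem();
  Path file = new Path("/testFile");
  DFSTestUtil.createFile(fs, file, 512L, (short) 3, 0L);
  DFSTestUtil.waitReplication(fs, file, (short) 3);

  // Corrupt one replica on disk (e.g. with MiniDFSCluster's corruptReplica
  // helper) and trigger a block report so the NN marks it as corrupt.

  // At this point: 2 live replicas, 1 corrupt replica; the corrupt replica
  // is NOT invalidated because 2 < replication factor 3.

  // Start a 4th DataNode so that a new good replica can be created.
  cluster.startDataNodes(conf, 1, true, null, null);

  // Once the new replica is reported (3 live replicas), the corrupt replica
  // should be invalidated and removed.
} finally {
  cluster.shutdown();
}
{code}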


The condition below will not allow the corrupt replica to be invalidated, even if we corrupt one more replica.

{code}
    node.addBlock(storedBlock);

    // Add this replica to corruptReplicas Map
    corruptReplicas.addToCorruptReplicasMap(storedBlock, node, reason);
    if (countNodes(storedBlock).liveReplicas() >= bc.getReplication()) {
{code}

With replication factor 3, 1 corrupt replica, and 2 live replicas, the above condition is not
satisfied. The block is only added to neededReplications, and since there is no additional DN
here, it will not be able to replicate either.
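
To make the numbers explicit, here is a small illustration (plain Java, not HDFS code) of how that check plays out in this scenario:

{code}
// Illustration only: the invalidation check with the counts discussed above.
int replicationFactor = 3;  // bc.getReplication()
int liveReplicas      = 2;  // countNodes(storedBlock).liveReplicas()
int corruptReplicas   = 1;

if (liveReplicas >= replicationFactor) {
  // 2 >= 3 is false, so the branch that invalidates the corrupt replica is skipped.
} else {
  // The block is only added to neededReplications; with no spare DataNode
  // available, re-replication cannot happen and the corrupt replica stays.
}
{code}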

Here my concern is: if we corrupt one more replica, we have replication factor 3,
2 corrupt replicas, and 1 live replica.
Even then it will not be able to replicate or invalidate, so we may end up running with one good
replica, even though we have 3 DNs in the cluster. In such a small cluster this is a risk. In bigger
clusters this problem will not occur because there are more nodes available, so
replication will happen successfully. This is only a problem when the cluster size is
equal to the replication factor at that moment.
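
A small illustration of that scenario (assuming, as the "Not able to place enough replicas" log below suggests, that the placement policy will not pick a node that already holds a replica of this block):

{code}
// Illustration only: 3-DN cluster, replication factor 3, 2 corrupt replicas.
int clusterSize       = 3;
int replicationFactor = 3;
int liveReplicas      = 1;
int corruptReplicas   = 2;

// liveReplicas (1) < replicationFactor (3), so the corrupt replicas are not invalidated.
// Every DataNode already holds some replica of the block (live or corrupt),
// so there is no valid target left for re-replication:
int candidateTargets = clusterSize - liveReplicas - corruptReplicas;  // 0

// Result: the file keeps running on a single good replica even though the
// cluster has 3 DataNodes.
{code}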
                
> Replication is not happened for the block (which is recovered and in finalized) to the Datanode which has got the same block with old generation timestamp in RBW
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3493
>                 URL: https://issues.apache.org/jira/browse/HDFS-3493
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.1-alpha
>            Reporter: J.Andreina
>
> Replication factor = 3, block report interval = 1 min; start NN and 3 DNs.
> Step 1: Write a file without closing it and do hflush (DN1, DN2, DN3 have blk_ts1)
> Step 2: Stop DN3
> Step 3: Recovery happens and the timestamp is updated (blk_ts2)
> Step 4: Close the file
> Step 5: blk_ts2 is finalized and available on DN1 and DN2
> Step 6: Now restart DN3 (which has blk_ts1 in RBW)
> From the NN side no command is issued to DN3 to delete blk_ts1, but DN3 is asked to mark the block as corrupt.
> Replication of blk_ts2 to DN3 does not happen.
> NN logs:
> ========
> {noformat}
> INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match COMPLETE block's genstamp in block map 1008
> INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(XX.XX.XX.XX, storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, ipcPort=50277, storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0), blocks: 2, processing time: 1 msecs
> INFO org.apache.hadoop.hdfs.StateChange: BLOCK* Removing block blk_3927215081484173742_1008 from neededReplications as it has enough replicas.
> INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match COMPLETE block's genstamp in block map 1008
> INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(XX.XX.XX.XX, storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, ipcPort=50277, storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0), blocks: 2, processing time: 1 msecs
> WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1
> For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> {noformat}
> fsck Report
> ===========
> {noformat}
> /file21:  Under replicated BP-1008469586-XX.XX.XX.XX-1336829603103:blk_3927215081484173742_1008. Target Replicas is 3 but found 2 replica(s).
> .Status: HEALTHY
>  Total size:	495 B
>  Total dirs:	1
>  Total files:	3
>  Total blocks (validated):	3 (avg. block size 165 B)
>  Minimally replicated blocks:	3 (100.0 %)
>  Over-replicated blocks:	0 (0.0 %)
>  Under-replicated blocks:	1 (33.333332 %)
>  Mis-replicated blocks:		0 (0.0 %)
>  Default replication factor:	1
>  Average block replication:	2.0
>  Corrupt blocks:		0
>  Missing replicas:		1 (14.285714 %)
>  Number of data-nodes:		3
>  Number of racks:		1
> FSCK ended at Sun May 13 09:49:05 IST 2012 in 9 milliseconds
> The filesystem under path '/' is HEALTHY
> {noformat}
