hadoop-hdfs-issues mailing list archives

From "Manoj Govindassamy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-10819) BlockManager fails to store a good block for a datanode storage after it reported a corrupt block — block replication stuck
Date Wed, 31 Aug 2016 19:05:20 GMT

     [ https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manoj Govindassamy updated HDFS-10819:
--------------------------------------
    Attachment: HDFS-10819.001.patch

*Problem:*
- BlockManager reports an incorrect replica count for a file block even after the block has been successfully replicated to all replicas.
- TestDataNodeHotSwapVolumes fails with a "TimeoutException: Timed out waiting for /test to reach 3 replicas" error.

*Analysis:*
- The client wrote data to DN1 as part of the initial write pipeline DN1 -> DN2 -> DN3.
- DN1 persisted the block BLK_xyz_001 (say, on storage volume *S1*), mirrored the block to the downstream nodes, and waited for the ack.
- Later, one of the storage volumes in DN1 (say S2) was removed. The client detected the pipeline issue, triggered pipeline recovery, and got the new write pipeline DN2 -> DN3.
- On a successful {{FSNameSystem::updatePipeline}} request from the client, the NameNode bumped the generation stamp (from 001 to 002) of the under-construction (that is, the last) block of the file.
- The client wrote the block with the new generation stamp, BLK_xyz_002, to the nodes in the new write pipeline (DN2 and DN3).
- The client closed the file stream. The NameNode ran the low-redundancy checker for all the blocks in the file and detected that block BLK_xyz had a replica count of 2 vs. the expected 3.
- The NameNode asked DN2 to replicate BLK_xyz_002 to DN1. Say DN1 persisted BLK_xyz_002 onto storage volume *S1* again.
- DN1 then sent an incremental block report (IBR) to the NameNode with the RECEIVED_BLOCK info for BLK_xyz_002 on *S1*.

- BlockManager processed the incremental block report from DN1 and tried to store the block BLK_xyz_002 (its metadata) for DN1 on storage *S1*.
- But DN1's *S1* already had BLK_xyz_001, which had been marked corrupt earlier as part of the pipeline update, so the check at line 2878 below failed.
- Thus, when a storage had a corrupt block and the same storage later reported a good block, BlockManager failed to update the block -> datanode mapping and to prune the neededReconstruction list. Refer to {{BlockManager::addStoredBlock}}, which is skipped because of the check in {{BlockManager::addStoredBlockUnderConstruction}} shown below:

{noformat}

  2871   void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,    
  2872       DatanodeStorageInfo storageInfo) throws IOException {    
  2873     BlockInfo block = ucBlock.storedBlock;
  2874     block.getUnderConstructionFeature().addReplicaIfNotPresent(    
  2875         storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);
  2876 
  2877     if (ucBlock.reportedState == ReplicaState.FINALIZED &&
  2878         (block.findStorageInfo(storageInfo) < 0)) {    
  2879       addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
  2880     }   
  2881   }   

{noformat}
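
(For context, a rough, simplified sketch of what {{BlockInfo::findStorageInfo}} does; this is illustrative only and not the verbatim HDFS source.)

{noformat}
  // Simplified sketch: return the index of 'storage' in this block's
  // storage list, or -1 if the storage is not listed for this block.
  // In the scenario above, S1 is still listed because of the stale,
  // corrupt gen-stamp 001 replica, so the result is >= 0 and the check
  // at line 2878 skips addStoredBlock() for the good replica.
  int findStorageInfo(DatanodeStorageInfo storage) {
    for (int idx = 0; idx < getCapacity(); idx++) {
      if (getStorageInfo(idx) == storage) {
        return idx;
      }
    }
    return -1;
  }
{noformat}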

- The Replication Monitor, which runs continuously, tried to reconstruct the block on DN1, but {{BlockPlacementPolicyDefault}} failed to choose the same target:

{noformat}
1148 2016-08-25 18:21:19,853 [ReplicationMonitor] WARN  net.NetworkTopology (NetworkTopology.java:chooseRandom(816))
- Failed to find datanode (scope="" excludedScope="/default-rack").
1149 2016-08-25 18:21:19,853 [ReplicationMonitor] WARN  blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough replicas, still
in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information,
please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
1150 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  net.NetworkTopology (NetworkTopology.java:chooseRandom(816))
- Failed to find datanode (scope="" excludedScope="/default-rack").
1151 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough replicas, still
in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7,
storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false)
For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
1152 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161))
- Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected
(replication=3, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK],      policy=BlockStoragePolicy{HOT:7,
storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
1153 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough replicas, still
in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7,
storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false)
All required storage types are unavailable:  unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7,
storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
1154 2016-08-25 18:21:19,854 [ReplicationMonitor] DEBUG BlockStateChange (BlockManager.java:computeReconstructionWorkForBlocks(1680))
- BLOCK* neededReconstruction = 1 pendingReconstruction = 0
{noformat}


*Fix:*

- {{BlockManager::addStoredBlockUnderConstruction}} should not check the block -> datanode storage mapping before invoking {{BlockManager::addStoredBlock}} (see the sketch after this list).
- {{BlockManager::addStoredBlock}} already handles the block addition / replacement / already-exists cases, and, more importantly, it also prunes the {{LowRedundancyBlocks}} list.
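
For illustration only, here is a minimal sketch of the change described above (the authoritative change is the attached HDFS-10819.001.patch): the {{findStorageInfo}} guard is dropped so that {{addStoredBlock}} is always invoked for a FINALIZED replica and can itself handle the replacement and prune the low-redundancy list.

{noformat}
  void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
      DatanodeStorageInfo storageInfo) throws IOException {
    BlockInfo block = ucBlock.storedBlock;
    block.getUnderConstructionFeature().addReplicaIfNotPresent(
        storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);

    // No block.findStorageInfo(storageInfo) check here: even when the storage
    // already appears in the block's storage list (e.g. via a stale, corrupt
    // replica), addStoredBlock() handles the add/replace/already-exists cases
    // and prunes the neededReconstruction (LowRedundancyBlocks) list.
    if (ucBlock.reportedState == ReplicaState.FINALIZED) {
      addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
    }
  }
{noformat}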

The attached patch has the fix. It also updates the unit test TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWrittenForDatanode to expose the race conditions, which helped reproduce the above problem frequently. With the proposed fix, BlockManager handles the case properly and the test passes.




> BlockManager fails to store a good block for a datanode storage after it reported a corrupt
block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10819
>                 URL: https://issues.apache.org/jira/browse/HDFS-10819
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test testRemoveVolumeBeingWrittenForDatanode.
A data write pipeline can run into issues such as timeouts or an unreachable datanode; in this test case the failure is an induced one, as one of the volumes in a datanode is removed while a block write is in progress. Digging further into the logs, when the problem happens in the write pipeline, the error recovery does not happen as expected, leading to block replication never catching up.
> Though this problem has the same signature as HDFS-10780, from the logs it looks like the code paths taken are totally different, so the root cause could be different as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


