From "Ashish Singhi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3157) Error in deleting block is keep on coming from DN even after the block report and directory scanning has happened
Date Sun, 13 May 2012 10:46:49 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274245#comment-13274245 ]

Ashish Singhi commented on HDFS-3157:
-------------------------------------

Currently I am working on the following solution for the patch: rebuilding the blockInfo
with the reported block's genstamp and all other state kept the same as storedBlock.
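As a rough, self-contained sketch of that idea (SimpleBlockInfo and its fields are
illustrative stand-ins of my own, not the actual HDFS BlockInfo API):
{code}
// Hypothetical stand-in for the real BlockInfo; shown only to illustrate
// "rebuild with the reported genstamp, keep everything else from storedBlock".
class SimpleBlockInfo {
    final long blockId;
    final long numBytes;
    final long genStamp;

    SimpleBlockInfo(long blockId, long numBytes, long genStamp) {
        this.blockId = blockId;
        this.numBytes = numBytes;
        this.genStamp = genStamp;
    }

    /** Same id and length as the stored block; genstamp taken from the report. */
    static SimpleBlockInfo rebuild(SimpleBlockInfo stored, long reportedGenStamp) {
        return new SimpleBlockInfo(stored.blockId, stored.numBytes, reportedGenStamp);
    }
}
{code}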
Even with this solution, the test case may fail intermittently. The reason: although the
reported block is added to corruptReplicasMap, it does not get invalidated on the DN that
reported the corrupt block, because a corrupt replica is invalidated only after the number
of live replicas for the block reaches the configured replication factor (a toy model of
this gate follows).
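{code}
// Toy model of the invalidation gate, with hypothetical names; the real
// check is spread across the NN's replication and invalidation logic.
class InvalidationGateDemo {
    static boolean mayInvalidateCorruptReplica(int liveReplicas, int replicationFactor) {
        // The NN holds back the invalidate work until the block is fully replicated.
        return liveReplicas >= replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(mayInvalidateCorruptReplica(1, 2)); // false: corrupt copy kept
        System.out.println(mayInvalidateCorruptReplica(2, 2)); // true: now safe to delete
    }
}
{code}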
Problem - if chooseTarget() picks the same DN that is reporting the corrupt block, the
replication will fail with ReplicaAlreadyExistsException.
The question, then, is why the NN picks the same DN that is reporting the corrupt block
rather than the third DN.
Answer - the excludedNodes map contains only the one DN that holds the live replica of the
block (i.e., the DN that has the block in its finalized folder), as the sketch below
illustrates.
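ChooseTargetDemo and its first-fit policy here are hypothetical simplifications of my own,
not the real BlockPlacementPolicy, but they show the failure mode:
{code}
import java.util.*;

// Toy model: with only the live-replica holder excluded, the chooser is
// free to hand the block to the very DN that reported the corrupt replica.
class ChooseTargetDemo {
    static String chooseTarget(List<String> allNodes, Set<String> excluded) {
        for (String dn : allNodes) {
            if (!excluded.contains(dn)) {
                return dn; // first non-excluded node wins in this toy model
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList(
                "127.0.0.1:54681",  // holds the live replica (excluded)
                "127.0.0.1:54041",  // reported the corrupt RBW replica
                "127.0.0.1:20029"); // third DN
        Set<String> excluded = Collections.singleton("127.0.0.1:54681");
        // Prints 127.0.0.1:54041 -- the reporter of the corrupt replica.
        System.out.println(chooseTarget(nodes, excluded));
    }
}
{code}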
The following partial logs depict the above scenario.
{code}
excludedNodes contains the following datanode/s.
{127.0.0.1:54681=127.0.0.1:54681}
2012-05-12 23:57:33,773 INFO  hdfs.StateChange (BlockManager.java:computeReplicationWorkForBlocks(1226))
- BLOCK* ask 127.0.0.1:54681 to replicate blk_3471690017167574595_1003 to datanode(s) 127.0.0.1:54041
2012-05-12 23:57:33,791 INFO  datanode.DataNode (DataNode.java:transferBlock(1221)) - DatanodeRegistration(127.0.0.1,
storageID=DS-1047816814-192.168.44.128-54681-1336847251649, infoPort=62840, ipcPort=26036,
storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0) Starting thread to transfer block
BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 to 127.0.0.1:54041
2012-05-12 23:57:33,795 INFO  hdfs.StateChange (BlockManager.java:processReport(1450)) - BLOCK*
processReport: from DatanodeRegistration(127.0.0.1, storageID=DS-1047816814-192.168.44.128-54681-1336847251649,
infoPort=62840, ipcPort=26036, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0),
blocks: 1, processing time: 0 msecs
2012-05-12 23:57:33,796 INFO  datanode.DataNode (BPServiceActor.java:blockReport(404)) - BlockReport
of 1 blocks took 0 msec to generate and 2 msecs for RPC and NN processing
2012-05-12 23:57:33,796 INFO  datanode.DataNode (BPServiceActor.java:blockReport(423)) - sent
block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@12eb0b3
2012-05-12 23:57:33,811 INFO  datanode.DataNode (DataXceiver.java:writeBlock(342)) - Receiving
block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 src: /127.0.0.1:33583
dest: /127.0.0.1:54041
2012-05-12 23:57:33,812 INFO  datanode.DataNode (DataXceiver.java:writeBlock(495)) - opWriteBlock
BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 received exception
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003
already exists in state RBW and thus cannot be created.
2012-05-12 23:57:33,814 ERROR datanode.DataNode (DataXceiver.java:run(193)) - 127.0.0.1:54041:DataXceiver
error processing WRITE_BLOCK operation  src: /127.0.0.1:33583 dest: /127.0.0.1:54041
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003
already exists in state RBW and thus cannot be created.
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:795)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:151)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:365)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:189)
        at java.lang.Thread.run(Thread.java:619)
2012-05-12 23:57:33,815 INFO  datanode.DataNode (DataNode.java:run(1406)) - DataTransfer:
Transmitted BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 (numBytes=100)
to /127.0.0.1:54041
2012-05-12 23:57:34,066 INFO  hdfs.StateChange (BlockManager.java:processReport(1450)) - BLOCK*
processReport: from DatanodeRegistration(127.0.0.1, storageID=DS-610636930-192.168.44.128-20029-1336847250644,
infoPort=52843, ipcPort=46734, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0),
blocks: 0, processing time: 0 msecs
2012-05-12 23:57:34,067 INFO  datanode.DataNode (BPServiceActor.java:blockReport(404)) - BlockReport
of 0 blocks took 0 msec to generate and 3 msecs for RPC and NN processing
2012-05-12 23:57:34,068 INFO  datanode.DataNode (BPServiceActor.java:blockReport(423)) - sent
block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@a1364a
2012-05-12 23:57:34,099 INFO  hdfs.StateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(66))
- BLOCK NameSystem.addToCorruptReplicasMap: blk_3471690017167574595 added as corrupt on 127.0.0.1:54041
by /127.0.0.1 because reported RBW replica with genstamp 1002 does not match COMPLETE block's
genstamp in block map 1003
2012-05-12 23:57:34,100 INFO  hdfs.StateChange (BlockManager.java:processReport(1450)) - BLOCK*
processReport: from DatanodeRegistration(127.0.0.1, storageID=DS-1452741455-192.168.44.128-54041-1336847250645,
infoPort=10314, ipcPort=16230, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0),
blocks: 1, processing time: 2 msecs
2012-05-12 23:57:34,101 INFO  datanode.DataNode (BPServiceActor.java:blockReport(404)) - BlockReport
of 1 blocks took 0 msec to generate and 4 msecs for RPC and NN processing
2012-05-12 23:57:34,101 INFO  datanode.DataNode (BPServiceActor.java:blockReport(423)) - sent
block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@17194a4
2012-05-12 23:57:34,775 INFO  hdfs.StateChange (BlockManager.java:computeReplicationWorkForBlocks(1096))
- BLOCK* Removing block blk_3471690017167574595_1003 from neededReplications as it has enough
replicas. 
{code}
Here you can observe that the NN picks for replication the same DN, 127.0.0.1:54041, that is
reporting the corrupt block, while the excludedNodes map contains only the one DN,
127.0.0.1:54681, that holds the live replica (printed on the first line of the logs above).

Is there any way to add the DN that is reporting the corrupt block to the excludedNodes map?
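For what it's worth, one possible shape of such a change, sketched with hypothetical helper
names (the real fix would presumably live in BlockManager, where the replication work and
excluded nodes are computed):
{code}
import java.util.*;

class ExcludeCorruptReporters {
    /**
     * Hypothetical helper: build the exclusion set from both the
     * live-replica holders and the DNs that reported corrupt replicas,
     * so target selection cannot hand the block back to a corrupt holder.
     */
    static Set<String> buildExcludedNodes(Collection<String> liveReplicaNodes,
                                          Collection<String> corruptReplicaNodes) {
        Set<String> excluded = new HashSet<String>(liveReplicaNodes);
        excluded.addAll(corruptReplicaNodes);
        return excluded;
    }

    public static void main(String[] args) {
        System.out.println(buildExcludedNodes(
                Arrays.asList("127.0.0.1:54681"),    // live replica holder
                Arrays.asList("127.0.0.1:54041"))); // corrupt-replica reporter
    }
}
{code}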
                
> Error in deleting block is keep on coming from DN even after the block report and directory
scanning has happened
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3157
>                 URL: https://issues.apache.org/jira/browse/HDFS-3157
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: J.Andreina
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0, 3.0.0
>
>         Attachments: HDFS-3157.patch, HDFS-3157.patch, HDFS-3157.patch
>
>
> Cluster setup:
> 1 NN, three DNs (DN1, DN2, DN3), replication factor 2, "dfs.blockreport.intervalMsec" 300,
"dfs.datanode.directoryscan.interval" 1
> step 1: Write one file "a.txt" with sync (not closed).
> step 2: Delete the blocks (from rbw) on one of the datanodes, say DN1, to which replication
happened.
> step 3: Close the file.
> Since the replication factor is 2, the blocks are replicated to the other datanode.
> Then on the NN side the following command is issued to the DN from which the block was deleted
> -------------------------------------------------------------------------------------
> {noformat}
> 2012-03-19 13:41:36,905 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap:
duplicate requested for blk_2903555284838653156 to add as corrupt on XX.XX.XX.XX by /XX.XX.XX.XX
because reported RBW replica with genstamp 1002 does not match COMPLETE block's genstamp in
block map 1003
> 2012-03-19 13:41:39,588 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* Removing block
blk_2903555284838653156_1003 from neededReplications as it has enough replicas.
> {noformat}
> On the datanode from which the block was deleted, the following exception occurred
> {noformat}
> 2012-02-29 13:54:13,126 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected
error trying to delete block blk_2903555284838653156_1003. BlockInfo not found in volumeMap.
> 2012-02-29 13:54:13,126 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing
datanode Command
> java.io.IOException: Error in deleting blocks.
> 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.invalidate(FSDataset.java:2061)
> 	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:581)
> 	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:545)
> 	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:690)
> 	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:522)
> 	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:662)
> 	at java.lang.Thread.run(Thread.java:619)
> {noformat}
