hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "zhouyingchao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-10240) Race between close/recoverLease leads to missing block
Date Thu, 31 Mar 2016 08:34:25 GMT

     [ https://issues.apache.org/jira/browse/HDFS-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

zhouyingchao updated HDFS-10240:
--------------------------------
    Attachment: HDFS-10240-001.patch

If the patch is ok, I would add some unit tests.

> Race between close/recoverLease leads to missing block
> ------------------------------------------------------
>
>                 Key: HDFS-10240
>                 URL: https://issues.apache.org/jira/browse/HDFS-10240
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: zhouyingchao
>            Assignee: zhouyingchao
>         Attachments: HDFS-10240-001.patch
>
>
> We got a missing block in our cluster, and logs related to the missing block are as follows:
> 2016-03-28,10:00:06,188 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock:
XXXXXXXXXX. BP-219149063-10.108.84.25-1446859315800 blk_1226490256_153006345{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
> 2016-03-28,10:00:06,205 INFO BlockStateChange: BLOCK* blk_1226490256_153006345{blockUCState=UNDER_RECOVERY,
primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
recovery started, primary=ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]
> 2016-03-28,10:00:06,205 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease:
File XXXXXXXXXX has not been closed. Lease recovery is in progress. RecoveryId = 153006357
for block blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
> 2016-03-28,10:00:06,248 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK*
checkFileProgress: blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]}
has not reached minimal replication 1
> 2016-03-28,10:00:06,358 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated:
10.114.5.53:11402 is added to blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]}
size 139
> 2016-03-28,10:00:06,441 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated:
10.114.5.44:11402 is added to blk_1226490256_153006345 size 139
> 2016-03-28,10:00:06,660 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated:
10.114.6.14:11402 is added to blk_1226490256_153006345 size 139
> 2016-03-28,10:00:08,808 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-219149063-10.108.84.25-1446859315800:blk_1226490256_153006345,
newgenerationstamp=153006357, newlength=139, newtargets=[10.114.6.14:11402, 10.114.5.53:11402,
10.114.5.44:11402], closeFile=true, deleteBlock=false)
> 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap:
blk_1226490256 added as corrupt on 10.114.6.14:11402 by /10.114.6.14 because block is COMPLETE
and reported genstamp 153006357 does not match genstamp in block map 153006345
> 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap:
blk_1226490256 added as corrupt on 10.114.5.53:11402 by /10.114.5.53 because block is COMPLETE
and reported genstamp 153006357 does not match genstamp in block map 153006345
> 2016-03-28,10:00:08,837 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap:
blk_1226490256 added as corrupt on 10.114.5.44:11402 by /10.114.5.44 because block is COMPLETE
and reported genstamp 153006357 does not match genstamp in block map 153006345
> From the log, I guess this is what has happened in order:
> 1  Process A open a file F for write.
> 2  Somebody else called recoverLease against F.
> 3  A closed F.
> The root cause of the missing block is that recoverLease increased gen count of blocks
on datanode whereas the gen count on Namenode is reset by close in step 3. I think we should
check if the last block is under recovery when trying to close a file. If so we should just
throw an exception to client quickly.
> Although the issue is found on a 2.4 HDFS, it looks like the issue also exist on the
trunk from code. I'll post a patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message