hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11019) Inconsistent number of corrupt replicas if a corrupt replica is reported multiple times
Date Mon, 17 Oct 2016 13:52:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582321#comment-15582321
] 

Kihwal Lee commented on HDFS-11019:
-----------------------------------

[~kshukla] has debugged and worked on this quite a bit. She might have an idea.

> Inconsistent number of corrupt replicas if a corrupt replica is reported multiple times
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-11019
>                 URL: https://issues.apache.org/jira/browse/HDFS-11019
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>         Environment: CDH5.7.2 
>            Reporter: Wei-Chiu Chuang
>
> While investigating a block corruption issue, I found the following warning message in the namenode log:
> {noformat}
> (a client reports a block replica is corrupt)
> 2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1073803461 added as corrupt on 10.0.0.63:50010 by /10.0.0.62  because client machine reported it
> 2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK* invalidateBlock: blk_1073803461_74513(stored=blk_1073803461_74553) on 10.0.0.63:50010
> 2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_1073803461_74513 to 10.0.0.63:50010
> (another client reports a block replica is corrupt)
> 2016-10-12 10:07:37,728 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1073803461 added as corrupt on 10.0.0.63:50010 by /10.0.0.64  because client machine reported it
> 2016-10-12 10:07:37,728 INFO BlockStateChange: BLOCK* invalidateBlock: blk_1073803461_74513(stored=blk_1073803461_74553) on 10.0.0.63:50010
> (ReplicationMonitor thread kicks in to invalidate the replica and add a new one)
> 2016-10-12 10:07:37,888 INFO BlockStateChange: BLOCK* ask 10.0.0.56:50010 to replicate blk_1073803461_74553 to datanode(s) 10.0.0.63:50010
> 2016-10-12 10:07:37,888 INFO BlockStateChange: BLOCK* BlockManager: ask 10.0.0.63:50010 to delete [blk_1073803461_74513]
> (the two maps are inconsistent)
> 2016-10-12 10:08:00,335 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1073803461_74553 blockMap has 0 but corrupt replicas map has 1
> {noformat}
> It seems that when a corrupt block replica is reported twice, blocksMap and corruptReplicasMap become inconsistent.
> Looking at the log, I suspect the bug is in {{BlockManager#removeStoredBlock}}. When a corrupt replica is reported, BlockManager removes the block from blocksMap. If the block has already been removed (that is, the corrupt replica is being reported a second time), the method returns; otherwise (that is, the corrupt replica is being reported for the first time), it removes the block from corruptReplicasMap. (The block is added to corruptReplicasMap in {{BlockManager#markBlockAsCorrupt}}.) Therefore, after the second corruption report, the corrupt replica has been removed from blocksMap, but the entry in corruptReplicasMap has not.
> I can't tell what the impact of this inconsistency is, but I think it's a good idea to fix it.
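
The sequence suspected in the description above can be sketched with a toy model of the two maps. This is only an illustration under stated assumptions, not the actual BlockManager code: the class CorruptReplicaSketch, its plain HashMap fields, and the helper methods addStoredBlock, corruptInBlocksMap, and corruptInCorruptReplicasMap are hypothetical names, and the ReplicationMonitor activity seen in the log is not modeled. Only the control flow described in the report (add to corruptReplicasMap, then an early return in removeStoredBlock when the replica is already gone) is reproduced.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the suspected behavior; NOT the real HDFS BlockManager. */
class CorruptReplicaSketch {
    // block id -> datanodes holding a replica (stands in for blocksMap)
    private final Map<String, Set<String>> blocksMap = new HashMap<>();
    // block id -> datanodes whose replica was reported corrupt
    private final Map<String, Set<String>> corruptReplicasMap = new HashMap<>();

    /** Hypothetical helper: register a replica of a block on a datanode. */
    void addStoredBlock(String block, String node) {
        blocksMap.computeIfAbsent(block, k -> new HashSet<>()).add(node);
    }

    /** A corruption report: record the replica as corrupt, then invalidate it. */
    void markBlockAsCorrupt(String block, String node) {
        corruptReplicasMap.computeIfAbsent(block, k -> new HashSet<>()).add(node);
        removeStoredBlock(block, node);
    }

    /** Mirrors the logic described in the report, including the early return. */
    void removeStoredBlock(String block, String node) {
        Set<String> nodes = blocksMap.get(block);
        if (nodes == null || !nodes.remove(node)) {
            // Replica already removed (second report): return without
            // touching corruptReplicasMap, leaving a stale entry behind.
            return;
        }
        Set<String> corrupt = corruptReplicasMap.get(block);
        if (corrupt != null) {
            corrupt.remove(node);
        }
    }

    /** Number of replicas still in blocksMap that are also marked corrupt. */
    int corruptInBlocksMap(String block) {
        Set<String> stored = blocksMap.getOrDefault(block, Collections.emptySet());
        Set<String> corrupt = corruptReplicasMap.getOrDefault(block, Collections.emptySet());
        int count = 0;
        for (String node : stored) {
            if (corrupt.contains(node)) {
                count++;
            }
        }
        return count;
    }

    /** Number of entries recorded for the block in corruptReplicasMap. */
    int corruptInCorruptReplicasMap(String block) {
        return corruptReplicasMap.getOrDefault(block, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        CorruptReplicaSketch bm = new CorruptReplicaSketch();
        String blk = "blk_1073803461";
        bm.addStoredBlock(blk, "10.0.0.56:50010"); // healthy replica
        bm.addStoredBlock(blk, "10.0.0.63:50010"); // replica that gets reported

        bm.markBlockAsCorrupt(blk, "10.0.0.63:50010"); // first client report
        bm.markBlockAsCorrupt(blk, "10.0.0.63:50010"); // second client report

        System.out.println("blocksMap corrupt count: " + bm.corruptInBlocksMap(blk));
        System.out.println("corruptReplicasMap count: " + bm.corruptInCorruptReplicasMap(blk));
    }
}
```

In this model the two counts agree (0 and 0) after the first report and diverge (0 and 1) after the second, which matches the shape of the "blockMap has 0 but corrupt replicas map has 1" warning. Whether the real code path behaves the same way would need to be confirmed against the BlockManager source.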



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
