hadoop-hdfs-dev mailing list archives

From "Wellington Chevreuil (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-11821) BlockManager.getMissingReplOneBlocksCount() does not report correct value if corrupt file with replication factor of 1 gets deleted
Date Sat, 13 May 2017 23:35:04 GMT
Wellington Chevreuil created HDFS-11821:
-------------------------------------------

             Summary: BlockManager.getMissingReplOneBlocksCount() does not report correct
value if corrupt file with replication factor of 1 gets deleted
                 Key: HDFS-11821
                 URL: https://issues.apache.org/jira/browse/HDFS-11821
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs
    Affects Versions: 3.0.0-alpha2
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil
            Priority: Minor


*BlockManager* keeps a separate metric for the number of missing blocks with a replication factor
of 1. This metric is returned by the *BlockManager.getMissingReplOneBlocksCount()* method,
and it is what *dfsadmin -report* displays in the attribute below (in this example,
one corrupt block belongs to a file with a replication factor of 1):

{noformat}
...
Missing blocks (with replication factor 1): 1
...
{noformat}

However, if the related file is deleted (for instance, with the *hdfs fsck -delete* option),
this metric is never updated, and *dfsadmin -report* keeps reporting a missing block
even though the file no longer exists. The only available workaround is to restart the
NN, which clears the metric.

This can be easily reproduced by forcing corruption of a file with replication factor 1, as follows:

1) Put a file into hdfs with replication factor 1:

{noformat}
$ hdfs dfs -Ddfs.replication=1 -put test_corrupt /
$ hdfs dfs -ls /

-rw-r--r--   1 hdfs     supergroup         19 2017-05-10 09:21 /test_corrupt

{noformat}

2) Find the related block for the file and delete it from the DN:

{noformat}
$ hdfs fsck /test_corrupt -files -blocks -locations

...
/test_corrupt 19 bytes, 1 block(s):  OK
0. BP-782213640-172.31.113.82-1494420317936:blk_1073742742_1918 len=19 Live_repl=1 [DatanodeInfoWithStorage[172.31.112.178:20002,DS-a0dc0b30-a323-4087-8c36-26ffdfe44f46,DISK]]

Status: HEALTHY
...

$ find /dfs/dn/ -name 'blk_1073742742*'

/dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
/dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta

$ rm -rf /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742
$ rm -rf /dfs/dn/current/BP-782213640-172.31.113.82-1494420317936/current/finalized/subdir0/subdir3/blk_1073742742_1918.meta

{noformat}

3) Running fsck will report the corruption as expected:

{noformat}
$ hdfs fsck /test_corrupt -files -blocks -locations

...
/test_corrupt 19 bytes, 1 block(s): 
/test_corrupt: CORRUPT blockpool BP-782213640-172.31.113.82-1494420317936 block blk_1073742742
 MISSING 1 blocks of total size 19 B
...
Total blocks (validated):	1 (avg. block size 19 B)
  ********************************
  UNDER MIN REPL'D BLOCKS:	1 (100.0 %)
  dfs.namenode.replication.min:	1
  CORRUPT FILES:	1
  MISSING BLOCKS:	1
  MISSING SIZE:		19 B
  CORRUPT BLOCKS: 	1
...
{noformat}

4) *dfsadmin -report* shows the same:

{noformat}
$ hdfs dfsadmin -report
...
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 1
Missing blocks (with replication factor 1): 1
...
{noformat}

5) Running fsck with the *-delete* option makes fsck report a healthy filesystem again,
but *dfsadmin -report* still shows the missing block with replication factor 1:

{noformat}

$ hdfs fsck /test_corrupt -delete
...
$ hdfs fsck /
...
The filesystem under path '/' is HEALTHY
...

$ hdfs dfsadmin -report
...
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 1
...
{noformat}
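The inconsistency in the last report can also be spotted mechanically. Below is a minimal sketch, assuming that the replication-factor-1 count should never exceed the total "Missing blocks" count and that the report lines look exactly like the ones quoted above. The class name *ReportCheck* and its methods are hypothetical helpers written for this illustration, not part of Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: flags a dfsadmin report whose repl-1 missing count looks stale. */
class ReportCheck {
    // Patterns assume the exact line format shown in the report excerpts above.
    private static final Pattern MISSING =
        Pattern.compile("^Missing blocks: (\\d+)$");
    private static final Pattern MISSING_REPL_ONE =
        Pattern.compile("^Missing blocks \\(with replication factor 1\\): (\\d+)$");

    /** Returns the first value matched by the pattern, or -1 if no line matches. */
    static long extract(List<String> lines, Pattern pattern) {
        for (String line : lines) {
            Matcher m = pattern.matcher(line.trim());
            if (m.matches()) {
                return Long.parseLong(m.group(1));
            }
        }
        return -1;
    }

    /** Assumption: repl-1 missing blocks are a subset of all missing blocks. */
    static boolean looksStale(List<String> lines) {
        long missing = extract(lines, MISSING);
        long missingReplOne = extract(lines, MISSING_REPL_ONE);
        return missing >= 0 && missingReplOne > missing;
    }

    public static void main(String[] args) {
        List<String> afterDelete = Arrays.asList(
            "Under replicated blocks: 0",
            "Blocks with corrupt replicas: 0",
            "Missing blocks: 0",
            "Missing blocks (with replication factor 1): 1");
        System.out.println("stale counter suspected: " + looksStale(afterDelete));
    }
}
```

Fed the report from step 4 (both counts equal to 1), the same check returns false, so it only fires on the state left behind after the delete.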

The problem seems to be in the *BlockManager.removeBlock()* method, which in turn uses the utility class
*LowRedundancyBlocks*. This class classifies blocks according to their current replication level, including
blocks currently marked as corrupt.

The metric shown by *dfsadmin -report* for corrupt blocks with replication factor
1 is tracked in *LowRedundancyBlocks*. Whenever a block is marked as corrupt and has a
replication factor of 1, the metric is updated. When removing the block, though, *BlockManager.removeBlock()*
calls *LowRedundancyBlocks.remove(BlockInfo block, int priLevel)*, which does not check
whether the given block was previously marked as corrupt with replication factor 1, and
therefore does not update the metric.

I will shortly propose a patch that seems to fix this by making *BlockManager.removeBlock()*
call *LowRedundancyBlocks.remove(BlockInfo block, int oldReplicas, int oldReadOnlyReplicas, int
outOfServiceReplicas, int oldExpectedReplicas)* instead, which does update the metric properly.
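The difference between the two removal paths can be illustrated with a small model. This is only a sketch: *LowRedundancyModel*, its fields, the *CORRUPT_LEVEL* constant and the reduced parameter lists are hypothetical simplifications made for illustration, not the real *LowRedundancyBlocks* code. It shows a counter that is incremented when a replication-factor-1 block becomes corrupt, a level-only removal that leaves the counter untouched, and a replica-aware removal that decrements it.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the bookkeeping; NOT the real Hadoop class. */
class LowRedundancyModel {
    // Assumed identifier for the queue that holds blocks with no live replicas.
    static final int CORRUPT_LEVEL = 4;

    private final Set<Long> corruptQueue = new HashSet<>();
    private long corruptReplOneCount = 0;

    /** Tracks a block with no live replicas; counts it if its expected replication is 1. */
    void addCorrupt(long blockId, int expectedReplicas) {
        if (corruptQueue.add(blockId) && expectedReplicas == 1) {
            corruptReplOneCount++;
        }
    }

    /** Models the level-only removal: drops the block but never touches the counter. */
    boolean removeByLevel(long blockId, int priLevel) {
        return priLevel == CORRUPT_LEVEL && corruptQueue.remove(blockId);
    }

    /** Models the replica-aware removal: also decrements the counter when it applies. */
    boolean removeWithReplicaInfo(long blockId, int oldReplicas, int oldExpectedReplicas) {
        boolean removed = corruptQueue.remove(blockId);
        if (removed && oldReplicas == 0 && oldExpectedReplicas == 1) {
            corruptReplOneCount--;
        }
        return removed;
    }

    long getCorruptReplOneCount() {
        return corruptReplOneCount;
    }
}

class StaleCounterDemo {
    public static void main(String[] args) {
        LowRedundancyModel levelOnly = new LowRedundancyModel();
        levelOnly.addCorrupt(1073742742L, 1);
        levelOnly.removeByLevel(1073742742L, LowRedundancyModel.CORRUPT_LEVEL);
        // Counter stays at 1 although the block is gone.
        System.out.println("level-only removal: " + levelOnly.getCorruptReplOneCount());

        LowRedundancyModel replicaAware = new LowRedundancyModel();
        replicaAware.addCorrupt(1073742742L, 1);
        replicaAware.removeWithReplicaInfo(1073742742L, 0, 1);
        // Counter returns to 0.
        System.out.println("replica-aware removal: " + replicaAware.getCorruptReplOneCount());
    }
}
```

In this model the level-only path has no way to tell that the removed block contributed to the counter, which is why the caller has to pass the old replica counts for the metric to be corrected.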




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org

