hadoop-hdfs-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11616) Namenode doesn't mark the block as non-corrupt if the reason for corruption was INVALID_STATE
Date Mon, 11 Sep 2017 05:18:01 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated HDFS-11616:
------------------------------
    Target Version/s: 2.8.3  (was: 2.8.1)

> Namenode doesn't mark the block as non-corrupt if the reason for corruption was INVALID_STATE
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11616
>                 URL: https://issues.apache.org/jira/browse/HDFS-11616
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.3
>            Reporter: Rushabh S Shah
>
> Due to a power failure event, we hit HDFS-5042.
> We lost many racks across the cluster.
> There were a couple of missing blocks.
> For a given missing block, the following is the output of fsck.
> {noformat}
> [hdfs@XXX rushabhs]$ hdfs fsck -blockId blk_8566436445
> Connecting to namenode via http://nn1:50070/fsck?ugi=hdfs&blockId=blk_8566436445+&path=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from XXX at Mon Apr 03 16:22:48 UTC 2017
> Block Id: blk_8566436445
> Block belongs to: <file>
> No. of Expected Replica: 3
> No. of live Replica: 0
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 3
> Block replica on datanode/rack: datanodeA is CORRUPT	 ReasonCode: INVALID_STATE
> Block replica on datanode/rack: datanodeB is CORRUPT	 ReasonCode: INVALID_STATE
> Block replica on datanode/rack: datanodeC is CORRUPT	 ReasonCode: INVALID_STATE
> {noformat}
> After the power event, when we restarted the datanode, the blocks were in the rbw directory.
> When a full block report is sent to the namenode, all the blocks in the rbw directory are converted into the RWR state, and the namenode marked them as corrupt with reason Reason.INVALID_STATE.
> After some time (in this case after 31 hours), when I went to recover the missing blocks, I noticed the following things.
> All the datanodes had their copy of the block in the rbw directory, but the file was complete according to the namenode.
> All the replicas had the right size and correct genstamp, and the {{hdfs debug verify}} command also succeeded.
> I went to dnA and moved the block from the rbw directory to the finalized directory.
> Restarted the datanode (making sure the replicas file was not present during startup).
> I forced a full block report (FBR) and made sure the datanode block reported to the namenode.
> After waiting for some time, that block was still missing.
> I expected the missing block to go away since the replica is in the finalized directory.
> On investigating more, I found out that the namenode will remove the replica from the corrupt map only if the reason for corruption was {{GENSTAMP_MISMATCH}}.
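
The behavior described in the report can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not the actual namenode code: the class CorruptReplicaSketch, its methods, and the clearInvalidState flag are hypothetical names introduced for illustration. Only the reason codes INVALID_STATE and GENSTAMP_MISMATCH and the replica states (RBW, RWR, FINALIZED) come from the report. The flag models what the reporter expected to happen; it is not meant to represent the eventual fix.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a corrupt-replica map for ONE complete block.
// It is not the real namenode implementation; names here are assumptions.
class CorruptReplicaSketch {
    enum Reason { GENSTAMP_MISMATCH, INVALID_STATE }
    enum ReplicaState { FINALIZED, RBW, RWR }

    // datanode name -> reason the replica was marked corrupt
    private final Map<String, Reason> corruptReplicas = new HashMap<>();

    // false: model the behavior observed in the report
    // true:  model the behavior the reporter expected
    private final boolean clearInvalidState;

    CorruptReplicaSketch(boolean clearInvalidState) {
        this.clearInvalidState = clearInvalidState;
    }

    // Process one replica from a block report for a block the namenode
    // already considers complete.
    void processReportedReplica(String datanode, ReplicaState state,
                                boolean genstampMatches) {
        if (state != ReplicaState.FINALIZED) {
            // A non-finalized replica of a complete block is in an invalid state.
            corruptReplicas.put(datanode, Reason.INVALID_STATE);
            return;
        }
        if (!genstampMatches) {
            corruptReplicas.put(datanode, Reason.GENSTAMP_MISMATCH);
            return;
        }
        // A good FINALIZED replica was reported: decide whether to clear
        // an earlier corrupt marking for this datanode.
        Reason previous = corruptReplicas.get(datanode);
        if (previous == null) {
            return;
        }
        boolean clearable = previous == Reason.GENSTAMP_MISMATCH
                || (clearInvalidState && previous == Reason.INVALID_STATE);
        if (clearable) {
            corruptReplicas.remove(datanode);
        }
    }

    boolean isCorrupt(String datanode) {
        return corruptReplicas.containsKey(datanode);
    }

    public static void main(String[] args) {
        for (boolean expectedBehavior : new boolean[] {false, true}) {
            CorruptReplicaSketch nn = new CorruptReplicaSketch(expectedBehavior);
            // After restart, the replica is reported from the rbw directory as RWR.
            nn.processReportedReplica("datanodeA", ReplicaState.RWR, true);
            // The replica is moved to the finalized directory and reported again.
            nn.processReportedReplica("datanodeA", ReplicaState.FINALIZED, true);
            System.out.println("clearInvalidState=" + expectedBehavior
                    + " stillCorrupt=" + nn.isCorrupt("datanodeA"));
        }
    }
}
```

With the flag set to false, the model keeps the replica in the corrupt map after the FINALIZED report, which matches the symptom above: the block stays missing even though a valid finalized replica exists.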



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

