hadoop-hdfs-issues mailing list archives

From "Gordon Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6505) Can not close file due to last block is marked as corrupt
Date Tue, 10 Jun 2014 03:16:02 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026085#comment-14026085 ]

Gordon Wang commented on HDFS-6505:
-----------------------------------

This issue causes the last block to be reported as missing and the file to be marked as corrupt. But actually, the data on the DataNode is correct.

I went through the code, and I think a safety check is missing when the NameNode receives a bad-block report from a DataNode.
See the following code snippet in the NameNode's BlockManager:
{code}
  public void findAndMarkBlockAsCorrupt(final ExtendedBlock blk,
      final DatanodeInfo dn, String storageID, String reason) throws IOException {
    assert namesystem.hasWriteLock();
    final BlockInfo storedBlock = getStoredBlock(blk.getLocalBlock());
    if (storedBlock == null) {
      // Check if the replica is in the blockMap, if not
      // ignore the request for now. This could happen when BlockScanner
      // thread of Datanode reports bad block before Block reports are sent
      // by the Datanode on startup
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: "
          + blk + " not found");
      return;
    }
    markBlockAsCorrupt(new BlockToMarkCorrupt(storedBlock, reason,
        Reason.CORRUPTION_REPORTED), dn, storageID);
  }
{code} 
We should compare the generation stamp of the reported block with that of the stored block. If the reported block carries an older generation stamp, it should not be marked as corrupt: the report can legitimately be stale when the client has since recovered the write pipeline, which bumps the generation stamp of the stored block.
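
For illustration, here is a minimal sketch of the kind of guard I have in mind; the exact condition and log message are illustrative, not a tested patch:
{code}
  public void findAndMarkBlockAsCorrupt(final ExtendedBlock blk,
      final DatanodeInfo dn, String storageID, String reason) throws IOException {
    assert namesystem.hasWriteLock();
    final BlockInfo storedBlock = getStoredBlock(blk.getLocalBlock());
    if (storedBlock == null) {
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk + " not found");
      return;
    }
    // Proposed guard: if the reported block carries an older generation
    // stamp than the stored block, the report is stale (pipeline recovery
    // has already bumped the stamp) and must not mark the replica corrupt.
    if (blk.getGenerationStamp() < storedBlock.getGenerationStamp()) {
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk
          + " has an older generation stamp than stored block " + storedBlock
          + ", ignoring stale corruption report");
      return;
    }
    markBlockAsCorrupt(new BlockToMarkCorrupt(storedBlock, reason,
        Reason.CORRUPTION_REPORTED), dn, storageID);
  }
{code}
Both ExtendedBlock and BlockInfo already expose getGenerationStamp(), so the comparison needs no new plumbing.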

> Can not close file due to last block is marked as corrupt
> ---------------------------------------------------------
>
>                 Key: HDFS-6505
>                 URL: https://issues.apache.org/jira/browse/HDFS-6505
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Gordon Wang
>
> After appending to a file, the client could not close it, because the NameNode could not complete the last block of the file. The under-construction state of the last block remained COMMITTED and never changed.
> The NameNode log was like this:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* checkFileProgress: blk_1073741920_13948{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.28.1.2:50010|RBW]]} has not reached minimal replication 1
> {code}
> After going through the NameNode log, I found an entry like this:
> {code}
> INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1073741920 added as corrupt on 172.28.1.2:50010 by sdw3/172.28.1.3 because client machine reported it
> {code}
> But actually, the last block was finished successfully on the DataNode, because I could find these entries in the DataNode log:
> {code}
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13808 (numBytes=50120352) to /172.28.1.3:50010
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /172.28.1.2:36860, dest: /172.28.1.2:50010, bytes: 51686616, op: HDFS_WRITE, cliID: libhdfs3_client_random_741511239_count_1_pid_215802_tid_140085714196576, offset: 0, srvID: DS-2074102060-172.28.1.2-50010-1401432768690, blockid: BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, duration: 189226453336
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
