hadoop-hdfs-dev mailing list archives

From "Gordon Wang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-6636) NameNode should remove block replica out from corrupted replica map when adding block under construction
Date Tue, 08 Jul 2014 02:36:34 GMT
Gordon Wang created HDFS-6636:
---------------------------------

             Summary: NameNode should remove block replica out from corrupted replica map when adding block under construction
                 Key: HDFS-6636
                 URL: https://issues.apache.org/jira/browse/HDFS-6636
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.2.0
            Reporter: Gordon Wang


In our test environment, we found that the NameNode does not handle an incremental block report
correctly when the block is under construction and its replica is marked as corrupt.
Here is our scenario.
* The block had 3 replicas by default, but because one datanode was down, only 2 replicas
were available. Call the two live datanodes DN1 and DN2.
* A client tried to append data to the block. During the append, something went wrong with
the pipeline, so the client performed pipeline recovery. After recovery, only DN1 remained
in the pipeline.
* For some unknown reason (possibly an IO error), DN2 got a checksum error when receiving block
data from DN1, and DN2 reported the replica on DN1 to the NameNode as a bad block. In fact,
the client was still appending data to the replica on DN1, and that replica was good.
* The NameNode marked the replica on DN1 as corrupt.
* When the client finished appending, DN1 checked the data in the replica and found it OK.
DN1 then finalized the replica and reported the block to the NameNode as a received block.
* The NameNode handled the incremental block report from DN1. Because the block was under
construction, the NameNode called addStoredBlockUnderConstruction in the block manager. But the
replica on DN1 was never removed from the corrupt replica map, so the number of live replicas
for the block was 0 and the number of corrupt replicas was 1.
* The client could not complete the file because the number of live replicas for the last block
was smaller than the minimum replica number.
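
The bookkeeping in the scenario above can be sketched with a toy model. This is not Hadoop code:
the class CorruptReplicaSketch, its two maps, and the methods markCorrupt, addUnderConstructionAsReported,
addUnderConstructionAsProposed, liveReplicas, and corruptReplicas are all hypothetical names chosen
for illustration, and MIN_REPLICATION = 1 is an assumed minimum. The sketch only contrasts the
reported behavior (the corrupt mark survives the finalized-replica report) with the behavior
the issue title asks for (the mark is cleared when the replica is added).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the replica bookkeeping in this report. Hypothetical; not the NameNode's code. */
public class CorruptReplicaSketch {
    /** Assumed minimum number of live replicas needed to complete a file. */
    static final int MIN_REPLICATION = 1;

    // block id -> datanodes that reported a replica of the block
    private final Map<String, Set<String>> stored = new HashMap<>();
    // block id -> datanodes whose replica is currently marked corrupt
    private final Map<String, Set<String>> corrupt = new HashMap<>();

    /** A datanode reported the replica on {@code datanode} as bad. */
    public void markCorrupt(String block, String datanode) {
        corrupt.computeIfAbsent(block, k -> new HashSet<>()).add(datanode);
    }

    /** Reported behavior: the replica is recorded, the corrupt map is left untouched. */
    public void addUnderConstructionAsReported(String block, String datanode) {
        stored.computeIfAbsent(block, k -> new HashSet<>()).add(datanode);
    }

    /** Proposed behavior: recording a finalized replica also clears its corrupt mark. */
    public void addUnderConstructionAsProposed(String block, String datanode) {
        addUnderConstructionAsReported(block, datanode);
        Set<String> marked = corrupt.get(block);
        if (marked != null) {
            marked.remove(datanode);
            if (marked.isEmpty()) {
                corrupt.remove(block);
            }
        }
    }

    /** Replicas that are stored and not marked corrupt. */
    public int liveReplicas(String block) {
        Set<String> marked = corrupt.getOrDefault(block, Collections.<String>emptySet());
        int live = 0;
        for (String dn : stored.getOrDefault(block, Collections.<String>emptySet())) {
            if (!marked.contains(dn)) {
                live++;
            }
        }
        return live;
    }

    /** Replicas currently marked corrupt. */
    public int corruptReplicas(String block) {
        return corrupt.getOrDefault(block, Collections.<String>emptySet()).size();
    }

    /** The client can complete the file only if enough live replicas exist. */
    public boolean canComplete(String block) {
        return liveReplicas(block) >= MIN_REPLICATION;
    }

    public static void main(String[] args) {
        CorruptReplicaSketch reported = new CorruptReplicaSketch();
        reported.markCorrupt("blk_1", "DN1");                    // DN2 reports DN1's replica as bad
        reported.addUnderConstructionAsReported("blk_1", "DN1"); // DN1 reports the finalized replica
        System.out.println("reported: live=" + reported.liveReplicas("blk_1")
                + " corrupt=" + reported.corruptReplicas("blk_1")
                + " canComplete=" + reported.canComplete("blk_1"));

        CorruptReplicaSketch proposed = new CorruptReplicaSketch();
        proposed.markCorrupt("blk_1", "DN1");
        proposed.addUnderConstructionAsProposed("blk_1", "DN1");
        System.out.println("proposed: live=" + proposed.liveReplicas("blk_1")
                + " corrupt=" + proposed.corruptReplicas("blk_1")
                + " canComplete=" + proposed.canComplete("blk_1"));
    }
}
```

In this model the reported path leaves 0 live replicas and 1 corrupt replica, so the file cannot
be completed, while the proposed path leaves 1 live replica and 0 corrupt replicas.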
 




--
This message was sent by Atlassian JIRA
(v6.2#6252)
