hadoop-hdfs-issues mailing list archives

From "ChenFolin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-4423) Checkpoint exception causes fatal damage to fsimage.
Date Mon, 21 Jan 2013 02:38:12 GMT
ChenFolin created HDFS-4423:
-------------------------------

             Summary: Checkpoint exception causes fatal damage to fsimage.
                 Key: HDFS-4423
                 URL: https://issues.apache.org/jira/browse/HDFS-4423
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 1.1.1, 1.0.4
         Environment: CentOS 6.2
            Reporter: ChenFolin
            Priority: Blocker


The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage (FSImage.java):
{code}
boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
...
latestNameSD.read();
    needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
    LOG.info("Image file of size " + imageSize + " loaded in " 
        + (FSNamesystem.now() - startTime)/1000 + " seconds.");
    
    // Load latest edits
    if (latestNameCheckpointTime > latestEditsCheckpointTime)
      // the image is already current, discard edits
      needToSave |= true;
    else // latestNameCheckpointTime == latestEditsCheckpointTime
      needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
    
    return needToSave;
  }
{code}
In the normal checkpoint flow, latestNameCheckpointTime equals latestEditsCheckpointTime,
so the "else" branch runs.
The problem arises when latestNameCheckpointTime > latestEditsCheckpointTime, which can happen as follows:
SecondaryNameNode starts a checkpoint,
...
NameNode: rollFSImage runs, and the NameNode shuts down after writing latestNameCheckpointTime
but before writing latestEditsCheckpointTime.
NameNode restart: because latestNameCheckpointTime > latestEditsCheckpointTime, needToSave is set
to true and loadFSEdits is skipped. The nsCount of "rootDir" (the number of files in the cluster)
is therefore never updated, because that update runs inside loadFSEdits via
"FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota()". "saveNamespace" then writes
the file count to the fsimage with its default value of "1".
On the next start, loadFSImage fails.
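
The interrupted roll described above can be illustrated with a small toy model. This is only a sketch, not Hadoop code: the class RollCrashSketch, its fields, and its methods are hypothetical stand-ins, and it assumes, as the report states, that the image checkpoint time is written before the edits checkpoint time.

```java
// Toy model, not Hadoop code: every name in this class is hypothetical.
class RollCrashSketch {
    long imageCheckpointTime;  // stands in for latestNameCheckpointTime
    long editsCheckpointTime;  // stands in for latestEditsCheckpointTime

    RollCrashSketch(long initialTime) {
        imageCheckpointTime = initialTime;
        editsCheckpointTime = initialTime;
    }

    /** Writes the new time to the image side first, then to the edits side. */
    void roll(long newTime, boolean crashBetweenWrites) {
        imageCheckpointTime = newTime;
        if (crashBetweenWrites) {
            return; // simulated shutdown: the edits side keeps the old time
        }
        editsCheckpointTime = newTime;
    }

    /** Reports which branch of loadFSImage the next startup would take. */
    String branchOnRestart() {
        return imageCheckpointTime > editsCheckpointTime ? "discard-edits" : "load-edits";
    }

    public static void main(String[] args) {
        RollCrashSketch clean = new RollCrashSketch(100L);
        clean.roll(200L, false);
        System.out.println(clean.branchOnRestart());   // prints "load-edits"

        RollCrashSketch crashed = new RollCrashSketch(100L);
        crashed.roll(200L, true);
        System.out.println(crashed.branchOnRestart()); // prints "discard-edits"
    }
}
```

Only the crashed run ends with the image time ahead of the edits time, which is the state that sends the next startup into the "if" branch.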

The following change might fix it:
{code}
boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
...
latestNameSD.read();
    needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
    LOG.info("Image file of size " + imageSize + " loaded in " 
        + (FSNamesystem.now() - startTime)/1000 + " seconds.");
    
    // Load latest edits
    if (latestNameCheckpointTime > latestEditsCheckpointTime){
      // the image is already current, discard edits
      needToSave |= true;
      FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota();
    }
    else // latestNameCheckpointTime == latestEditsCheckpointTime
      needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
    
    return needToSave;
  }
{code}
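
To make the effect of the extra call concrete, here is a minimal toy model of the two branches. It is a sketch under stated assumptions, not the real FSImage logic: NsCountSketch and all of its members are hypothetical, the inode count is reduced to a single number, and the count update is assumed to run only inside the edits-loading path, as the report describes.

```java
// Toy model, not Hadoop code: class, fields, and methods are hypothetical.
class NsCountSketch {
    static final long DEFAULT_NS_COUNT = 1L; // default value named in the report

    int loadedInodes;                        // inodes read from the image
    long rootNsCount = DEFAULT_NS_COUNT;     // cached count that a save would persist

    void loadImage(int inodes) {
        loadedInodes = inodes;
    }

    /** Stand-in for dir.updateCountForINodeWithQuota(). */
    void updateCount() {
        rootNsCount = loadedInodes;
    }

    /** Stand-in for loadFSEdits: per the report, the count update runs in here. */
    void loadEdits() {
        updateCount();
    }

    /** Returns the count that a following saveNamespace would write. */
    static long startUp(int inodes, long imageTime, long editsTime, boolean patched) {
        NsCountSketch ns = new NsCountSketch();
        ns.loadImage(inodes);
        if (imageTime > editsTime) {
            // image is newer: edits are discarded
            if (patched) {
                ns.updateCount(); // the extra call from the proposed change
            }
        } else {
            ns.loadEdits();
        }
        return ns.rootNsCount;
    }

    public static void main(String[] args) {
        System.out.println(startUp(5000, 200L, 200L, false)); // normal flow: 5000
        System.out.println(startUp(5000, 200L, 100L, false)); // interrupted roll, original code: 1
        System.out.println(startUp(5000, 200L, 100L, true));  // interrupted roll, proposed change: 5000
    }
}
```

In this model the original branch leaves the count at its default of 1 after an interrupted roll, while the patched branch restores the real count before anything is saved.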


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
