hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9406) FSImage corruption after taking snapshot
Date Fri, 29 Jan 2016 19:01:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123987#comment-15123987
] 

Jing Zhao commented on HDFS-9406:
---------------------------------

Thanks for the patch, Yongjun! The patch looks good to me. However, it looks like we need to fix {{TestINodeFile#testClearBlocks}}
because of the new {{clearBlocks}} logic.
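
For illustration only, a rough sketch of the kind of adjustment {{testClearBlocks}} might need. It assumes (not verified against the patch) that the new {{clearBlocks()}} can leave an empty block array rather than a null one, and that the existing {{createINodeFiles}} helper in {{TestINodeFile}} is still available:
{code}
  @Test
  public void testClearBlocks() {
    // createINodeFiles is assumed to be the existing helper in TestINodeFile
    INodeFile toBeCleared = createINodeFiles(1, "toBeCleared")[0];
    toBeCleared.clearBlocks();
    // assumption: the patched clearBlocks() may leave an empty array rather
    // than null, so accept either instead of asserting null directly
    Assert.assertTrue(toBeCleared.getBlocks() == null
        || toBeCleared.getBlocks().length == 0);
  }
{code}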

In the meantime, can we also add a test for the case you mentioned in this jira? Although
this case may not cause fsimage corruption, we can check that the file is eventually deleted
from the inodeMap.
{code}
  @Test
  public void testRenameAndDelete() throws IOException {
    final Path foo = new Path("/foo");
    final Path x = new Path(foo, "x");
    final Path y = new Path(foo, "y");
    final Path trash = new Path("/trash");
    fs.mkdirs(x);
    fs.mkdirs(y);
    fs.mkdirs(trash);
    fs.allowSnapshot(foo);
    // 1. create snapshot s0
    fs.createSnapshot(foo, "s0");
    // 2. create file /foo/x/bar
    final Path file = new Path(x, "bar");
    DFSTestUtil.createFile(fs, file, BLOCKSIZE, (short) 1, 0L);
    final long fileId = fsdir.getINode4Write(file.toString()).getId();
    // 3. move file into /foo/y
    final Path newFile = new Path(y, "bar");
    fs.rename(file, newFile);
    // 4. create snapshot s1
    fs.createSnapshot(foo, "s1");
    // 5. move /foo/y to /trash
    final Path deletedY = new Path(trash, "y");
    fs.rename(y, deletedY);
    // 6. create snapshot s2
    fs.createSnapshot(foo, "s2");
    // 7. delete /trash/y
    fs.delete(deletedY, true);
    // 8. delete snapshot s1
    fs.deleteSnapshot(foo, "s1");
    // make sure bar has been cleaned
    Assert.assertNull(fsdir.getInode(fileId));
  }
{code}
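
For context, the snippet above assumes the usual snapshot test fixture ({{fs}}, {{fsdir}}, {{BLOCKSIZE}}) from the enclosing test class, in the style of {{TestRenameWithSnapshots}}. A minimal sketch of such a fixture, with the field names taken as assumptions rather than from the patch:
{code}
  private static final long BLOCKSIZE = 1024;

  private MiniDFSCluster cluster;
  private FSNamesystem fsn;
  private FSDirectory fsdir;
  private DistributedFileSystem fs;

  @Before
  public void setUp() throws Exception {
    Configuration conf = new Configuration();
    // single-datanode mini cluster is enough for the rename/snapshot test
    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    cluster.waitActive();
    fsn = cluster.getNamesystem();
    fsdir = fsn.getFSDirectory();
    fs = cluster.getFileSystem();
  }

  @After
  public void tearDown() throws Exception {
    if (cluster != null) {
      cluster.shutdown();
    }
  }
{code}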

> FSImage corruption after taking snapshot
> ----------------------------------------
>
>                 Key: HDFS-9406
>                 URL: https://issues.apache.org/jira/browse/HDFS-9406
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: CentOS 6 amd64, CDH 5.4.4-1
> 2xCPU: Intel(R) Xeon(R) CPU E5-2640 v3
> Memory: 32GB
> Namenode blocks: ~700_000 blocks, no HA setup
>            Reporter: Stanislav Antic
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9406.001.patch, HDFS-9406.002.patch
>
>
> FSImage corruption happened after HDFS snapshots were taken. The cluster was not in use
> at that time.
> When the namenode restarted, it reported a NullPointerException:
> {code}
> 15/11/07 10:01:15 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/fsimage_checker_5857/fsimage/current
> 15/11/07 10:01:15 INFO namenode.FSImage: No edit log streams selected.
> 15/11/07 10:01:18 INFO namenode.FSImageFormatPBINode: Loading 1370277 INodes.
> 15/11/07 10:01:27 ERROR namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addChild(INodeDirectory.java:531)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.addToParent(FSImageFormatPBINode.java:252)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:202)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:261)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:929)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:913)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:732)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:668)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:643)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
> 15/11/07 10:01:27 INFO util.ExitUtil: Exiting with status 1
> {code}
> Corruption happened after "07.11.2015 00:15", and after that time ~9300 blocks were invalidated that should not have been.
> After recovering the FSImage I discovered that ~9300 blocks were missing.
> -I also attached the namenode log from before and after the corruption happened.-



