hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9406) FSImage may get corrupted after deleting snapshot
Date Mon, 22 Jan 2018 05:40:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333891#comment-16333891 ]

Yongjun Zhang commented on HDFS-9406:
-------------------------------------

Hi [~jingzhao],

We are seeing a similar problem even with the fix for HDFS-9406 applied. Unfortunately, we don't have
a good fsimage + edit logs to replay to reproduce the corruption. I wonder if there are other cases
like the one you described below:

{quote}

However, if the WithName node is the last in the rename list and the DstRef node has already
been deleted (i.e., the above failure case), we should fall back to the normal case and still
clean the created list of the prior snapshot.

{quote}

I would really appreciate it if you have more insight to share.

Thanks.

> FSImage may get corrupted after deleting snapshot
> -------------------------------------------------
>
>                 Key: HDFS-9406
>                 URL: https://issues.apache.org/jira/browse/HDFS-9406
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: CentOS 6 amd64, CDH 5.4.4-1
> 2xCPU: Intel(R) Xeon(R) CPU E5-2640 v3
> Memory: 32GB
> Namenode blocks: ~700_000 blocks, no HA setup
>            Reporter: Stanislav Antic
>            Assignee: Yongjun Zhang
>            Priority: Major
>             Fix For: 2.8.0, 2.7.3, 3.0.0-alpha1
>
>         Attachments: HDFS-9406.001.patch, HDFS-9406.002.patch, HDFS-9406.003.patch, HDFS-9406.branch-2.7.patch
>
>
> FSImage corruption happened after HDFS snapshots were taken. The cluster was not in use
> at that time.
> When the namenode restarted, it reported a NullPointerException:
> {code}
> 15/11/07 10:01:15 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/fsimage_checker_5857/fsimage/current
> 15/11/07 10:01:15 INFO namenode.FSImage: No edit log streams selected.
> 15/11/07 10:01:18 INFO namenode.FSImageFormatPBINode: Loading 1370277 INodes.
> 15/11/07 10:01:27 ERROR namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addChild(INodeDirectory.java:531)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.addToParent(FSImageFormatPBINode.java:252)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:202)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:261)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:929)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:913)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:732)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:668)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:643)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
> 15/11/07 10:01:27 INFO util.ExitUtil: Exiting with status 1
> {code}
> Corruption happened after "07.11.2015 00:15", and after that time ~9300 blocks were invalidated
> that shouldn't have been.
> After recovering the FSImage I discovered that ~9300 blocks were missing.
> -I also attached log of namenode before and after corruption happened.-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

