Date: Wed, 16 Sep 2015 00:02:45 +0000 (UTC)
From: "Jing Zhao (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-9052) deleteSnapshot runs into AssertionError

    [ https://issues.apache.org/jira/browse/HDFS-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746534#comment-14746534 ]

Jing Zhao commented on HDFS-9052:
---------------------------------

Hi Alex, the issue here is not really about {{computeDiffBetweenSnapshots}} or deleting a snapshot; those are just operations that can expose an already corrupted snapshot diff list. Let me provide some context on snapshot diff lists. In the current snapshot implementation, each snapshot's directory diff records newly created files in a created list and deleted files in a deleted list.
So let's suppose we take a snapshot s1 and then delete the file "useraction.log.crypto". Since the file existed before snapshot s1 was taken, we have:
{noformat}
s1: deleted list: [INodeFile_1(useraction.log.crypto)]
{noformat}
Now we take another snapshot s2 and then create a new log file with the same name. s2's diff list looks like:
{noformat}
s2: created list: [INodeFile_2(useraction.log.crypto)]
{noformat}
We then take snapshot s3 and delete the log file. Now we have:
{noformat}
s1: created list: [], deleted list: [INodeFile_1(useraction.log.crypto)]
s2: created list: [INodeFile_2(useraction.log.crypto)], deleted list: []
s3: created list: [], deleted list: [INodeFile_2(useraction.log.crypto)]
{noformat}
Let's say we now delete s3. The diff lists of s2 and s3 are combined, and because INodeFile_2(useraction.log.crypto) was created after taking s2, the create/delete pair cancels out. The correct diff lists should look like:
{noformat}
s1: created list: [], deleted list: [INodeFile_1(useraction.log.crypto)]
s2: created list: [], deleted list: []
{noformat}
But before HDFS-6908 we had a bug that caused INodeFile_2(useraction.log.crypto) to remain in s2's deleted list. Then we have:
{noformat}
s1: deleted list: [INodeFile_1(useraction.log.crypto)]
s2: deleted list: [INodeFile_2(useraction.log.crypto)]
{noformat}
This is a corrupted diff list state. Whether we compute the snapshot diff between s1 and the current state, or delete snapshot s2, any operation that has to combine s1 and s2 will hit the AssertionError, since the combine tries to insert a second element named useraction.log.crypto into a deleted list that already contains one. Because the corruption has been persisted in your fsimage, fixing the issue may require a patched jar that removes INodeFile_2(useraction.log.crypto) from s2's deleted list while loading the fsimage.
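The bookkeeping described above can be sketched in a few lines of Java. This is a hypothetical, heavily simplified model of the real {{org.apache.hadoop.hdfs.util.Diff}} (which keys INodes by name and has a different API); the class and method names here are illustrative only, but the cancel-vs-insert decision in {{combinePosterior}} mirrors the behavior discussed:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a snapshot directory diff: a created list and a
// deleted list of file names. Not the real HDFS Diff class.
public class SnapshotDiffSketch {
    static class Diff {
        final List<String> created = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();

        // Combine a posterior (later) snapshot's diff into this one, as
        // happens when the later snapshot is deleted.
        void combinePosterior(Diff posterior) {
            created.addAll(posterior.created);
            for (String d : posterior.deleted) {
                // If this diff created the file, the create/delete pair
                // cancels out (the post-HDFS-6908 behavior). The pre-fix
                // bug effectively skipped this cancellation.
                if (created.remove(d)) {
                    continue;
                }
                // On a corrupted state, a second element with the same
                // name reaches the deleted list, producing the error
                // reported in this issue.
                if (deleted.contains(d)) {
                    throw new AssertionError(
                        "Element already exists: element=" + d
                        + ", DELETED=" + deleted);
                }
                deleted.add(d);
            }
        }
    }

    public static void main(String[] args) {
        // s2 created the new useraction.log.crypto; s3 deleted it.
        Diff s2 = new Diff();
        s2.created.add("useraction.log.crypto");
        Diff s3 = new Diff();
        s3.deleted.add("useraction.log.crypto");

        // Deleting s3 combines its diff into s2; the pair cancels, so
        // both of s2's lists end up empty.
        s2.combinePosterior(s3);
        System.out.println("s2 created=" + s2.created
            + " deleted=" + s2.deleted);
    }
}
```

Running the corrupted case instead (both s1 and s2 holding a deleted entry for the same name) would throw the AssertionError, matching the stack trace below through {{Diff.combinePosterior}} and {{Diff.insert}}.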
> deleteSnapshot runs into AssertionError
> ---------------------------------------
>
>                 Key: HDFS-9052
>                 URL: https://issues.apache.org/jira/browse/HDFS-9052
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Alex Ivanov
>
> CDH 5.0.5 upgraded from CDH 5.0.0 (Hadoop 2.3)
> Upon deleting a snapshot, we run into the following assertion error. The scenario is as follows:
> 1. We have a program that deletes snapshots in reverse chronological order.
> 2. The program deletes a couple of hundred snapshots successfully but then runs into the following exception:
> java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
> 3. There seems to be an issue with that snapshot which causes a file that normally gets overwritten in every snapshot to be added to the SnapshotDiff delete queue twice.
> 4. Once deleteSnapshot is run on the problematic snapshot, if the Namenode is restarted, it cannot be started again until the transaction is removed from the EditLog.
> 5. Sometimes the bad snapshot can be deleted, but the prior snapshot seems to "inherit" the same issue.
> 6. The error below is from the Namenode starting up, when the DELETE_SNAPSHOT transaction is replayed from the EditLog.
> 2015-09-01 22:59:59,140 INFO [IPC Server handler 0 on 8022] BlockStateChange (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap updated: 10.52.209.77:1004 is added to blk_1080833995_7093259{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-16de62e5-f6e2-4ea7-aad9-f8567bded7d7:NORMAL|FINALIZED]]} size 0
> 2015-09-01 22:59:59,140 INFO [IPC Server handler 0 on 8022] BlockStateChange (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap updated: 10.52.209.77:1004 is added to blk_1080833996_7093260{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1def2b07-d87f-49dd-b14f-ef230342088d:NORMAL|FINALIZED]]} size 0
> 2015-09-01 22:59:59,141 ERROR [IPC Server handler 0 on 8022] namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(232)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/data/tenants/pdx-svt.baseline84/wddata, snapshotName=s2015022614_maintainer_soft_del, RpcClientId=7942c957-a7cf-44c1-880d-6eea690e1b19, RpcCallId=1]
> 2015-09-01 22:59:59,141 ERROR [IPC Server handler 0 on 8022] namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(232)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/data/tenants/pdx-svt.baseline84/wddata, snapshotName=s2015022614_maintainer_soft_del, RpcClientId=7942c957-a7cf-44c1-880d-6eea690e1b19, RpcCallId=1]
> java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
>         at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
>         at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
>         at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDeletedINode(DirectoryWithSnapshotFeature.java:531)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:823)
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:684)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:830)
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.removeSnapshot(INodeDirectorySnapshottable.java:341)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:238)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:667)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:224)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:802)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:783)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)