From: "Wei-Chiu Chuang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Fri, 9 Dec 2016 19:47:59 +0000 (UTC)
Subject: [jira] [Updated] (HDFS-11225) NameNode crashed because deleteSnapshot held FSNamesystem lock too long

[ https://issues.apache.org/jira/browse/HDFS-11225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-11225:
-----------------------------------
Description:

The deleteSnapshot operation is synchronous. In certain situations it can hold the FSNamesystem lock for too long, bringing almost every NameNode operation to a halt.

We observed one incident where it took so long that ZKFC believed the NameNode was down. All other IPC threads were waiting to acquire the FSNamesystem lock. This particular deleteSnapshot took ~70 seconds. ZKFC has a connection timeout of 45 seconds by default; if all IPC handler threads are waiting on the FSNamesystem lock and the NameNode cannot accept new incoming connections, ZKFC times out and advances the epoch, so the NameNode loses its active role and subsequently fails.
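To illustrate the failure mode (a minimal self-contained sketch with hypothetical names, not actual HDFS code): one long operation holding the write side of a shared lock starves every other handler thread, which is exactly what the health probes then observe.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the starvation pattern: one long write-locked
// operation (standing in for deleteSnapshot) blocks every other handler,
// so anything that needs a handler thread starts timing out.
public class LockStarvationSketch {
    private static final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

    static void deleteSnapshot() throws InterruptedException {
        fsLock.writeLock().lock();
        try {
            // Stand-in for ~70 seconds of subtree cleanup under the lock.
            TimeUnit.SECONDS.sleep(70);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    static void handlerOp(int id) {
        fsLock.readLock().lock();   // every other RPC waits here
        try {
            System.out.println("handler " + id + " ran");
        } finally {
            fsLock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        new Thread(() -> {
            try { deleteSnapshot(); } catch (InterruptedException ignored) { }
        }).start();
        Thread.sleep(100);  // let the long operation grab the lock first
        for (int i = 0; i < 4; i++) {
            final int id = i;
            new Thread(() -> handlerOp(id)).start();  // these block for ~70s
        }
    }
}
{code}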
Relevant log:
{noformat}
Thread 154 (IPC Server handler 86 on 8020):
  State: RUNNABLE
  Blocked count: 2753455
  Waited count: 89201773
  Stack:
    org.apache.hadoop.hdfs.server.namenode.INode$BlocksMapUpdateInfo.addDeleteBlock(INode.java:879)
    org.apache.hadoop.hdfs.server.namenode.INodeFile.destroyAndCollectBlocks(INodeFile.java:508)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
    org.apache.hadoop.hdfs.server.namenode.INodeReference.destroyAndCollectBlocks(INodeReference.java:339)
    org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.destroyAndCollectBlocks(INodeReference.java:606)
    org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyDeletedList(DirectoryWithSnapshotFeature.java:119)
    org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$400(DirectoryWithSnapshotFeature.java:61)
    org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:319)
    org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:167)
    org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:83)
    org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:745)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
    org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:747)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
    org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:789)
{noformat}
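The repeated destroyAndCollectBlocks frames show why the hold time scales with the size of the deleted subtree: the deletion recursively visits every inode and collects its blocks while the lock is held. A hedged sketch of that traversal pattern, using hypothetical Inode/block types rather than the real HDFS classes:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of recursive snapshot-subtree destruction: the cost is
// O(number of inodes + blocks), all incurred while the namesystem lock is held.
public class SubtreeCleanupSketch {
    static class Inode {
        List<Inode> children = new ArrayList<>();
        List<Long> blockIds = new ArrayList<>();
    }

    // Analogous in spirit to destroyAndCollectBlocks: visit every descendant
    // and accumulate its blocks for later deletion on the DataNodes.
    static void destroyAndCollectBlocks(Inode node, List<Long> collected) {
        for (Inode child : node.children) {
            destroyAndCollectBlocks(child, collected);
        }
        collected.addAll(node.blockIds);
        node.children.clear();
        node.blockIds.clear();
    }
}
{code}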
After ZKFC determined that the NameNode was down and advanced the epoch, the NameNode finished deleting the snapshot and sent the edit to the JournalNodes, but the write was rejected because the epoch had already been advanced. See the following stack trace:
{noformat}
10.0.16.21:8485: IPC's epoch 17 is less than the last promised epoch 18
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:8507)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1469)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.deleteSnapshot(AuthorizationProviderProxyClientProtocol.java:717)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1061)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
{noformat}
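The rejection above is the quorum journal's fencing rule: each writer carries an epoch, and a JournalNode refuses writes whose epoch is lower than the highest epoch it has promised. A minimal sketch of that check, with hypothetical names (the real logic lives in Journal.checkRequest):

{code:java}
import java.io.IOException;

// Hypothetical sketch of quorum-journal epoch fencing: once a new writer has
// been promised a higher epoch, edits from the old writer are rejected.
public class EpochFencingSketch {
    private long lastPromisedEpoch = 18;  // advanced by the ZKFC-driven failover

    void checkWriteRequest(long ipcEpoch) throws IOException {
        if (ipcEpoch < lastPromisedEpoch) {
            throw new IOException("IPC's epoch " + ipcEpoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
    }
}
{code}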
Finally, the NameNode shut itself down because it had too many quorum errors.

Setting priority to Critical because this resulted in a NameNode crash.

We think deleteSnapshot should be made asynchronous: delete the root of the snapshot directory, then put the rest of the work into an asynchronous thread. Credit: [~yzhangal] for proposing this idea.
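A rough sketch of the proposed shape (hypothetical, not a patch): detach the snapshot root while holding the lock so the namespace change stays atomic, then let a background thread do the expensive subtree cleanup.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of an asynchronous deleteSnapshot: only the cheap
// detach happens under the namesystem lock; the expensive recursive cleanup
// runs on a background thread, re-acquiring the lock in small batches so
// other RPCs can make progress.
public class AsyncDeleteSnapshotSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private final ExecutorService cleaner = Executors.newSingleThreadExecutor();

    void deleteSnapshot(String snapshotRoot, String snapshotName) {
        Object detached;
        fsLock.writeLock().lock();
        try {
            // Cheap: unlink the snapshot root from the namespace.
            detached = detachSnapshotRoot(snapshotRoot, snapshotName);
        } finally {
            fsLock.writeLock().unlock();
        }
        // Expensive subtree walk happens off the RPC handler thread.
        cleaner.submit(() -> destroySubtreeInBatches(detached));
    }

    // Both helpers are placeholders for illustration only.
    private Object detachSnapshotRoot(String root, String name) { return new Object(); }

    private void destroySubtreeInBatches(Object subtree) {
        // Re-acquire the lock per batch so handlers are never starved for long.
    }
}
{code}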
> NameNode crashed because deleteSnapshot held FSNamesystem lock too long
> -----------------------------------------------------------------------
>
>                 Key: HDFS-11225
>                 URL: https://issues.apache.org/jira/browse/HDFS-11225
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0
>         Environment: CDH5.8.2, HA
>            Reporter: Wei-Chiu Chuang
>            Priority: Critical
>              Labels: high-availability
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)