hadoop-hdfs-issues mailing list archives

From "Manoj Govindassamy (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-11225) NameNode crashed because deleteSnapshot held FSNamesystem lock too long
Date Fri, 14 Apr 2017 22:50:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969629#comment-15969629
] 

Manoj Govindassamy edited comment on HDFS-11225 at 4/14/17 10:49 PM:
---------------------------------------------------------------------

*Problem:*
* Unlike {{INodeDirectory}}, {{DirectoryWithSnapshotFeature}} doesn't maintain its children
in a plain list. Instead, there is a {{DirectoryDiffList}}, a list of {{DirectoryDiff}} entries.
* On every new snapshot, {{DirectoryWithSnapshotFeature}} appends a new entry to its diff list,
and all subsequent file creations/deletions are recorded against the last taken snapshot.
* So, the snapshot diff list which {{DirectoryWithSnapshotFeature}} maintains is really a delta
of file creations and deletions since the previous snapshot. This is a deliberate design choice,
so as to keep snapshot creation a constant-time operation.
* A snapshot deletion needs to visit all children files of the snapshot to reclaim their
blocks, and {{DirectoryWithSnapshotFeature#DirectoryDiff#getChildrenList()}} is invoked to
get that list.
* To get the children list for any snapshot {{Sx}} under a directory, all the snapshot diff
records after {{Sx}} are combined one by one sequentially and reverse-applied to the latest
current children list of the directory.
* So, listing children under a snapshot {{Sx}} directory is on the order of (#snapshots after
{{Sx}} * #file diffs in each of those snapshots). With thousands of snapshots and with hundreds
of thousands of files, these listing operations can easily consume tens of seconds.
* On top of all this, these operations are done by a single thread, one directory at a time,
in a recursive fashion. In my testing, I have seen snapshot deletion take 45+ seconds on a
fairly unloaded NN.
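The reverse-diff walk described above can be sketched roughly as follows. This is a minimal, hypothetical model (the class, field, and method names here are illustrative, not the actual HDFS {{DirectoryWithSnapshotFeature}} code, which also handles INode references and posterior diffs); it assumes each diff entry records what was created and deleted after its snapshot was taken:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of getChildrenList()'s cost model: to list children
// as of snapshot index sx, start from the directory's current children and
// undo every diff recorded at or after sx, one by one. Cost is
// O(#snapshots after sx * #entries per diff).
class SnapshotDiffSketch {
    // Each diff records creations/deletions since the previous snapshot.
    static class ChildrenDiff {
        final List<String> created = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();
    }

    static List<String> childrenAt(List<String> current, List<ChildrenDiff> diffs, int sx) {
        List<String> children = new ArrayList<>(current);
        // Walk backwards from the newest diff, undoing each one: remove
        // what it created, restore what it deleted.
        for (int i = diffs.size() - 1; i >= sx; i--) {
            ChildrenDiff d = diffs.get(i);
            children.removeAll(d.created);
            children.addAll(d.deleted);
        }
        return children;
    }
}
```

The sketch makes the cost visible: every snapshot after {{Sx}} contributes a full pass over its diff entries, so the work grows multiplicatively with snapshot count and churn.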



> NameNode crashed because deleteSnapshot held FSNamesystem lock too long
> -----------------------------------------------------------------------
>
>                 Key: HDFS-11225
>                 URL: https://issues.apache.org/jira/browse/HDFS-11225
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0
>         Environment: CDH5.8.2, HA
>            Reporter: Wei-Chiu Chuang
>            Assignee: Manoj Govindassamy
>            Priority: Critical
>              Labels: high-availability
>
> The deleteSnapshot operation is synchronous. In certain situations this operation may
hold the FSNamesystem lock for too long, bringing almost every NameNode operation to a halt.
> We have observed one incident where it took so long that ZKFC believed the NameNode was
down. All other IPC threads were waiting to acquire the FSNamesystem lock. This particular
deleteSnapshot took ~70 seconds. ZKFC has a connection timeout of 45 seconds by default; if
all IPC threads are waiting on the FSNamesystem lock and the NameNode can't accept new incoming
connections, ZKFC times out and advances the epoch, so the NameNode loses its active role and
then fails.
> Relevant log:
> {noformat}
> Thread 154 (IPC Server handler 86 on 8020):
>   State: RUNNABLE
>   Blocked count: 2753455
>   Waited count: 89201773
>   Stack:
>     org.apache.hadoop.hdfs.server.namenode.INode$BlocksMapUpdateInfo.addDeleteBlock(INode.java:879)
>     org.apache.hadoop.hdfs.server.namenode.INodeFile.destroyAndCollectBlocks(INodeFile.java:508)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
>     org.apache.hadoop.hdfs.server.namenode.INodeReference.destroyAndCollectBlocks(INodeReference.java:339)
>     org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.destroyAndCollectBlocks(INodeReference.java:606)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyDeletedList(DirectoryWithSnapshotFeature.java:119)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$400(DirectoryWithSnapshotFeature.java:61)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:319)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:167)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:83)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:745)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
>     org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:747)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
>     org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:789)
> {noformat}
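The repeated {{INodeDirectory.destroyAndCollectBlocks}} frames in the trace above come from a depth-first subtree walk on the handler thread. A hypothetical sketch of that shape (the names mirror the trace, but the bodies are illustrative only, not the HDFS implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the recursive walk in the stack trace: one frame
// per directory level, accumulating every block to delete before
// returning. For a deep, wide tree this runs for its whole duration on a
// single thread, under the namesystem lock.
class SubtreeWalkSketch {
    static class Node {
        final List<Node> children = new ArrayList<>();
        final List<Long> blocks = new ArrayList<>();
    }

    static void destroyAndCollectBlocks(Node node, List<Long> collected) {
        collected.addAll(node.blocks);
        for (Node child : node.children) {
            destroyAndCollectBlocks(child, collected); // recursion, as in the trace
        }
    }
}
```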
> After ZKFC determined the NameNode was down and advanced the epoch, the NN finished deleting
the snapshot and sent the edit to the JournalNodes, but it was rejected because the epoch had
been advanced. See the following stacktrace:
> {noformat}
> 10.0.16.21:8485: IPC's epoch 17 is less than the last promised epoch 18
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
>         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:8507)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1469)
>         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.deleteSnapshot(AuthorizationProviderProxyClientProtocol.java:717)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1061)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {noformat}
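The "IPC's epoch 17 is less than the last promised epoch 18" rejection above is the quorum journal's fencing rule: a JournalNode remembers the highest epoch it has promised, and refuses writes from any writer with a lower one. A minimal sketch of that rule (illustrative names, not the {{Journal}} API):

```java
// Hypothetical sketch of quorum-journal fencing. A new active NN
// negotiates a higher epoch with the JournalNodes; any edit arriving from
// a stale writer (e.g. the old active NN that just finished a long
// deleteSnapshot) is then rejected.
class EpochFenceSketch {
    private long lastPromisedEpoch = 0;

    // Record the highest epoch ever promised to a writer.
    void promise(long epoch) {
        if (epoch > lastPromisedEpoch) {
            lastPromisedEpoch = epoch;
        }
    }

    // Accept an edit only from a writer at the promised epoch or newer.
    boolean acceptWrite(long writerEpoch) {
        return writerEpoch >= lastPromisedEpoch;
    }
}
```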
> Finally, the NameNode shut itself down because it had too many quorum errors.
> Setting priority to critical because it resulted in a NameNode crash.
> We think deleteSnapshot should be made asynchronous. It should delete the root of the
snapshot directory, and then put the rest of the work into an asynchronous thread. Credit:
[~yzhangal] for proposing this idea.
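The proposed asynchronous shape could look roughly like this. This is a hypothetical sketch only (the class, the detach step, and the example path are all illustrative, not the HDFS implementation): hold the FSNamesystem lock just long enough to unlink the snapshot root, then reclaim the subtree's blocks on a background thread.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of an async deleteSnapshot: the cheap step runs
// under the lock, the expensive subtree walk runs off it.
class AsyncSnapshotDeleteSketch {
    private final ExecutorService reclaimer = Executors.newSingleThreadExecutor();
    private final Object fsLock = new Object(); // stand-in for the FSNamesystem lock
    final List<String> reclaimed = new CopyOnWriteArrayList<>();

    void deleteSnapshot(String snapshotRoot) {
        String detached;
        synchronized (fsLock) {
            // Cheap step under the lock: detach the snapshot root so the
            // snapshot is immediately invisible; other RPCs block only
            // for this instant.
            detached = snapshotRoot;
        }
        // Expensive step off the lock: walk the detached subtree and
        // reclaim its blocks in the background.
        reclaimer.submit(() -> reclaimed.add(detached));
    }

    void shutdownAndWait() {
        reclaimer.shutdown();
        try {
            reclaimer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The open design question such a change would have to answer is how the background reclamation interacts with edits logged for the deletion, and with a failover occurring mid-reclaim.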



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


