hadoop-hdfs-issues mailing list archives

From "Stephen O'Donnell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11215) Log Namenode thread dump on unexpected exits
Date Fri, 16 Dec 2016 16:28:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15754869#comment-15754869 ]

Stephen O'Donnell commented on HDFS-11215:
------------------------------------------

Hey [~apurtell] - I wasn't particularly thinking of front-line operators for this 'feature'. My thinking was more that if you are getting repeated crashes and the cause is not clear, this could be something to hand to your support vendor for them to dig into.

> Log Namenode thread dump on unexpected exits
> --------------------------------------------
>
>                 Key: HDFS-11215
>                 URL: https://issues.apache.org/jira/browse/HDFS-11215
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Stephen O'Donnell
>
> With HA namenodes, we reasonably frequently see a namenode exit because a quorum of journals did not respond within the 20 second timeout, for example:
> {code}
> 2016-11-29 01:43:22,969  WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
> 2016-11-29 01:43:23,954  FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.x.x.x:8485, 10.x.x.x:8485, 10.x.x.x:8485], stream=QuorumOutputStream starting at txid 762756576))
> 2016-11-29 01:43:23,954  FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.x.x.x:8485, 10.x.x.x:8485, 10.x.x.x:8485], stream=QuorumOutputStream starting at txid 762756576))
> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
> 	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
> 	at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> 	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> 	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> 	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> 	at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> 	at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> 	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2687)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)
> 	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> 2016-11-29 01:43:23,955  WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 762756576
> 2016-11-29 01:43:23,987  INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> 2016-11-29 01:43:24,003  INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> {code}
> When this happens, the cause is often a mystery. E.g. was it a network issue, was the thread blocked waiting on a response from a KDC (due to no kdc_timeout being set - see the example below), or was it a slow disk (on the Journal nodes, or even log4j writing on the Namenode), etc.
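> (As an aside on the KDC case: the JDK's Kerberos client honours a kdc_timeout setting in krb5.conf, and without it a login attempt can block far longer than the 20 second journal timeout. A hedged, illustrative snippet - the values and the milliseconds interpretation are assumptions about the JDK client, not something taken from this issue:)
> {code}
> # /etc/krb5.conf - illustrative values only
> [libdefaults]
>   # Cap how long one request to a KDC may block; the JDK Kerberos
>   # client has historically read a bare number here as milliseconds.
>   kdc_timeout = 5000
>   # Give up on an unresponsive KDC after a couple of attempts.
>   max_retries = 2
> {code}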
> I wonder if it would make sense to log a thread dump to the Namenode's log just before exiting, as this may give some clues about what caused the lack of response from the Journals?
> There are at least two ways to do this:
> 1. In JournalSet.java, it seems to log the fatal error and then call terminate - we could log a thread dump there just before calling terminate.
> 2. We could change the terminate method in ExitUtil.java to log a thread dump for any terminate where the exit status is non-zero, as there may be other scenarios where this could be useful.
> There is existing code to log a thread dump in ReflectionUtils, so it could be reused here; a rough sketch of the general idea follows.
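> A minimal, self-contained sketch - this is not the Hadoop code, and the ThreadDumpOnExit class, its terminate method and the plain stderr output are all illustrative stand-ins (ExitUtil would log via log4j, and could call the existing ReflectionUtils thread dump helper instead of the hand-rolled loop below):
> {code}
> import java.util.Map;
>
> public class ThreadDumpOnExit {
>
>   /** Print a stack trace for every live thread, then exit. */
>   static void terminate(int status) {
>     if (status != 0) {
>       // Only dump on abnormal exits; a clean shutdown stays quiet.
>       for (Map.Entry<Thread, StackTraceElement[]> entry
>           : Thread.getAllStackTraces().entrySet()) {
>         Thread t = entry.getKey();
>         System.err.println("\"" + t.getName() + "\" state=" + t.getState());
>         for (StackTraceElement frame : entry.getValue()) {
>           System.err.println("    at " + frame);
>         }
>       }
>     }
>     System.exit(status);
>   }
>
>   public static void main(String[] args) {
>     terminate(1); // dumps all thread stacks, then exits with status 1
>   }
> }
> {code}
> In the real change the dump would land in the Namenode log right next to the "Exiting with status 1" line above, which is where anyone digging into a crash would look for it.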
> I could have a go at a patch if anyone thinks this is useful?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
