ambari-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-9163) Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Date Fri, 16 Jan 2015 00:22:35 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279582#comment-14279582 ]

Hudson commented on AMBARI-9163:
--------------------------------

FAILURE: Integrated in Ambari-trunk-Commit-docker #777 (See [https://builds.apache.org/job/Ambari-trunk-Commit-docker/777/])
AMBARI-9163. Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established (alejandro) (afernandez: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=fcce59e8ca5410ea1ac893b21406861c3d003a80)
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode.py
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml


> Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
> -----------------------------------------------------------------------------------------
>
>                 Key: AMBARI-9163
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9163
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: AMBARI-9163.patch
>
>
> The active NameNode shuts down during the first call to get the safemode status.
> {code}
> su - hdfs -c 'hdfs dfsadmin -safemode get'
> {code}
> returned
> {code}
> failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> {code}
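> The refused connection means the NameNode process was no longer listening when the status check ran. For illustration, here is a retry-hardened version of that check as a minimal Python sketch (the function name and retry policy are assumptions, not the actual change in namenode_upgrade.py):
> {code}
> import subprocess
> import time
>
> def get_safemode_status(retries=5, delay=10):
>     """Run the safemode query as the hdfs user, retrying while the NameNode is unreachable."""
>     cmd = ["su", "-", "hdfs", "-c", "hdfs dfsadmin -safemode get"]
>     for attempt in range(retries):
>         result = subprocess.run(cmd, capture_output=True, text=True)
>         if result.returncode == 0:
>             return result.stdout.strip()  # e.g. "Safe mode is OFF"
>         time.sleep(delay)  # NameNode may not be listening yet; back off and retry
>     raise RuntimeError("NameNode did not answer the safemode query after %d attempts" % retries)
> {code}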
> The active NameNode log shows the following during the same time window:
> {code}
> 2015-01-15 00:35:04,233 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.106:8485 failed to write txns 52-52. Will try to write to this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
> 	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:470)
> 	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:344)
> 	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
> 	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
> 	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> 	at com.sun.proxy.$Proxy12.journal(Unknown Source)
> 	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
> 	at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
> 	at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> This issue is intermittent because it depends on the behavior of the JournalNodes, so fixing it will require more work in the scripts.
> Today, our orchestration restarts one JournalNode at a time. However, the current log segment can be null because the NameNode has not yet rolled to a new one; a roll can be forced with "hdfs dfsadmin -rollEdits", followed by waiting until certain conditions hold.
> The runbook has more details:
> {code}
> // Function to ensure all JNs are up and functional
> ensureJNsAreUp(Jn1, Jn2, Jn3) {
>   roll edits at the namenode // hdfs dfsadmin -rollEdits
>   get "LastAppliedOrWrittenTxId" from the NN JMX
>   wait until "LastWrittenTxId" from all JNs is >= the transaction ID from the previous step; time out after 3 mins
> }
> // Before bringing down a JournalNode, ensure that the other two JournalNodes are up
> ensureJNsAreUp
> for each JN {
>   do upgrade of one JN
>   ensureJNsAreUp
> }
> {code}
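> A runnable Python sketch of that runbook logic (Python to match the patched scripts; the JMX bean and attribute names below come from stock Hadoop, but the helper names and polling interval are assumptions):
> {code}
> import json
> import subprocess
> import time
> import urllib.request
>
> def jmx_value(host_port, qry, key):
>     """Fetch one attribute from a Hadoop daemon's /jmx servlet."""
>     with urllib.request.urlopen("http://%s/jmx?qry=%s" % (host_port, qry)) as resp:
>         return int(json.load(resp)["beans"][0][key])
>
> def ensure_jns_are_up(nn_http, jn_http_list, journal_id, timeout_secs=180):
>     """Roll edits on the NameNode, then wait until every JournalNode has caught up."""
>     subprocess.check_call(["su", "-", "hdfs", "-c", "hdfs dfsadmin -rollEdits"])
>     target = jmx_value(nn_http,
>                        "Hadoop:service=NameNode,name=JournalTransactionInfo",
>                        "LastAppliedOrWrittenTxId")
>     bean = "Hadoop:service=JournalNode,name=Journal-%s" % journal_id
>     deadline = time.time() + timeout_secs
>     pending = set(jn_http_list)
>     while pending and time.time() < deadline:
>         for jn in list(pending):
>             if jmx_value(jn, bean, "LastWrittenTxId") >= target:
>                 pending.discard(jn)
>         time.sleep(5)  # poll until every JN reaches the target txid or we time out
>     if pending:
>         raise RuntimeError("JournalNodes %s did not reach txid %d within %d seconds"
>                            % (sorted(pending), target, timeout_secs))
> {code}
> The orchestration would call ensure_jns_are_up before taking down the first JournalNode and again after restarting each one, so a JN is never stopped while the quorum is still catching up.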
> Root caused to:
> https://github.com/apache/hadoop/blob/ae91b13a4b1896b893268253104f935c3078d345/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java, line 344



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
