Subject: Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
From: "Jonathan Hurley"
To: "Dmitro Lisnichenko", "Srimanth Gunturi", "Jonathan Hurley", "Yurii Shylov", "Nate Cole", "Sid Wagle", "Tom Beerbower"
Cc: "Alejandro Fernandez", "Ambari"
Reply-To: "Jonathan Hurley"
Date: Fri, 16 Jan 2015 03:14:31 -0000
Message-ID: <20150116031431.23996.69520@reviews.apache.org>
In-Reply-To: <20150116000734.23995.36937@reviews.apache.org>
References: <20150116000734.23995.36937@reviews.apache.org>
Auto-Submitted: auto-generated
X-ReviewBoard-URL: https://reviews.apache.org
X-ReviewGroup: Ambari
X-ReviewRequest-URL: https://reviews.apache.org/r/29950/
X-ReviewRequest-Repository: ambari


> On Jan. 15, 2015, 7:07 p.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py, lines 55-56
> >
> > This will not work in HA mode. The NameNode address is a combination of `dfs.namenode.http-address`, the HA cluster name, and the `nn` identifier, such as:
> >
> >     dfs.namenode.http-address.c1ha.nn2
>
> Alejandro Fernandez wrote:
>     With the current code, it returns a value like "c6408.ambari.apache.org:50070",
>     and the function get_jmx_data will convert it to something like
>     "http://c6408.ambari.apache.org:50070/jmx", which does appear to work.

I still think that this is an issue. Consider the following: in my cluster, I have

    `hdfs-site/dfs.namenode.http-address` set to `c6401.ambari.apache.org:50070`
    `hdfs-site/dfs.namenode.http-address.c1ha.nn1` set to `c6401.ambari.apache.org:50071`

My NameNode is on 50071, not 50070. We need to open a Jira to track this. Can you open one up?
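A minimal sketch of the kind of lookup I mean, preferring the HA-qualified property and falling back to the plain one (the function name and the hdfs_site dict shape here are illustrative assumptions, not the code in this patch):

`
def get_namenode_http_address(hdfs_site, nameservice=None, nn_id=None):
  """Return host:port of the NameNode HTTP endpoint, preferring the HA property."""
  # Sketch only; not the actual journalnode_upgrade.py code.
  if nameservice and nn_id:
    # HA layout: dfs.namenode.http-address.<nameservice>.<nn_id>
    ha_key = "dfs.namenode.http-address.{0}.{1}".format(nameservice, nn_id)
    if ha_key in hdfs_site:
      return hdfs_site[ha_key]
  # Non-HA fallback.
  return hdfs_site.get("dfs.namenode.http-address")

# Example with the values from my cluster:
# hdfs_site = {"dfs.namenode.http-address": "c6401.ambari.apache.org:50070",
#              "dfs.namenode.http-address.c1ha.nn1": "c6401.ambari.apache.org:50071"}
# get_namenode_http_address(hdfs_site, "c1ha", "nn1")  # -> "c6401.ambari.apache.org:50071"
`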

- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68366
-----------------------------------------------------------


On Jan. 15, 2015, 5:43 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29950/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2015, 5:43 p.m.)
> 
> 
> Review request for Ambari, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Srimanth Gunturi, Sid Wagle, Tom Beerbower, and Yurii Shylov.
> 
> 
> Bugs: AMBARI-9163
>     https://issues.apache.org/jira/browse/AMBARI-9163
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> The active NameNode shuts down during the first call to get the safemode status.
> `
> su - hdfs -c 'hdfs dfsadmin -safemode get'
> `
> 
> returned
> `
> failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> `
> 
> The active NameNode shows the following during the same time window:
> `
> 2015-01-15 00:35:04,233 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.106:8485 failed to write txns 52-52. Will try to write to this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:470)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:344)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> 
>         at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy12.journal(Unknown Source)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> `
> 
> This issue is intermittent because it depends on the behavior of the JournalNodes, so it will require more work in the scripts.
> 
> Today, our orchestration restarts one JournalNode at a time. However, the current log segment is null because the NameNode has not yet rolled to a new one, which can be forced with the command "hdfs dfsadmin -rollEdits" and then waiting until certain conditions are true, as sketched after the runbook below.
> 
> The runbook has more details:
> `
> // Function to ensure all JNs are up and functional
> ensureJNsAreUp(Jn1, Jn2, Jn3) {
>   rollEdits at the namenode                                   // hdfs dfsadmin -rollEdits
>   get "LastAppliedOrWrittenTxId" from the NN jmx
>   wait till "LastWrittenTxId" from all JNs is >= the transaction ID from the previous step, timeout after 3 mins
> }
> 
> // Before bringing down a journal node, ensure that the other two journal nodes are up
> ensureJNsAreUp
> for each JN {
>   do upgrade of one JN
>   ensureJNsAreUp
> }
> `
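> For illustration, a rough Python sketch of ensureJNsAreUp. The "LastAppliedOrWrittenTxId" and "LastWrittenTxId" names come from the runbook above, but the helper function, the flat JMX bean scan, and the command invocation are assumptions for the sketch, not the code in this patch (on some Hadoop versions the NameNode value is nested inside another JMX property):
> `
> import json
> import subprocess
> import time
> import urllib2
> 
> def read_jmx_value(host_port, prop):
>   """Fetch http://host:port/jmx and return the first bean value named prop."""
>   data = json.loads(urllib2.urlopen("http://%s/jmx" % host_port, timeout=10).read())
>   for bean in data.get("beans", []):
>     if prop in bean:
>       return bean[prop]
>   return None
> 
> def ensure_jns_are_up(nn_http_address, jn_http_addresses, timeout_secs=180):
>   # Force the NameNode to roll to a new edit log segment.
>   subprocess.check_call(["su", "-", "hdfs", "-c", "hdfs dfsadmin -rollEdits"])
>   # The transaction id every JournalNode must reach before the next JN restart.
>   target_txid = int(read_jmx_value(nn_http_address, "LastAppliedOrWrittenTxId"))
>   deadline = time.time() + timeout_secs
>   while time.time() < deadline:
>     written = [read_jmx_value(jn, "LastWrittenTxId") for jn in jn_http_addresses]
>     if all(w is not None and int(w) >= target_txid for w in written):
>       return
>     time.sleep(5)
>   raise Exception("JournalNodes did not catch up to txid %d within %d seconds"
>                   % (target_txid, timeout_secs))
> `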
> 
> Root caused to:
> https://github.com/apache/hadoop/blob/ae91b13a4b1896b893268253104f935c3078d345/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java line 344
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml ce0ab297a8c8e665e8ffde79b9b36be2d29d117c 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode.py 15e068947307a321566385fb670232af7f78d71b 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py PRE-CREATION 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py 93efae35281e7d3d175ecc95b3af4e531cf69b64 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py f185ea0d6b2e7dfe1cd8ce95287d2a2f1970e682 
> 
> Diff: https://reviews.apache.org/r/29950/diff/
> 
> 
> Testing
> -------
> 
> Copied the changed files to a 3-node HA cluster and verified that the upgrade worked twice.
> Unit tests passed:
> 
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 30:23.410s
> [INFO] Finished at: Thu Jan 15 14:43:23 PST 2015
> [INFO] Final Memory: 61M/393M
> [INFO] ------------------------------------------------------------------------
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 