ambari-dev mailing list archives

From "Alejandro Fernandez" <afernan...@hortonworks.com>
Subject Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Date Thu, 15 Jan 2015 22:21:06 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/
-----------------------------------------------------------

Review request for Ambari, Jonathan Hurley, Nate Cole, and Yurii Shylov.


Bugs: AMBARI-9163
    https://issues.apache.org/jira/browse/AMBARI-9163


Repository: ambari


Description
-------

The active NameNode shuts down during the first call to get the safemode status.
`
su - hdfs -c 'hdfs dfsadmin -safemode get'
`

returned
`
failed on connection exception: java.net.ConnectException: Connection refused; For more details
see:  http://wiki.apache.org/hadoop/ConnectionRefused
`

The active namenode shows the following during the same time window,
`
2015-01-15 00:35:04,233 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(388))
- Remote journal 192.168.64.106:8485 failed to write txns 52-52. Will try to write to this
JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException):
Can't write, no segment open
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:470)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:344)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy12.journal(Unknown Source)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
	at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
	at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
`

This issue is intermittent because it depends on the timing of the JournalNodes, so the
upgrade scripts need more work.

Today, our orchestration restarts one JournalNode at a time. However, after a restart the
current log segment is null because the edit log has not yet rolled to a new one. The roll
can be forced with the command "hdfs dfsadmin -rollEdits", after which we must wait until
some conditions are true.

The runbook has more details:
`
// Function to ensure all JNs are up and are functional
ensureJNsAreUp(Jn1, Jn2, Jn3) {
  rollEdits at the namenode // hdfs dfsadmin -rollEdits
  get "LastAppliedOrWrittenTxId" from NN jmx
  wait till "LastWrittenTxId" from all JNs is >= previous step transaction ID, timeout after 3 mins
}

// Before bringing down a journal node ensure that the other two journal nodes are up
ensureJNsAreUp
for each JN {
  do upgrade of one JN
  ensureJNsAreUp
}

`
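
A minimal Python sketch of the ensureJNsAreUp step from the runbook, for illustration only. It
is not the code in this patch. The function name, its parameters, and the injected callables
are all hypothetical: the caller is assumed to supply one callable that rolls the edit log,
one that reads LastAppliedOrWrittenTxId from the NameNode, and one per JournalNode that reads
its LastWrittenTxId. How those values are fetched (JMX URLs, ports, bean names) is left out on
purpose.

```python
import time


def ensure_jns_are_up(roll_edits, get_nn_txid, jn_txid_getters,
                      timeout_secs=180, poll_secs=5,
                      clock=time.time, sleep=time.sleep):
    """Return True once every JournalNode has caught up, False on timeout.

    roll_edits      -- callable that forces a new edit log segment
                       (for example by running 'hdfs dfsadmin -rollEdits')
    get_nn_txid     -- callable returning LastAppliedOrWrittenTxId as an int
    jn_txid_getters -- dict mapping a JournalNode name to a callable that
                       returns that node's LastWrittenTxId as an int
    """
    roll_edits()
    # Transaction ID every JournalNode must reach before we continue.
    target = get_nn_txid()
    deadline = clock() + timeout_secs
    pending = dict(jn_txid_getters)

    while True:
        for name, getter in list(pending.items()):
            try:
                if getter() >= target:
                    del pending[name]
            except Exception:
                # A JournalNode that is still starting may refuse
                # connections; treat it as "not caught up yet" and retry.
                pass
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

Passing the clock and sleep functions in as parameters keeps the 3-minute timeout from the
runbook as the default while letting the polling loop be exercised without real waiting.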

Root cause traced to line 344 of:
https://github.com/apache/hadoop/blob/ae91b13a4b1896b893268253104f935c3078d345/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java


Diffs
-----

  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml ce0ab297a8c8e665e8ffde79b9b36be2d29d117c
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode.py 15e068947307a321566385fb670232af7f78d71b
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py PRE-CREATION
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py 93efae35281e7d3d175ecc95b3af4e531cf69b64
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py f185ea0d6b2e7dfe1cd8ce95287d2a2f1970e682

Diff: https://reviews.apache.org/r/29950/diff/


Testing
-------

Copied the changed files to a 3-node HA cluster and verified that the upgrade succeeded twice.
Unit tests are in progress.


Thanks,

Alejandro Fernandez

