ambari-dev mailing list archives

From "Alejandro Fernandez (JIRA)" <>
Subject [jira] [Created] (AMBARI-9163) Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Date Thu, 15 Jan 2015 22:17:34 GMT
Alejandro Fernandez created AMBARI-9163:

             Summary: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
                 Key: AMBARI-9163
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.0.0
            Reporter: Alejandro Fernandez
            Assignee: Alejandro Fernandez
            Priority: Blocker
             Fix For: 2.0.0

The active namenode shuts down during the first call to get the safemode status:
su - hdfs -c 'hdfs dfsadmin -safemode get'

failed on connection exception: Connection refused; For more details

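For illustration, this is the kind of transient-failure handling an upgrade script could wrap around that check (a sketch only, not the Ambari fix; `run_safemode_get` is a hypothetical injected callable, not an Ambari API):

```python
import time

def check_safemode(run_safemode_get, retries=5, delay_secs=6):
    """Retry `hdfs dfsadmin -safemode get` instead of failing the step on the
    first "Connection refused". `run_safemode_get` is a hypothetical callable
    that executes the command and returns (exit_code, output)."""
    last_output = None
    for attempt in range(retries):
        code, last_output = run_safemode_get()
        if code == 0:
            return last_output  # e.g. "Safe mode is OFF"
        if attempt < retries - 1:
            time.sleep(delay_secs)
    raise RuntimeError("safemode check failed after %d attempts: %s"
                       % (retries, last_output))
```

Retrying only masks a brief RPC outage; it does not help here, because the NameNode actually aborts when the JournalNode quorum is lost, which is why the orchestration change below is needed.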
The active namenode log shows the following during the same time window:
2015-01-15 00:35:04,233 WARN  client.QuorumJournalManager (
- Remote journal failed to write txns 52-52. Will try to write to this
JN again after the next log roll.
Can't write, no segment open
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(
	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
	at org.apache.hadoop.ipc.RPC$
	at org.apache.hadoop.ipc.Server$Handler$
	at org.apache.hadoop.ipc.Server$Handler$
	at Method)
	at org.apache.hadoop.ipc.Server$

	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(
	at com.sun.proxy.$Proxy12.journal(Unknown Source)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(
	at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$
	at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$

This issue is intermittent because it depends on the behavior of the JournalNodes, so fixing it will require more work in the scripts.

Today, our orchestration restarts one JournalNode at a time. However, the current log segment is null because it has not yet rolled to a new one, which can be forced with the command "hdfs dfsadmin -rollEdits" and then waiting until certain conditions are true.

The runbook has more details:
// Function to ensure all JNs are up and functional
ensureJNsAreUp(Jn1, Jn2, Jn3) {
  rollEdits at the namenode  // hdfs dfsadmin -rollEdits
  get "LastAppliedOrWrittenTxId" from the NN jmx
  wait until "LastWrittenTxId" from all JNs is >= the transaction ID from the previous step, with a timeout of 3 minutes
}

// Before bringing down a journal node, ensure that the other two journal nodes are up
for each JN {
  ensureJNsAreUp(Jn1, Jn2, Jn3)
  do upgrade of one JN
}
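The wait condition above can be sketched in Python (hypothetical helper names, not the actual Ambari scripts; this assumes the standard Hadoop /jmx JSON shape with a top-level "beans" array):

```python
import json
import time

def extract_txid(jmx_body, key):
    """Pull a transaction ID (e.g. "LastAppliedOrWrittenTxId" on the NN,
    "LastWrittenTxId" on a JN) out of a Hadoop /jmx JSON response body."""
    for bean in json.loads(jmx_body).get("beans", []):
        if key in bean:
            return int(bean[key])
    return None  # metric not exposed by this endpoint

def jns_caught_up(nn_txid, jn_txids):
    """True only when every JournalNode has written at least the NN's txid."""
    return all(t is not None and t >= nn_txid for t in jn_txids)

def wait_for_jns(get_nn_txid, get_jn_txids, timeout_secs=180, poll_secs=5):
    """Poll until all JNs catch up to the NN txid, or the 3-minute timeout
    elapses. `get_nn_txid` and `get_jn_txids` are hypothetical callables
    wrapping the JMX fetches; rollEdits must have been issued first."""
    nn_txid = get_nn_txid()
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        if jns_caught_up(nn_txid, get_jn_txids()):
            return True
        time.sleep(poll_secs)
    return False
```

Checking all JNs against the post-roll NN txid guarantees an open, in-sync segment on every JournalNode before one of them is taken down, which is exactly the quorum precondition whose absence triggers the "Can't write, no segment open" failure above.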


Root caused to:
line 344

This message was sent by Atlassian JIRA
