hadoop-hdfs-dev mailing list archives

From "Hari Sekhon (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary failures
Date Thu, 30 Apr 2015 12:54:05 GMT
Hari Sekhon created HDFS-8298:
---------------------------------

             Summary: HA: NameNode should not shut down completely without quorum, doesn't recover from temporary failures
                 Key: HDFS-8298
                 URL: https://issues.apache.org/jira/browse/HDFS-8298
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: ha, HDFS, namenode, qjm
    Affects Versions: 2.6.0
         Environment: HDP 2.2
            Reporter: Hari Sekhon


In an HDFS HA setup, if there is a temporary problem contacting the JournalNodes (e.g. a network
interruption), the NameNode shuts down entirely. It should instead drop into standby
mode so that it can stay online and retry reaching quorum later.

If both NameNodes shut themselves down like this, then even after the temporary network outage
is resolved the entire cluster remains offline indefinitely until an operator intervenes,
whereas it could have repaired itself by re-contacting the JournalNodes and re-achieving
quorum.
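
The behaviour requested above could look roughly like the following Java sketch. It is an illustration only, not NameNode code: the class QuorumLossSketch, the Journal interface, and the State enum are hypothetical names invented here, and the real NameNode would still rely on its failover controller to decide which node becomes active.

```java
import java.io.IOException;

/** Hypothetical sketch of "demote and retry" on quorum loss; not Hadoop code. */
public class QuorumLossSketch {
    /** Simplified HA states, invented for this example. */
    public enum State { ACTIVE, STANDBY }

    /** Stand-in for a journal that needs a quorum of JournalNodes to accept a flush. */
    public interface Journal {
        void flush() throws IOException;
    }

    private final Journal journal;
    private State state = State.ACTIVE;

    public QuorumLossSketch(Journal journal) {
        this.journal = journal;
    }

    public State state() {
        return state;
    }

    /** Flush edits; on a quorum failure, demote to standby rather than exiting the process. */
    public void sync() {
        try {
            journal.flush();
        } catch (IOException e) {
            state = State.STANDBY;
        }
    }

    /**
     * Probe the journal up to maxAttempts times, sleeping delayMs after each failure.
     * Returns true as soon as a flush succeeds, i.e. quorum is reachable again.
     * The node stays in STANDBY either way; promotion is left to failover logic.
     */
    public boolean retryQuorum(int maxAttempts, long delayMs) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                journal.flush();
                return true;
            } catch (IOException e) {
                Thread.sleep(delayMs);
            }
        }
        return false;
    }
}
```

With this shape, a transient outage leaves the process alive in standby, and a later successful probe makes it eligible for failover again without an operator restart.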

{code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
        at java.lang.Thread.run(Thread.java:745)
2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 54270281
2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
************************************************************/{code}

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
