hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3891) QJM: SBN fails if selectInputStreams throws RTE
Date Wed, 05 Sep 2012 22:22:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449197#comment-13449197 ]

Todd Lipcon commented on HDFS-3891:

Currently, if FileJournalManager sees an error in {{selectInputStreams}}, it simply returns
no streams.

QuorumJournalManager, by contrast, currently throws an RTE (RuntimeException), which is what
caused the NN to exit:

2012-09-02 13:55:21,208 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown
error encountered while tailing edits. Shutting down standby NN.
java.lang.RuntimeException: java.io.IOException: Timed out waiting 20000 for write quorum
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:391)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:245)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1130)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:199)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:311)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:269)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:286)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:451)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:282)
Caused by: java.io.IOException: Timed out waiting 20000 for write quorum
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:150)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:388)
        ... 8 more

Clearly this is not the right behavior :)

So, it seems there are two options:
1) Have QJM.selectInputStreams simply log a WARN and return no streams (to match existing
FileJournalManager behavior)
2) Change the JournalManager API so that {{selectInputStreams}} throws an IOException, and
have the logging happen at a higher level when retries are the right behavior.

I am thinking that #2 might be the right longer term fix, but for now the simple fix (option
1) will at least make QJM and FJM act the same. Then we can separately fix this class of issues
in the JM API generally in trunk.
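
To make the difference between the two options concrete, here is a minimal, self-contained sketch. It is not the actual Hadoop code: {{Stream}}, {{QuorumSource}}, {{selectLenient}}, {{selectStrict}} and {{tailOnce}} are hypothetical names invented for illustration, and the real {{JournalManager}} signatures differ.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SelectStreamsSketch {

    /** Hypothetical stand-in for an edit log input stream. */
    static final class Stream {
        final long firstTxId;
        Stream(long firstTxId) { this.firstTxId = firstTxId; }
    }

    /** Hypothetical stand-in for the remote quorum call, which may time out. */
    interface QuorumSource {
        List<Stream> fetch(long fromTxId) throws IOException;
    }

    /** Option 1: log a WARN and return no streams instead of throwing. */
    static List<Stream> selectLenient(QuorumSource src, long fromTxId) {
        try {
            return src.fetch(fromTxId);
        } catch (IOException e) {
            System.err.println("WARN: unable to select input streams: " + e.getMessage());
            return Collections.emptyList();
        }
    }

    /** Option 2: declare the checked exception and let the caller decide. */
    static List<Stream> selectStrict(QuorumSource src, long fromTxId) throws IOException {
        return src.fetch(fromTxId);
    }

    /**
     * Caller-side handling for option 2: the logging happens at the higher
     * level. Returns the number of streams found, or -1 to signal "retry later".
     */
    static int tailOnce(QuorumSource src, long fromTxId) {
        try {
            return selectStrict(src, fromTxId).size();
        } catch (IOException e) {
            System.err.println("WARN: tailing failed, will retry: " + e.getMessage());
            return -1;
        }
    }

    public static void main(String[] args) {
        // A source that always fails, mimicking the quorum timeout in the log above.
        QuorumSource failing = fromTxId -> {
            throw new IOException("Timed out waiting 20000 for write quorum");
        };
        // A source that returns a single stream starting at the requested txid.
        QuorumSource healthy = fromTxId -> {
            List<Stream> streams = new ArrayList<>();
            streams.add(new Stream(fromTxId));
            return streams;
        };

        System.out.println("option 1, failing source: " + selectLenient(failing, 1L).size() + " streams");
        System.out.println("option 2, failing source: " + tailOnce(failing, 1L));
        System.out.println("option 2, healthy source: " + tailOnce(healthy, 1L));
    }
}
```

In both variants the failure stays a recoverable condition rather than an unchecked exception that reaches the tailer thread. The difference is only where the decision to log and retry is made: inside the journal manager (option 1) or in the caller (option 2).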
> QJM: SBN fails if selectInputStreams throws RTE
> -----------------------------------------------
>                 Key: HDFS-3891
>                 URL: https://issues.apache.org/jira/browse/HDFS-3891
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Currently, QJM's {{selectInputStreams}} method throws an RTE if a quorum cannot be reached.
> This propagates into the Standby Node and causes the whole node to crash. It should handle
> this error appropriately.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira