hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7952) On starting Standby with "rollback" option, lastPromisedEpoch gets updated and Active Namenode is shutting down.
Date Fri, 10 Apr 2015 05:37:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488956#comment-14488956 ]

Vinayakumar B commented on HDFS-7952:

Hi [~andreina]

With HDFS-7934, you cannot even start the standby NN with the rollback option.
So I believe you went ahead with the patch available in that jira applied.

Below are the reasons for the failure, as I found them:

1. The ZKFCs were already started/kept running before the NNs were restarted with the rollback option.
2. When the first NN (NN1) rolled back, it called {{this.editLog.discardSegments(discardSegmentTxId)}} to discard the extra edits added during the upgrade.
3. NN1 then became Active immediately.
4. When NN2 was restarted with the '-rollingUpgrade rollback' option, it also tried to call {{this.editLog.discardSegments(discardSegmentTxId)}} and continued to start.
5. Even though there were no edits left to clean in the journal nodes, the {{discardSegments(..)}} call from the standby NN2 changed the epoch in the journalnodes, which resulted in the Active shutting down.
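The steps above boil down to the QJM fencing rule: any journal RPC carrying an epoch lower than the journal node's {{lastPromisedEpoch}} is rejected, exactly as in the stack trace below. A minimal sketch of that rule (an illustrative model, not the actual {{Journal}} class):

```java
// Minimal model of the JournalNode epoch protocol (illustrative sketch,
// not the real org.apache.hadoop.hdfs.qjournal.server.Journal class).
class JournalNode {
    long lastPromisedEpoch = 0;

    // Called on newEpoch(): promise to reject writers with older epochs.
    void promiseEpoch(long epoch) {
        if (epoch > lastPromisedEpoch) {
            lastPromisedEpoch = epoch;
        }
    }

    // Mirrors the check in Journal.checkRequest(): stale writers are rejected.
    void journal(long ipcEpoch) {
        if (ipcEpoch < lastPromisedEpoch) {
            throw new IllegalStateException("IPC's epoch " + ipcEpoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
    }
}

class EpochModel {
    public static void main(String[] args) {
        JournalNode jn = new JournalNode();
        jn.promiseEpoch(5);   // NN1 becomes the Active writer at epoch 5
        jn.journal(5);        // normal edit flushes succeed

        jn.promiseEpoch(6);   // standby NN2's discardSegments(..) bumps the epoch
        try {
            jn.journal(5);    // Active's next flush fails -> FATAL, NN shuts down
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why an otherwise harmless, no-op {{discardSegments(..)}} from the standby is enough to fence out the Active.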

There are two workarounds for this:
1. Keep the ZKFCs stopped until both namenodes have completely rolled back. This needs a documentation update.
2. Check for extra segments in QuorumJournalManager itself before issuing the discardSegments() call to the journal nodes from the standby NN. But in this case, there is a possibility that the already rolled-back Active NN might have created new segments, and these could be deleted by the standby's discardSegments() call.

So I will go with the documentation update, i.e. *both ZKFCs should be started only after both NNs are rolled back*.
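Workaround 2 would amount to a guard roughly like the following on the standby side (a hypothetical sketch; {{hasSegmentsToDiscard}} and the segment list are illustrative, not actual QuorumJournalManager API):

```java
import java.util.List;

// Hypothetical guard for workaround 2 (illustrative, not actual HDFS code):
// only issue discardSegments() to the journal nodes when segments at or
// beyond the rollback txid actually exist, so the epoch is never bumped
// when there is nothing to discard.
class SegmentGuard {
    static boolean hasSegmentsToDiscard(List<Long> segmentStartTxIds,
                                        long discardSegmentTxId) {
        for (long startTxId : segmentStartTxIds) {
            if (startTxId >= discardSegmentTxId) {
                return true;   // extra segments exist: discard is needed
            }
        }
        return false;          // nothing to discard: skip the RPC entirely
    }
}
```

The race remains, though: between this check and the discard, the already rolled-back Active may open a new segment past discardSegmentTxId, which the discard would then wrongly remove. Hence the preference for the documentation fix.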

Any thoughts?

> On starting Standby with "rollback" option, lastPromisedEpoch gets updated and Active Namenode is shutting down.
> ----------------------------------------------------------------------------------------------------------------
>                 Key: HDFS-7952
>                 URL: https://issues.apache.org/jira/browse/HDFS-7952
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: J.Andreina
>            Assignee: J.Andreina
>            Priority: Critical
> Step 1: Start NN1 as active, NN2 as standby.
> Step 2: Perform "hdfs dfsadmin -rollingUpgrade prepare"
> Step 3: Restart NN2 as active and NN1 as standby with the rolling upgrade started option.
> Step 4: Restart the DN in upgrade mode and write files to hdfs.
> Step 5: Stop both Namenodes and the DN.
> Step 6: Restart NN2 as active and NN1 as standby with the rolling upgrade rollback option.

> Issue:
> =====
> On restarting NN1 as standby with the "rollback" option, lastPromisedEpoch gets updated and the active NN2 shuts down with the following exception.
> {noformat}
> 15/03/18 16:25:56 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [XXXXXXXXXXX:8485, YYYYYYYYYYY:8485], stream=QuorumOutputStream starting at txid 22))
> org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/2. 2 exceptions thrown:
> XXXXXXXXXXX:8485: IPC's epoch 5 is less than the last promised epoch 6
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
> {noformat}

This message was sent by Atlassian JIRA
