Date: Mon, 11 Apr 2016 20:22:26 +0000 (UTC)
From: "Jian Fang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-3743) QJM: improve formatting behavior for JNs

[ https://issues.apache.org/jira/browse/HDFS-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235906#comment-15235906 ]

Jian Fang commented on HDFS-3743:
---------------------------------

I implemented the "-newEditsOnly" option and tested it on a cluster with 3 instances that ran server-side daemons such as name nodes, resource managers, journal nodes, and others.
I replaced two of the three instances successfully, with the name node running fine, by issuing the command "hdfs namenode -initializeSharedEdits -newEditsOnly" before starting the name node. However, I ran into an issue when I replaced the last of the existing instances, i.e., when replacing the original m1, m2, and m3 with m4, m5, and m6 one at a time. The name node got stuck loading the fsimage. The error messages were as follows:

2016-04-04 06:51:34,612 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode (main): Loading 1 INodes.
2016-04-04 06:51:34,670 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf (main): Loaded FSImage in 0 seconds.
2016-04-04 06:51:34,670 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Loaded image for txid 0 from /mnt/var/lib/hadoop/dfs-name/current/fsimage_0000000000000000000
2016-04-04 06:51:34,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@3af6b48f expecting start txid #1
2016-04-04 06:51:34,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Start loading edits file http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280
2016-04-04 06:51:34,683 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream (main): Fast-forwarding stream 'http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280' to transaction ID 1
2016-04-04 06:51:34,683 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream (main): Fast-forwarding stream 'http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280' to transaction ID 1
2016-04-04 06:51:34,845 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (main): Encountered exception loading fsimage
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 19.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:837)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:692)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:814)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:798)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1557)

It seems the fsimage was messed up during the replacement. Could someone shed some light on what could possibly be wrong? In my "-newEditsOnly" logic, I skipped the following code snippet in the initializeSharedEdits() method, because the other name node could still be running, and could be the active one, when the above command was issued. I assumed that the newly formatted journal node could rely on the recovery protocol in QJM to sync up with its peers for the latest transactions.
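For reference, the "gap in the edit log" failure above is the NameNode insisting on a contiguous txid range starting right after its checkpoint. The following is a minimal, self-contained sketch of that invariant, not the actual FSEditLogLoader code; the Segment class and validate method are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class EditLogGapCheck {
    // A finalized edit-log segment covering txids [startTxId, endTxId].
    static class Segment {
        final long startTxId;
        final long endTxId;
        Segment(long startTxId, long endTxId) {
            this.startTxId = startTxId;
            this.endTxId = endTxId;
        }
    }

    // The NameNode must replay every txid after the checkpoint with no
    // holes: the first available segment has to start at
    // lastCheckpointTxId + 1, and each segment must continue exactly
    // where the previous one ended.
    static void validate(long lastCheckpointTxId, List<Segment> segments)
            throws IOException {
        long expected = lastCheckpointTxId + 1;
        for (Segment s : segments) {
            if (s.startTxId != expected) {
                throw new IOException("There appears to be a gap in the edit log. "
                        + "We expected txid " + expected + ", but got txid "
                        + s.startTxId + ".");
            }
            expected = s.endTxId + 1;
        }
    }

    public static void main(String[] args) {
        try {
            // The situation in the log above: a fresh fsimage at txid 0, but
            // the first journal segment the NN can find starts at txid 19.
            validate(0, Arrays.asList(new Segment(19, 3104)));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this view, the question is why the replaced NN ended up with an fsimage at txid 0 while the journals only held segments from txid 19 onward.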
----------------------------------------------------------------------------------------------
NNStorage newSharedStorage = sharedEditsImage.getStorage();
// Call Storage.format instead of FSImage.format here, since we don't
// actually want to save a checkpoint - just prime the dirs with
// the existing namespace info
newSharedStorage.format(nsInfo);
sharedEditsImage.getEditLog().formatNonFileJournals(nsInfo);
// Need to make sure the edit log segments are in good shape to initialize
// the shared edits dir.
fsns.getFSImage().getEditLog().close();
fsns.getFSImage().getEditLog().initJournalsForWrite();
fsns.getFSImage().getEditLog().recoverUnclosedStreams();
copyEditLogSegmentsToSharedDir(fsns, sharedEditsDirs, newSharedStorage, conf);
----------------------------------------------------------------------------------------------

On the newly formatted journal node, I did observe that the segments were recovered from its peers successfully, for example:

2016-04-11 19:06:38,894 WARN org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 0 on 8485): Storage directory /mnt/var/lib/hadoop/journal/local/data/my-cluster does not exist
2016-04-11 19:06:39,170 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 1 on 8485): Formatting org.apache.hadoop.hdfs.qjournal.server.Journal@21e17c6d with namespace info: lv=-63;cid=CID-72db83e1-fe0c-418a-a861-d9525f73adc8;nsid=1427415229;c=0;bpid=BP-1127919701-10.236.146.62-1460399925301
2016-04-11 19:06:39,170 INFO org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 1 on 8485): Formatting journal Storage Directory /mnt/var/lib/hadoop/journal/local/data/my-cluster with nsid: 1427415229
2016-04-11 19:06:39,216 INFO org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 1 on 8485): Lock on /mnt/var/lib/hadoop/journal/local/data/my-cluster/in_use.lock acquired by nodename 2774@ip-10-225-14-41.ec2.internal
2016-04-11 19:06:49,227 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): Updating lastPromisedEpoch from 0 to 3 for client /10.139.37.248
2016-04-11 19:06:49,274 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): Scanning storage FileJournalManager(root=/mnt/var/lib/hadoop/journal/local/data/my-cluster)
2016-04-11 19:06:49,294 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): No files in FileJournalManager(root=/mnt/var/lib/hadoop/journal/local/data/my-cluster)
2016-04-11 19:06:49,356 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 4 on 8485): Prepared recovery for segment 1667: lastWriterEpoch: 0
2016-04-11 19:06:49,545 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Synchronizing log startTxId: 1667 endTxId: 3104 isInProgress: true: no current segment in place
2016-04-11 19:06:49,553 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Synchronizing log startTxId: 1667 endTxId: 3104 isInProgress: true from http://ip-10-139-37-248.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=1667&storageInfo=-63%3A1427415229%3A0%3ACID-72db83e1-fe0c-418a-a861-d9525f73adc8
2016-04-11 19:06:50,336 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage (IPC Server handler 0 on 8485): Image Transfer timeout configured to 60000 milliseconds
2016-04-11 19:06:50,616 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage (IPC Server handler 0 on 8485): Transfer took 0.25s at 4079.68 KB/s
2016-04-11 19:06:50,632 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Accepted recovery for segment 1667: segmentState { startTxId: 1667 endTxId: 3104 isInProgress: true } acceptedInEpoch: 3
2016-04-11 19:06:50,646 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 2 on 8485): Validating log segment /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_inprogress_0000000000000001667 about to be finalized
2016-04-11 19:06:50,807 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager (IPC Server handler 2 on 8485): Finalizing edits file /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_inprogress_0000000000000001667 -> /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_0000000000000001667-0000000000000003104
2016-04-11 19:06:50,845 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 4 on 8485): Updating lastWriterEpoch from 0 to 3 for client /10.139.37.248

Did I miss anything that could mess up the fsimage files? Thanks in advance.

> QJM: improve formatting behavior for JNs
> ----------------------------------------
>
> Key: HDFS-3743
> URL: https://issues.apache.org/jira/browse/HDFS-3743
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: QuorumJournalManager (HDFS-3077)
> Reporter: Todd Lipcon
>
> Currently, the JournalNodes automatically format themselves when a new writer takes over, if they don't have any data for that namespace. However, this has a few problems:
> 1) if the administrator accidentally points a new NN at the wrong quorum (e.g. one corresponding to another cluster), it will auto-format a directory on those nodes. This doesn't cause any data loss, but it would be better to bail out with an error indicating that they need to be formatted.
> 2) if a journal node crashes and needs to be reformatted, it should be able to re-join the cluster and start storing new segments without having to fail over to a new NN.
> 3) if 2/3 JNs get accidentally reformatted (e.g. the mount point becomes undone) and the user starts the NN, it should fail to start, because it may end up missing edits. If it auto-formats in this case, the user might get a silent "rollback" of the most recent edits.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
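The three failure modes in the quoted issue description boil down to one rule: a JN should format only on an explicit administrative request, and never over existing data. A hypothetical sketch of that guard (JnFormatPolicy and mayFormat are made-up names for illustration, not actual Journal code):

```java
public class JnFormatPolicy {
    // Whether a JournalNode may format its storage directory.
    //   explicitFormat:  an operator explicitly requested the format
    //   hasExistingData: the JN already holds segments for this namespace
    static boolean mayFormat(boolean explicitFormat, boolean hasExistingData) {
        // Problems 1 and 3 above: never format just because a new writer
        // showed up, and never silently replace existing segments.
        // Problem 2: a crashed JN re-joins via an explicit re-format
        // followed by QJM segment recovery, without an NN failover.
        return explicitFormat && !hasExistingData;
    }

    public static void main(String[] args) {
        System.out.println(mayFormat(true, false));  // explicit format, empty dir: allowed
        System.out.println(mayFormat(false, false)); // new writer, empty dir: refuse
    }
}
```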