Date: Mon, 11 Apr 2016 20:22:26 +0000 (UTC)
From: "Jian Fang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-3743) QJM: improve formatting behavior for JNs

[ https://issues.apache.org/jira/browse/HDFS-3743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235906#comment-15235906 ]

Jian Fang commented on HDFS-3743:
---------------------------------

I implemented the "-newEditsOnly" option and tested it on a cluster with 3 instances that ran server-side daemons such as name nodes, resource managers, journal nodes, and others.
I replaced two of the three instances successfully, with the name node running fine, by issuing the command "hdfs namenode -initializeSharedEdits -newEditsOnly" before starting the name node. However, I ran into an issue when I replaced the last of the existing instances, i.e., when replacing the original m1, m2, and m3 with m4, m5, and m6 one at a time. The name node got stuck loading the fsimage. The error messages were as follows:

2016-04-04 06:51:34,612 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode (main): Loading 1 INodes.
2016-04-04 06:51:34,670 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf (main): Loaded FSImage in 0 seconds.
2016-04-04 06:51:34,670 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Loaded image for txid 0 from /mnt/var/lib/hadoop/dfs-name/current/fsimage_0000000000000000000
2016-04-04 06:51:34,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@3af6b48f expecting start txid #1
2016-04-04 06:51:34,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Start loading edits file http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280
2016-04-04 06:51:34,683 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream (main): Fast-forwarding stream 'http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280' to transaction ID 1
2016-04-04 06:51:34,683 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream (main): Fast-forwarding stream 'http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280' to transaction ID 1
2016-04-04 06:51:34,845 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (main): Encountered exception loading fsimage
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 19.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:837)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:692)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:814)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:798)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1557)

It seems the fsimage was messed up during the replacement. Could someone shed some light on what could possibly be wrong? In my "-newEditsOnly" logic, I skipped the following code snippet in the initializeSharedEdits() method, because the other name node could still be running, and could be the active one, when the above command was issued. I assumed that the newly formatted journal node could rely on the recovery protocol in QJM to sync up with its peers for the latest transactions.
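For reference, the "gap in the edit log" failure above is the NameNode insisting on a contiguous txid range starting right after its checkpoint. The following is a minimal, self-contained sketch of that invariant, not the actual FSEditLogLoader code; the Segment class and validate method are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class EditLogGapCheck {
    // A finalized edit-log segment covering txids [startTxId, endTxId].
    static class Segment {
        final long startTxId;
        final long endTxId;
        Segment(long startTxId, long endTxId) {
            this.startTxId = startTxId;
            this.endTxId = endTxId;
        }
    }

    // The NameNode must replay every txid after the checkpoint with no
    // holes: the first available segment has to start at
    // lastCheckpointTxId + 1, and each segment must continue exactly
    // where the previous one ended.
    static void validate(long lastCheckpointTxId, List<Segment> segments)
            throws IOException {
        long expected = lastCheckpointTxId + 1;
        for (Segment s : segments) {
            if (s.startTxId != expected) {
                throw new IOException("There appears to be a gap in the edit log. "
                        + "We expected txid " + expected + ", but got txid "
                        + s.startTxId + ".");
            }
            expected = s.endTxId + 1;
        }
    }

    public static void main(String[] args) {
        try {
            // The situation in the log above: a fresh fsimage at txid 0, but
            // the first journal segment the NN can find starts at txid 19.
            validate(0, Arrays.asList(new Segment(19, 3104)));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this view, the question is why the replaced NN ended up with an fsimage at txid 0 while the journals only held segments from txid 19 onward.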
----------------------------------------------------------------------------------------------
NNStorage newSharedStorage = sharedEditsImage.getStorage();
// Call Storage.format instead of FSImage.format here, since we don't
// actually want to save a checkpoint - just prime the dirs with
// the existing namespace info
newSharedStorage.format(nsInfo);
sharedEditsImage.getEditLog().formatNonFileJournals(nsInfo);
// Need to make sure the edit log segments are in good shape to initialize
// the shared edits dir.
fsns.getFSImage().getEditLog().close();
fsns.getFSImage().getEditLog().initJournalsForWrite();
fsns.getFSImage().getEditLog().recoverUnclosedStreams();
copyEditLogSegmentsToSharedDir(fsns, sharedEditsDirs, newSharedStorage, conf);
----------------------------------------------------------------------------------------------

On the newly formatted journal node, I did observe that the segments were recovered from its peers successfully, for example:

2016-04-11 19:06:38,894 WARN org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 0 on 8485): Storage directory /mnt/var/lib/hadoop/journal/local/data/my-cluster does not exist
2016-04-11 19:06:39,170 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 1 on 8485): Formatting org.apache.hadoop.hdfs.qjournal.server.Journal@21e17c6d with namespace info: lv=-63;cid=CID-72db83e1-fe0c-418a-a861-d9525f73adc8;nsid=1427415229;c=0;bpid=BP-1127919701-10.236.146.62-1460399925301
2016-04-11 19:06:39,170 INFO org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 1 on 8485): Formatting journal Storage Directory /mnt/var/lib/hadoop/journal/local/data/my-cluster with nsid: 1427415229
2016-04-11 19:06:39,216 INFO org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 1 on 8485): Lock on /mnt/var/lib/hadoop/journal/local/data/my-cluster/in_use.lock acquired by nodename 2774@ip-10-225-14-41.ec2.internal
2016-04-11 19:06:49,227 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): Updating lastPromisedEpoch from 0 to 3 for client /10.139.37.248
2016-04-11 19:06:49,274 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): Scanning storage FileJournalManager(root=/mnt/var/lib/hadoop/journal/local/data/my-cluster)
2016-04-11 19:06:49,294 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): No files in FileJournalManager(root=/mnt/var/lib/hadoop/journal/local/data/my-cluster)
2016-04-11 19:06:49,356 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 4 on 8485): Prepared recovery for segment 1667: lastWriterEpoch: 0
2016-04-11 19:06:49,545 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Synchronizing log startTxId: 1667 endTxId: 3104 isInProgress: true: no current segment in place
2016-04-11 19:06:49,553 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Synchronizing log startTxId: 1667 endTxId: 3104 isInProgress: true from http://ip-10-139-37-248.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=1667&storageInfo=-63%3A1427415229%3A0%3ACID-72db83e1-fe0c-418a-a861-d9525f73adc8
2016-04-11 19:06:50,336 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage (IPC Server handler 0 on 8485): Image Transfer timeout configured to 60000 milliseconds
2016-04-11 19:06:50,616 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage (IPC Server handler 0 on 8485): Transfer took 0.25s at 4079.68 KB/s
2016-04-11 19:06:50,632 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Accepted recovery for segment 1667: segmentState { startTxId: 1667 endTxId: 3104 isInProgress: true } acceptedInEpoch: 3
2016-04-11 19:06:50,646 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 2 on 8485): Validating log segment /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_inprogress_0000000000000001667 about to be finalized
2016-04-11 19:06:50,807 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager (IPC Server handler 2 on 8485): Finalizing edits file /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_inprogress_0000000000000001667 -> /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_0000000000000001667-0000000000000003104
2016-04-11 19:06:50,845 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 4 on 8485): Updating lastWriterEpoch from 0 to 3 for client /10.139.37.248

Did I miss anything that could mess up the fsimage files? Thanks in advance.

> QJM: improve formatting behavior for JNs
> ----------------------------------------
>
> Key: HDFS-3743
> URL: https://issues.apache.org/jira/browse/HDFS-3743
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: QuorumJournalManager (HDFS-3077)
> Reporter: Todd Lipcon
>
> Currently, the JournalNodes automatically format themselves when a new writer takes over, if they don't have any data for that namespace. However, this has a few problems:
> 1) if the administrator accidentally points a new NN at the wrong quorum (e.g. one corresponding to another cluster), it will auto-format a directory on those nodes. This doesn't cause any data loss, but it would be better to bail out with an error indicating that they need to be formatted.
> 2) if a journal node crashes and needs to be reformatted, it should be able to re-join the cluster and start storing new segments without having to fail over to a new NN.
> 3) if 2/3 JNs get accidentally reformatted (e.g. the mount point becomes undone) and the user starts the NN, it should fail to start, because it may end up missing edits. If it auto-formats in this case, the user might get a silent "rollback" of the most recent edits.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
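The three failure modes in the quoted issue description boil down to one rule: a JN should format only on an explicit administrative request, and never over existing data. A hypothetical sketch of that guard (JnFormatPolicy and mayFormat are made-up names for illustration, not actual Journal code):

```java
public class JnFormatPolicy {
    // Whether a JournalNode may format its storage directory.
    //   explicitFormat:  an operator explicitly requested the format
    //   hasExistingData: the JN already holds segments for this namespace
    static boolean mayFormat(boolean explicitFormat, boolean hasExistingData) {
        // Problems 1 and 3 above: never format just because a new writer
        // showed up, and never silently replace existing segments.
        // Problem 2: a crashed JN re-joins via an explicit re-format
        // followed by QJM segment recovery, without an NN failover.
        return explicitFormat && !hasExistingData;
    }

    public static void main(String[] args) {
        System.out.println(mayFormat(true, false));  // explicit format, empty dir: allowed
        System.out.println(mayFormat(false, false)); // new writer, empty dir: refuse
    }
}
```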