Return-Path: X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 06531CEFB for ; Wed, 30 May 2012 13:41:24 +0000 (UTC) Received: (qmail 32127 invoked by uid 500); 30 May 2012 13:41:23 -0000 Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org Received: (qmail 32049 invoked by uid 500); 30 May 2012 13:41:23 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hdfs-issues@hadoop.apache.org Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 32040 invoked by uid 99); 30 May 2012 13:41:23 -0000 Received: from issues-vm.apache.org (HELO issues-vm) (140.211.11.160) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 30 May 2012 13:41:23 +0000 Received: from isssues-vm.apache.org (localhost [127.0.0.1]) by issues-vm (Postfix) with ESMTP id 68A5E142857 for ; Wed, 30 May 2012 13:41:23 +0000 (UTC) Date: Wed, 30 May 2012 13:41:23 +0000 (UTC) From: "Ivan Kelly (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: <969464568.16166.1338385283431.JavaMail.jiratomcat@issues-vm> In-Reply-To: <1769775729.12863.1337323089092.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Commented] (HDFS-3441) Race condition between rolling logs at active NN and purging at standby MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HDFS-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285659#comment-13285659 ] Ivan Kelly commented on HDFS-3441: ---------------------------------- the code of the patch looks good. I think the test should be a little lower level and more explicit though. Its not clear that the new code path is actually getting exercised. Make getLedgerList() package private, and test that when you getChildren() returns a list of ledgers, and some do not exist when the read occurs, that you still get the others back fine. For this you would have to use spy() instead of mock(), and just stub out the getData() method. > Race condition between rolling logs at active NN and purging at standby > ----------------------------------------------------------------------- > > Key: HDFS-3441 > URL: https://issues.apache.org/jira/browse/HDFS-3441 > Project: Hadoop HDFS > Issue Type: Sub-task > Affects Versions: 2.0.0-alpha > Reporter: suja s > Assignee: Rakesh R > Attachments: HDFS-3441.1.patch, HDFS-3441.patch > > > Standby NN has got the ledgerlist with list of all files, including the inprogress file (with say inprogress_val1) > Active NN has done finalization and created new inprogress file. > Standby when proceeds further finds that the inprogress file which it had in the list is not present and NN gets shutdown > NN Logs > ========= > 2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds. > 2012-05-17 22:15:03,874 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode /xx.xx.xx.102:8020 > 2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 111 > 2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_0000000000000000109, cpktTxId=0000000000000000109) > 2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767, stream=null)) > java.io.IOException: Exception reading ledger list from zk > at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531) > at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444) > at org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541) > at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322) > at org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538) > at org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011) > at org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98) > at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900) > at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885) > at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822) > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157) > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52) > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279) > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200) > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220) > at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512) > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216) > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72 > at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142) > at org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113) > at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528) > ... 16 more > 2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: > ZK Data > ================ > [zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74 > -40;59;116 > cZxid = 0x2be > ctime = Thu May 17 22:15:03 IST 2012 > mZxid = 0x2be > mtime = Thu May 17 22:15:03 IST 2012 > pZxid = 0x2be > cversion = 0 > dataVersion = 0 > aclVersion = 0 > ephemeralOwner = 0x0 > dataLength = 10 > numChildren = 0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira