Return-Path: X-Original-To: apmail-hadoop-hdfs-user-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E4896730E for ; Tue, 25 Oct 2011 00:06:46 +0000 (UTC) Received: (qmail 74046 invoked by uid 500); 25 Oct 2011 00:06:46 -0000 Delivered-To: apmail-hadoop-hdfs-user-archive@hadoop.apache.org Received: (qmail 74004 invoked by uid 500); 25 Oct 2011 00:06:46 -0000 Mailing-List: contact hdfs-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hdfs-user@hadoop.apache.org Delivered-To: mailing list hdfs-user@hadoop.apache.org Received: (qmail 73995 invoked by uid 99); 25 Oct 2011 00:06:46 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Oct 2011 00:06:46 +0000 X-ASF-Spam-Status: No, hits=2.0 required=5.0 tests=RCVD_IN_DNSWL_NONE,SPF_NEUTRAL,URI_HEX X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [212.54.34.165] (HELO smtpq2.gn.mail.iss.as9143.net) (212.54.34.165) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Oct 2011 00:06:39 +0000 Received: from [212.54.34.145] (helo=smtp14.gn.mail.iss.as9143.net) by smtpq2.gn.mail.iss.as9143.net with esmtp (Exim 4.71) (envelope-from ) id 1RIUX4-0004Pf-9k for hdfs-user@hadoop.apache.org; Tue, 25 Oct 2011 02:06:18 +0200 Received: from 5249ddab.cm-4-2d.dynamic.ziggo.nl ([82.73.221.171] helo=relax.localnet) by smtp14.gn.mail.iss.as9143.net with esmtp (Exim 4.71) (envelope-from ) id 1RIUX0-0000PV-EU for hdfs-user@hadoop.apache.org; Tue, 25 Oct 2011 02:06:14 +0200 From: Markus Jelsma Reply-To: markus.jelsma@openindex.io Organization: Openindex To: hdfs-user@hadoop.apache.org Subject: Re: NameNode corruption: NPE addChild at start up Date: Tue, 25 Oct 2011 02:00:57 +0200 User-Agent: KMail/1.13.5 (Linux/2.6.35-30-generic; KDE/4.5.5; i686; ; ) References: <201110242217.45815.markus.jelsma@openindex.io> <201110242224.40669.markus.jelsma@openindex.io> In-Reply-To: <201110242224.40669.markus.jelsma@openindex.io> MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201110250200.58392.markus.jelsma@openindex.io> X-ZiggoSMTP-MailScanner-Information: Please contact the ISP for more information X-ZiggoSMTP-MailScanner-ID: 1RIUX0-0000PV-EU X-ZiggoSMTP-MailScanner: Found to be clean X-ZiggoSMTP-MailScanner-SpamCheck: geen spam, SpamAssassin (niet cached, score=-3.496, vereist 5, autolearn=not spam, ALL_TRUSTED -1.00, BAYES_00 -1.90, CM_REPLY_NOARROW 0.30, PROLO_TRUST_RDNS -3.00, RDNS_DYNAMIC 0.98, URI_HEX 1.12) X-ZiggoSMTP-MailScanner-From: markus.jelsma@openindex.io X-Old-Spam-Status: No A subsequent test under similar consitions showed an additional NameNode log line prior to shutdown: 2011-10-25 00:00:03,996 INFO org.apache.hadoop.fs.Trash: Created trash checkpoint: /user/systems/.Trash/1110250000 The following restart was without issues. Because the operations were manual i might have mispelled -skipTrash in one occasion. > Log output prior to first shutdown: > > Many StateChange lines: > .. > 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask > 178.21.118.71:50010 to delete blk_-6474991874786417864_1 > 10366 blk_4835119182915024496_110378 blk_5400176265071942037_110378 > blk_3844651748475104977_110378 blk_-1608821730784127393_110376 bl > k_6509383172138875668_110378 blk_5451246938053636538_110378 > blk_2897026915704967576_110378 blk_-8171354591956681743_110358 blk_831230 > 8457484333010_110378 blk_-3230829712632219383_110378 > 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask > 178.21.118.127:50010 to delete blk_-3399604157806337594_ > 110378 blk_6747099268224653868_110330 blk_-8319971768460712343_110366 > blk_-2961731871717799971_110378 blk_7869501598834020846_110374 > blk_8093290740979129346_110370 blk_-6013635848766803390_110378 > blk_7993259237237421138_110378 blk_5759478854526777476_110378 blk_-249 > 5357027905458427_110366 blk_-1854083208115601228_110366 > 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask > 178.21.118.106:50010 to delete blk_-7802030530595759204_ > 110378 blk_8721277224606661791_110374 blk_6747099268224653868_110330 > blk_5656147162240510324_110378 blk_2897026915704967576_110378 bl > k_-240991845576310137_110378 blk_5938453705724038271_110378 > blk_8312308457484333010_110378 blk_-1854083208115601228_110366 blk_469356 > 3029097482715_110378 blk_2642639928334327569_110374 > blk_-1087530163655362136_110378 blk_-330417000415464230_110378 > blk_54512469380536 > 36538_110378 blk_6300864378343690857_110378 blk_9188777915480180330_110378 > blk_6198541037144409710_110378 > 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask > 178.21.118.102:50010 to delete blk_-8874651062228528404_ > 110378 blk_-3399604157806337594_110378 blk_5374073877837004061_110338 > blk_-3734465814030860048_110370 blk_4355408303621436967_110378 > blk_4924117654799449604_110378 blk_1289214837876543883_110378 > blk_-240991845576310137_110378 blk_5938453705724038271_110378 blk_-8294 > 59405706536971_110370 blk_-7851154197821762387_110322 > blk_2792433696712651711_110378 blk_-361468857299887395_110378 > blk_5345112692694 > 71506_110342 blk_-7018473240249214767_110378 blk_5393329649654239848_110370 > 2011-10-24 10:10:12,065 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of > transactions: 1281 Total time for transac > tions(ms): 21Number of transactions batched in Syncs: 10 Number of syncs: > 870 SyncTimes(ms): 13578 > 2011-10-24 10:10:26,250 INFO > org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG > /************************************************************ > SHUTDOWN_MSG: Shutting down NameNode at > namenode01.openindex.io/178.21.118.98 > ************************************************************/ > > > Following startup: > > .. > 2011-10-24 10:11:58,601 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), > accessTokenLifetime=0 min(s) > 2011-10-24 10:12:00,478 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered > FSNamesystemStateMBean and NameNodeMXBean > 2011-10-24 10:12:00,592 INFO > org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names > occuring more than 10 times > 2011-10-24 10:12:00,609 INFO org.apache.hadoop.hdfs.server.common.Storage: > Number of files = 58901 > 2011-10-24 10:12:17,101 INFO org.apache.hadoop.hdfs.server.common.Storage: > Number of files under construction = 1 > 2011-10-24 10:12:17,373 INFO org.apache.hadoop.hdfs.server.common.Storage: > Image file of size 8688539 loaded in 16 seconds. > 2011-10-24 10:12:17,376 ERROR > org.apache.hadoop.hdfs.server.namenode.NameNode: > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.jav > a:1113) at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.jav > a:1125) at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java > :1028) at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDir > ectory.java:205) at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java > :613) at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:100 > 9) at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827 > ) at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImag > e.java:365) at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory. > java:97) at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem > .java:379) at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.jav > a:353) at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:25 > 4) at > org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:434) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.jav > a:1153) at > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162) > > > Hi, > > > > We are on Hadoop 0.20.203 and use HDFS and MapRed for crawling using > > Apache Nutch. This morning we had our first encounter with an unwilling > > NameNode due to a very similar issue descriped on the list earlier [1]. > > > > We replaced with a backed up checkpoint, restarted the daemon and moved > > files with missing blocks to lost+found. Only transient data was > > affected. > > > > Just now it happened again, same remedy and same results but it occured > > to me that there may be a reason besides coincidence or bad luck. After > > processing several GB's of prepared data we delete (-skipTrash) what we > > don't need and hurry on. > > > > Prior to both corruptions a several GB's were deleted and HDFS stopped > > only seconds later. Is there a possible relation here? It's not > > something i can test because we are not willing to risk corruption of > > important data; we're moving it to other locations as we speak. > > > > Any tips on how to prevent this from happening again or finding the > > source of the problem are very much appreciated. > > > > [1]: > > http://hadoop-common.472056.n3.nabble.com/addChild-NullPointerException- > > when-starting-namenode-and-reading-edits-file-td92226.html > > > > Thanks