From: Vishaal Jatav
Date: Wed, 18 May 2011 16:47:10 +0530
To: hdfs-user@hadoop.apache.org, Srinivasarao Vundavalli, Manjunath Sindagi
Subject: Null Pointer Exception while re-starting the Hadoop Cluster

Hi.

We are using a cluster of 2 computers (1 namenode and 2 secondary nodes) to store a large number of text files in HDFS. The cluster had been running for at least a couple of weeks when a power failure reset the server, so HDFS did not shut down cleanly. When I tried to restart the cluster, I got a NullPointerException with the following stack trace (from the logs):

2011-05-18 06:57:39,313 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=YYYYY
2011-05-18 06:57:39,321 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: master/172.XXX.XXX.XXX:YYYYY
2011-05-18 06:57:39,326 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-05-18 06:57:39,329 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-05-18 06:57:39,444 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=vishaal,vishaal
2011-05-18 06:57:39,444 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-05-18 06:57:39,444 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2011-05-18 06:57:39,459 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-05-18 06:57:39,461 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2011-05-18 06:57:39,521 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
2011-05-18 06:57:39,531 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2011-05-18 06:57:39,531 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 97 loaded in 0 seconds.
2011-05-18 06:57:39,532 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /home/vishaal/hadoop-0.20.2/tmp/dfs/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2011-05-18 06:57:39,535 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1320)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1309)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:997)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

2011-05-18 06:57:39,537 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 172.XXX.XXX.XXX
************************************************************/
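For context, the trace suggests the failure happens while replaying the edits file: a setTimes record is applied to a path whose inode cannot be resolved in the restored namespace, so the lookup returns null and the subsequent dereference throws. Below is a minimal, hypothetical sketch of that failure mode (simplified names; this is not the actual Hadoop source, just an illustration of how an image/edits mismatch after an unclean shutdown can produce this NPE):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the namenode's in-memory directory tree.
class MiniDirectory {
    static class Inode {
        long mtime, atime;
    }

    private final Map<String, Inode> tree = new HashMap<>();

    void create(String path) {
        tree.put(path, new Inode());
    }

    // Mirrors the shape of unprotectedSetTimes: look up the inode for
    // the path, then update its times. If the edits log references a
    // path that is absent from the loaded image (e.g. after an unclean
    // shutdown left fsimage and edits inconsistent), the lookup yields
    // null and the field write throws a NullPointerException.
    void replaySetTimes(String path, long mtime, long atime) {
        Inode inode = tree.get(path); // null when the path is missing
        inode.mtime = mtime;          // NPE here when inode == null
        inode.atime = atime;
    }
}

public class ReplayDemo {
    public static void main(String[] args) {
        MiniDirectory dir = new MiniDirectory();
        dir.create("/data/file1");
        dir.replaySetTimes("/data/file1", 1L, 1L); // succeeds
        try {
            dir.replaySetTimes("/data/missing", 2L, 2L);
        } catch (NullPointerException e) {
            System.out.println("NPE replaying setTimes for a missing inode");
        }
    }
}
```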

Though this was just an experiment to test the reliability of HDFS storage, I would love to get it running again. This is, of course, hoping that the data can be recovered if it has been corrupted. A couple more questions:
  • Is this a common problem? Is there an available patch? (I couldn't find one after a lot of Googling.)
  • If the servers are prone to power failures, is HDFS still a good choice for storing this data?
  • When this occurs, does it mean that all the data is corrupt, or only some of it? Can the corrupted data be recovered?
I would appreciate a prompt reply, as this was an attempt to prove the concept of using a distributed file system, as opposed to a relational database, to store large amounts of text. (I hope you understand that I am in the line of fire.)

Thanks in advance.
Vishaal Jatav.
(vishaal[dot]iitb04[at]gmail[dot]com)