From: Ayon Sinha
To: hdfs-user@hadoop.apache.org
Date: Tue, 5 Oct 2010 09:20:21 -0700 (PDT)
Subject: Re: NameNode crash - cannot start dfs - need help

We had almost exactly the same problem: the namenode's filesystem filled up and the namenode failed at this exact same point. Since you have freed up space now, you can copy over the
edits.new, fsimage, and the other two files from your /mnt/namesecondarynode/current and try restarting the namenode.
I believe you will lose some edits, and probably some blocks of some files, but we were able to recover most of our files.
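Roughly, the copy looks like this. This is a dry-run sketch against scratch directories so you can try it safely first; point NAME_DIR and CHECKPOINT_DIR at your real paths (/mnt/name/current and /mnt/namesecondarynode/current in your case), and treat the file list as what I saw on our cluster, not gospel:

```shell
# Dry-run sketch of the recovery copy. By default this touches only scratch
# dirs created here; override NAME_DIR/CHECKPOINT_DIR to run it for real.
NAME_DIR="${NAME_DIR:-$(mktemp -d)/name/current}"
CHECKPOINT_DIR="${CHECKPOINT_DIR:-$(mktemp -d)/namesecondary/current}"
mkdir -p "$NAME_DIR" "$CHECKPOINT_DIR"

# Stand-ins for the checkpoint files the secondary namenode keeps
# (on our cluster: fsimage, edits, fstime, VERSION).
for f in fsimage edits fstime VERSION; do
  touch "$CHECKPOINT_DIR/$f"
done

# 1) Keep a backup of the damaged name dir before overwriting anything.
cp -a "$NAME_DIR" "$NAME_DIR.bak"

# 2) Copy the checkpoint files into the namenode's current/ dir.
for f in fsimage edits fstime VERSION; do
  cp -a "$CHECKPOINT_DIR/$f" "$NAME_DIR/$f"
done

# 3) On the real cluster, restart the namenode afterwards:
#      bin/hadoop-daemon.sh start namenode
```

The backup in step 1 matters: if the checkpoint turns out to be older than you hoped, you want the damaged files back for another attempt.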
 
-Ayon



From: Matthew LeMieux <mdl@mlogiciels.com>
To: hdfs-user@hadoop.apache.org
Sent: Tue, October 5, 2010 8:16:15 AM
Subject: NameNode crash - cannot start dfs - need help

The namenode on an otherwise very stable HDFS cluster crashed recently. The filesystem on the namenode filled up, which I assume is what caused the crash. The problem has been fixed, but I cannot get the namenode to restart. I am using version CDH3b2 (hadoop-0.20.2+320).

The error is this: 

2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. The exception there is different, but the cause seemed similar: a corrupt edits file. I tried removing a line at a time, but the error persists, only with a smaller size and edits #:

2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...
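For what it's worth, the "^@" in the exception is how a NUL byte renders, so the tail of the file is probably zero padding left behind when the disk filled up. Instead of a hex editor, a small script could trim trailing NULs from a *copy* of the edits file; this is a hypothetical helper I wrote for illustration, not part of Hadoop, and it only helps if the damage really is pure trailing padding (a record truncated mid-write will still fail to parse, so keep the original):

```python
# Hypothetical helper, not part of Hadoop: writes a copy of the edits file
# with trailing NUL padding removed. The "12862^@^@..." in the
# NumberFormatException suggests the file ends in zero bytes.
def trim_trailing_nuls(src, dst):
    with open(src, "rb") as f:
        data = f.read()
    end = len(data)
    # Walk backwards past the run of NUL bytes at the end of the file.
    while end > 0 and data[end - 1] == 0:
        end -= 1
    with open(dst, "wb") as f:
        f.write(data[:end])
    return len(data) - end  # how many NUL bytes were dropped
```

Run it against a copy (e.g. trim_trailing_nuls("/mnt/name/current/edits", "/tmp/edits.trimmed")) and inspect the result before putting anything back in place.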

I tried removing the edits file altogether, but that failed with: java.io.IOException: Edits file is not found

I tried with a zero length edits file, so it would at least have a file there, but that results in an NPE: 

2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)


Most if not all of the files I noticed in the edits file are temporary files that will be deleted once this thing is back up and running anyway. There is a closed ticket that might be related (https://issues.apache.org/jira/browse/HDFS-686), but the version I'm using seems to already include the HDFS-686 fix (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).
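One other option I'm considering, if hand-editing doesn't pan out, is having the namenode import the secondary's checkpoint directly. I believe 0.20 supports an -importCheckpoint startup option, but I haven't verified it on CDH3b2, so treat this as an untested idea:

```shell
# Untested idea: import the secondary's checkpoint instead of hand-copying
# files. Requires fs.checkpoint.dir to point at the secondary's checkpoint
# directory (e.g. /mnt/namesecondarynode here); the namenode refuses to
# import if dfs.name.dir already contains an image, so move it aside first.
bin/hadoop namenode -importCheckpoint
```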

What do I have to do to get back up and running?

Thank you for your help, 

Matthew


