From: Raghu Angadi
Date: Tue, 05 May 2009 12:01:06 -0700
To: core-user@hadoop.apache.org
Subject: Re: Namenode failed to start with "FSNamesystem initialization failed" error

Stas Oskin wrote:
> Actually, we discovered today an annoying bug in our test-app, which might
> have moved some of the HDFS files to the cluster, including the metadata
> files.

oops! presumably it could have removed the image file itself.

> I presume it could be the possible reason for such behavior? :)

certainly. It could lead to many different failures. If you had a stack
trace of the exception, it would be clearer what the error was this time.

Raghu.
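For reference, preserving the evidence before reformatting is cheap and makes
this kind of failure much easier to diagnose afterwards. A minimal sketch,
assuming the default log layout and a hypothetical metadata directory
/data/dfs/name (substitute your own dfs.name.dir):

  # Pull the full stack trace that follows the error line out of the
  # NameNode log (default name: hadoop-<user>-namenode-<host>.log).
  grep -B 2 -A 30 "FSNamesystem initialization failed" \
      $HADOOP_LOG_DIR/hadoop-*-namenode-*.log

  # Snapshot the name directory (fsimage + edits) before attempting any
  # recovery, so the corrupt image can be inspected or reproduced later.
  tar czf /tmp/namenode-image-$(date +%F).tar.gz -C /data/dfs/name .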
> 2009/5/5 Stas Oskin
>
>> Hi Raghu.
>>
>> The only lead I have is that my root mount has filled up completely.
>>
>> This in itself should not have caused the metadata corruption, as it has
>> been stored on another mount point, which had plenty of space.
>>
>> But perhaps the fact that the NameNode/SecondaryNameNode didn't have
>> enough space for logs has caused this?
>>
>> Unfortunately I was pressed for time to get the cluster up and running,
>> and didn't preserve the logs or the image.
>> If this happens again - I will surely do so.
>>
>> Regards.
>>
>> 2009/5/5 Raghu Angadi
>>
>>> Stas,
>>>
>>> This is indeed a serious issue.
>>>
>>> Did you happen to store the corrupt image? Can this be reproduced
>>> using the image?
>>>
>>> Usually you can recover manually from a corrupt or truncated image. But
>>> more importantly, we want to find out how it got into this state.
>>>
>>> Raghu.
>>>
>>> Stas Oskin wrote:
>>>
>>>> Hi.
>>>>
>>>> This is quite a worrisome issue.
>>>>
>>>> Can anyone advise on this? I'm really concerned it could appear in
>>>> production and cause a huge data loss.
>>>>
>>>> Is there any way to recover from this?
>>>>
>>>> Regards.
>>>>
>>>> 2009/5/5 Tamir Kamara
>>>>
>>>>> I didn't have a space problem which led to it (I think). The
>>>>> corruption started after I bounced the cluster.
>>>>> At the time, I tried to investigate what led to the corruption but
>>>>> didn't find anything useful in the logs besides this line:
>>>>>
>>>>> saveLeases found path
>>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>>> but no matching entry in namespace
>>>>>
>>>>> I also tried to recover from the secondary name node files but the
>>>>> corruption was too wide-spread and I had to format.
>>>>>
>>>>> Tamir
>>>>>
>>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> Same conditions - where the space has run out and the fs got
>>>>>> corrupted?
>>>>>>
>>>>>> Or it got corrupted by itself (which is even more worrying)?
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> 2009/5/4 Tamir Kamara
>>>>>>
>>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>>>> reformat the cluster too...
>>>>>>>
>>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin wrote:
>>>>>>>
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> After rebooting the NameNode server, I found out the NameNode
>>>>>>>> doesn't start anymore.
>>>>>>>>
>>>>>>>> The logs contained this error:
>>>>>>>> "FSNamesystem initialization failed"
>>>>>>>>
>>>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>>>
>>>>>>>> I had an issue that might have caused this - the root mount has
>>>>>>>> run out of space. But both the NameNode and the SecondaryNameNode
>>>>>>>> directories were on another mount point with plenty of space
>>>>>>>> there - so it's very strange that they were impacted in any way.
>>>>>>>>
>>>>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>>>>> result could not be written, have caused this?
>>>>>>>>
>>>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>>>> erasing the files from the DataNodes). While this is reasonable in
>>>>>>>> a test environment, production-wise it would be very bad.
>>>>>>>>
>>>>>>>> Any idea why it happened, and what can be done to prevent it in
>>>>>>>> the future?
>>>>>>>>
>>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>>>
>>>>>>>> Thanks in advance!
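For prevention, two things help on 0.18.x. First, dfs.name.dir accepts a
comma-separated list of directories, and the NameNode writes the image and
edit log to every directory in the list, so a second copy on a separate disk
(or NFS mount) survives the loss of one. A minimal conf/hadoop-site.xml
sketch - the paths are placeholders, assuming a second disk mounted at
/mnt/disk2:

  <property>
    <name>dfs.name.dir</name>
    <!-- The name table is replicated into each listed directory;
         keep them on separate disks or mounts for redundancy. -->
    <value>/data/dfs/name,/mnt/disk2/dfs/name</value>
  </property>

  <property>
    <name>fs.checkpoint.dir</name>
    <!-- Where the SecondaryNameNode stores its checkpoint. -->
    <value>/mnt/disk2/dfs/namesecondary</value>
  </property>

Second, point HADOOP_LOG_DIR in conf/hadoop-env.sh at a partition with free
space, so a full root mount can't leave the daemons unable to write their
logs - exactly the situation that made this failure so hard to diagnose.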