From: Raghu Angadi
Date: Tue, 05 May 2009 12:01:06 -0700
To: core-user@hadoop.apache.org
Subject: Re: Namenode failed to start with "FSNamesystem initialization failed" error

Stas Oskin wrote:
> Actually, we discovered today an annoying bug in our test-app, which might
> have moved some of the HDFS files to the cluster, including the metadata
> files.

oops! presumably it could have removed the image file itself.

> I presume it could be the possible reason for such behavior? :)

certainly. It could lead to many different failures. If you had a stack
trace of the exception, it would be clearer what the error was this time.

Raghu.
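For reference, preserving the evidence before reformatting is cheap and makes
this kind of failure much easier to diagnose afterwards. A minimal sketch,
assuming the default log layout and a hypothetical metadata directory
/data/dfs/name (substitute your own dfs.name.dir):

  # Pull the full stack trace that follows the error line out of the
  # NameNode log (default name: hadoop-<user>-namenode-<host>.log).
  grep -B 2 -A 30 "FSNamesystem initialization failed" \
      $HADOOP_LOG_DIR/hadoop-*-namenode-*.log

  # Snapshot the name directory (fsimage + edits) before attempting any
  # recovery, so the corrupt image can be inspected or reproduced later.
  tar czf /tmp/namenode-image-$(date +%F).tar.gz -C /data/dfs/name .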
> 2009/5/5 Stas Oskin
>
>> Hi Raghu.
>>
>> The only lead I have is that my root mount has filled up completely.
>>
>> This in itself should not have caused the metadata corruption, as it has
>> been stored on another mount point, which had plenty of space.
>>
>> But perhaps the fact that the NameNode/SecondaryNameNode didn't have
>> enough space for logs has caused this?
>>
>> Unfortunately I was pressed for time to get the cluster up and running,
>> and didn't preserve the logs or the image.
>> If this happens again - I will surely do so.
>>
>> Regards.
>>
>> 2009/5/5 Raghu Angadi
>>
>>> Stas,
>>>
>>> This is indeed a serious issue.
>>>
>>> Did you happen to store the corrupt image? Can this be reproduced
>>> using the image?
>>>
>>> Usually you can recover manually from a corrupt or truncated image. But
>>> more importantly, we want to find out how it got into this state.
>>>
>>> Raghu.
>>>
>>> Stas Oskin wrote:
>>>
>>>> Hi.
>>>>
>>>> This is quite a worrisome issue.
>>>>
>>>> Can anyone advise on this? I'm really concerned it could appear in
>>>> production and cause a huge data loss.
>>>>
>>>> Is there any way to recover from this?
>>>>
>>>> Regards.
>>>>
>>>> 2009/5/5 Tamir Kamara
>>>>
>>>>> I didn't have a space problem which led to it (I think). The
>>>>> corruption started after I bounced the cluster.
>>>>> At the time, I tried to investigate what led to the corruption but
>>>>> didn't find anything useful in the logs besides this line:
>>>>>
>>>>> saveLeases found path
>>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>>> but no matching entry in namespace
>>>>>
>>>>> I also tried to recover from the secondary name node files but the
>>>>> corruption was too wide-spread and I had to format.
>>>>>
>>>>> Tamir
>>>>>
>>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> Same conditions - where the space has run out and the fs got
>>>>>> corrupted?
>>>>>>
>>>>>> Or it got corrupted by itself (which is even more worrying)?
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> 2009/5/4 Tamir Kamara
>>>>>>
>>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>>>> reformat the cluster too...
>>>>>>>
>>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin wrote:
>>>>>>>
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> After rebooting the NameNode server, I found out the NameNode
>>>>>>>> doesn't start anymore.
>>>>>>>>
>>>>>>>> The logs contained this error:
>>>>>>>> "FSNamesystem initialization failed"
>>>>>>>>
>>>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>>>
>>>>>>>> I had an issue that might have caused this - the root mount has
>>>>>>>> run out of space. But both the NameNode and the SecondaryNameNode
>>>>>>>> directories were on another mount point with plenty of space
>>>>>>>> there - so it's very strange that they were impacted in any way.
>>>>>>>>
>>>>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>>>>> result could not be written, have caused this?
>>>>>>>>
>>>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>>>> erasing the files from the DataNodes). While this is reasonable in
>>>>>>>> a test environment, production-wise it would be very bad.
>>>>>>>>
>>>>>>>> Any idea why it happened, and what can be done to prevent it in
>>>>>>>> the future?
>>>>>>>>
>>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>>>
>>>>>>>> Thanks in advance!
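For prevention, two things help on 0.18.x. First, dfs.name.dir accepts a
comma-separated list of directories, and the NameNode writes the image and
edit log to every directory in the list, so a second copy on a separate disk
(or NFS mount) survives the loss of one. A minimal conf/hadoop-site.xml
sketch - the paths are placeholders, assuming a second disk mounted at
/mnt/disk2:

  <property>
    <name>dfs.name.dir</name>
    <!-- The name table is replicated into each listed directory;
         keep them on separate disks or mounts for redundancy. -->
    <value>/data/dfs/name,/mnt/disk2/dfs/name</value>
  </property>

  <property>
    <name>fs.checkpoint.dir</name>
    <!-- Where the SecondaryNameNode stores its checkpoint. -->
    <value>/mnt/disk2/dfs/namesecondary</value>
  </property>

Second, point HADOOP_LOG_DIR in conf/hadoop-env.sh at a partition with free
space, so a full root mount can't leave the daemons unable to write their
logs - exactly the situation that made this failure so hard to diagnose.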