hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: Namenode failed to start with "FSNamesystem initialization failed" error
Date Tue, 05 May 2009 19:01:06 GMT
Stas Oskin wrote:
> Actually, we discovered today an annoying bug in our test-app, which might
> have moved some of the HDFS files to the cluster, including the metadata
> files.

Oops! Presumably it could have removed the image file itself.

> I presume it could be the possible reason for such behavior? :)

Certainly. It could lead to many different failures. If you had the stack 
trace of the exception, it would be clearer what the error was this time.
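
If it happens again, the full trace will be in the namenode log; 
something like this usually digs it out (the log path depends on how 
Hadoop was installed, so adjust it):

   grep -B 2 -A 30 "FSNamesystem initialization failed" \
       logs/hadoop-*-namenode-*.log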

Raghu.

> 2009/5/5 Stas Oskin <stas.oskin@gmail.com>
> 
>> Hi Raghu.
>>
>> The only lead I have is that my root mount filled up completely.
>>
>> This in itself should not have caused the metadata corruption, as the
>> metadata was stored on another mount point, which had plenty of space.
>>
>> But perhaps the fact that the NameNode/SecondaryNameNode didn't have
>> enough space for logs caused this?
>>
>> Unfortunately I was pressed for time to get the cluster up and running,
>> and didn't preserve the logs or the image.
>> If this happens again, I will surely do so.
>>
>> Regards.
>>
>> 2009/5/5 Raghu Angadi <rangadi@yahoo-inc.com>
>>
>>> Stas,
>>>
>>> This is indeed a serious issue.
>>>
>>> Did you happen to store the corrupt image? Can this be reproduced
>>> using the image?
>>>
>>> Usually you can recover manually from a corrupt or truncated image. But
>>> more importantly, we want to find how it got into this state.
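>>>
>>> Roughly, the manual route is (a sketch - the real paths come from
>>> dfs.name.dir and fs.checkpoint.dir in your config, and the on-disk
>>> layout varies a bit between releases, so check what is actually
>>> there first):
>>>
>>>   bin/stop-dfs.sh
>>>   # keep the bad image around so it can be analyzed later:
>>>   cp -a /path/to/dfs.name.dir /safe/place/dfs.name.dir.bad
>>>   # copy the secondary's checkpoint image over the corrupt one:
>>>   cp /path/to/fs.checkpoint.dir/current/fsimage \
>>>      /path/to/dfs.name.dir/current/fsimage
>>>   bin/start-dfs.sh
>>>
>>> You lose whatever changed since the last checkpoint, but the namespace
>>> itself comes back.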
>>>
>>> Raghu.
>>>
>>> Stas Oskin wrote:
>>>
>>>> Hi.
>>>>
>>>> This is quite a worrisome issue.
>>>>
>>>> Can anyone advise on this? I'm really concerned it could appear in
>>>> production and cause huge data loss.
>>>>
>>>> Is there any way to recover from this?
>>>>
>>>> Regards.
>>>>
>>>> 2009/5/5 Tamir Kamara <tamirkamara@gmail.com>
>>>>
>>>>> I didn't have a space problem which led to it (I think). The
>>>>> corruption started after I bounced the cluster.
>>>>> At the time, I tried to investigate what led to the corruption but
>>>>> didn't find anything useful in the logs besides this line:
>>>>>
>>>>> saveLeases found path
>>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>>> but no matching entry in namespace
>>>>>
>>>>> I also tried to recover from the secondary name node files but the
>>>>> corruption was too widespread and I had to format.
>>>>>
>>>>> Tamir
>>>>>
>>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.oskin@gmail.com> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> Same conditions - where the space ran out and the fs got corrupted?
>>>>>>
>>>>>> Or did it get corrupted by itself (which is even more worrying)?
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> 2009/5/4 Tamir Kamara <tamirkamara@gmail.com>
>>>>>>
>>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>>>> reformat the cluster too...
>>>>>>>
>>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.oskin@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> After rebooting the NameNode server, I found out the NameNode doesn't
>>>>>>>> start anymore.
>>>>>>>>
>>>>>>>> The logs contained this error:
>>>>>>>> "FSNamesystem initialization failed"
>>>>>>>>
>>>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>>>
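>>>>>>>> In hindsight, checking that the secondary is actually taking
>>>>>>>> checkpoints is cheap - something like this (a sketch; the
>>>>>>>> fs.checkpoint.dir path comes from your config, and checkpoints
>>>>>>>> default to one per hour, fs.checkpoint.period=3600):
>>>>>>>>
>>>>>>>>   # the checkpoint files should exist and be recent:
>>>>>>>>   ls -lR /path/to/fs.checkpoint.dir
>>>>>>>>   # the secondary logs every checkpoint it completes:
>>>>>>>>   grep -i checkpoint logs/hadoop-*-secondarynamenode-*.log | tail
>>>>>>>>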
>>>>>>>> I had an issue that might have caused this - the root mount has run
>>>>>>>> out of space. But both the NameNode and the SecondaryNameNode
>>>>>>>> directories were on another mount point with plenty of space there -
>>>>>>>> so it's very strange that they were impacted in any way.
>>>>>>>>
>>>>>>>> Perhaps the logs, which were located on the root mount and, as a
>>>>>>>> result, could not be written, have caused this?
>>>>>>>>
>>>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>>>> erasing the files from the DataNodes). While this is reasonable in a
>>>>>>>> test environment, production-wise it would be very bad.
>>>>>>>>
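>>>>>>>> For the record, the reset boiled down to something like this (a
>>>>>>>> sketch - the dfs.data.dir path comes from your hadoop-site.xml):
>>>>>>>>
>>>>>>>>   bin/stop-all.sh
>>>>>>>>   bin/hadoop namenode -format
>>>>>>>>   # on each DataNode, wipe the old block storage - otherwise the
>>>>>>>>   # DataNodes refuse to register with the freshly formatted
>>>>>>>>   # NameNode (namespaceID mismatch):
>>>>>>>>   rm -rf /path/to/dfs.data.dir/*
>>>>>>>>   bin/start-all.sh
>>>>>>>>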
>>>>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>>>>> future?
>>>>>>>>
>>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>>>
>>>>>>>> Thanks in advance!
> 

