hadoop-common-user mailing list archives

From Tamir Kamara <tamirkam...@gmail.com>
Subject Re: Namenode failed to start with "FSNamesystem initialization failed" error
Date Sun, 10 May 2009 08:11:12 GMT
Filed HADOOP-5798.


On Wed, May 6, 2009 at 9:53 PM, Raghu Angadi <rangadi@yahoo-inc.com> wrote:

> Tamir Kamara wrote:
>
>> Hi Raghu,
>>
>> The thread you posted is my original post written when this problem first
>> happened on my cluster. I can file a JIRA but I wouldn't be able to provide
>> information other than what I already posted and I don't have the logs from
>> that time. Should I still file?
>>
>
> Yes. Jira is a better place for tracking and fixing bugs. I am pretty sure
> what you saw is a bug (either already fixed or needs to be fixed).
>
> Raghu.
>
>
>> Thanks,
>> Tamir
>>
>>
>> On Tue, May 5, 2009 at 9:14 PM, Raghu Angadi <rangadi@yahoo-inc.com>
>> wrote:
>>
>>> Tamir,
>>>
>>> Please file a jira on the problem you are seeing with 'saveLeases'. In the
>>> past there have been multiple fixes in this area (HADOOP-3418, HADOOP-3724,
>>> and more mentioned in HADOOP-3724).
>>>
>>> Also refer to the thread you started:
>>> http://www.mail-archive.com/core-user@hadoop.apache.org/msg09397.html
>>>
>>> I think another user reported the same problem recently.
>>>
>>> These are indeed very serious and very annoying bugs.
>>>
>>> Raghu.
>>>
>>>
>>> Tamir Kamara wrote:
>>>
>>>> I didn't have a space problem which led to it (I think). The corruption
>>>> started after I bounced the cluster.
>>>> At the time, I tried to investigate what led to the corruption but didn't
>>>> find anything useful in the logs besides this line:
>>>> saveLeases found path
>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>> but no matching entry in namespace
>>>>
>>>> I also tried to recover from the secondary name node files but the
>>>> corruption was too widespread and I had to format.
>>>>
>>>> Tamir
>>>>
>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.oskin@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> Same conditions - where the space has run out and the fs got corrupted?
>>>>>
>>>>> Or it got corrupted by itself (which is even more worrying)?
>>>>>
>>>>> Regards.
>>>>>
>>>>> 2009/5/4 Tamir Kamara <tamirkamara@gmail.com>
>>>>>
>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to reformat
>>>>>> the cluster too...
>>>>>>
>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.oskin@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> After rebooting the NameNode server, I found out the NameNode doesn't
>>>>>>> start anymore.
>>>>>>>
>>>>>>> The logs contained this error:
>>>>>>> "FSNamesystem initialization failed"
>>>>>>>
>>>>>>> I suspected filesystem corruption, so I tried to recover from
>>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>>
>>>>>>> I had an issue that might have caused this - the root mount has run out
>>>>>>> of space. But, both the NameNode and the SecondaryNameNode directories
>>>>>>> were on another mount point with plenty of space there - so it's very
>>>>>>> strange that they were impacted in any way.
>>>>>>>
>>>>>>> Perhaps the logs, which were located on the root mount and as a result
>>>>>>> could not be written, have caused this?
>>>>>>>
>>>>>>> To get HDFS running again, I had to format the HDFS (including manually
>>>>>>> erasing the files from DataNodes). While this is reasonable in a test
>>>>>>> environment - production-wise it would be very bad.
>>>>>>>
>>>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>>>> future?
>>>>>>>
>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>>
>>>>>>>
>>
>
