hadoop-mapreduce-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Namenode failures
Date Sun, 17 Feb 2013 22:41:14 GMT
Hello Robert,

         It seems that your edit logs and fsimage have somehow become
corrupted. It looks somewhat similar to this issue:
https://issues.apache.org/jira/browse/HDFS-686

Have you made any changes to the 'dfs.name.dir' directories lately? Do you
have enough free space on the volumes where the metadata is being stored?
You can also use the Offline Image Viewer to diagnose the fsimage file.
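
For example, something along these lines (the paths are only placeholders
for your own dfs.name.dir entries, and whether the oiv command is bundled
depends on your release):

    # check free space on every volume that holds namenode metadata
    df -h /data/1/dfs/name /data/2/dfs/name /mnt/nfs/dfs/name

    # the copies in each dfs.name.dir should be identical
    # (check the edits file the same way)
    md5sum /data/1/dfs/name/current/fsimage \
           /data/2/dfs/name/current/fsimage \
           /mnt/nfs/dfs/name/current/fsimage

    # dump the image to text and look for anything odd; on 1.x you may
    # need to run the viewer from a newer release against a copy of the file
    hdfs oiv -i /data/1/dfs/name/current/fsimage -o /tmp/fsimage.txt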

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <psybers@gmail.com> wrote:

> It just happened again.  This was after a fresh format of HDFS/HBase and I
> am attempting to re-import the (backed up) data.
>
>   http://pastebin.com/3fsWCNQY
>
> So now if I restart the namenode, I will lose data from the past 3 hours.
>
> What is causing this?  How can I avoid it in the future?  Is there an easy
> way to monitor (other than a script grep'ing the logs) the checkpoints to
> see when this happens?
>
>
> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <psybers@gmail.com> wrote:
>
>> Forgot to mention: Hadoop 1.0.4
>>
>>
>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <psybers@gmail.com> wrote:
>>
>>> I am a bit at my wits' end here.  Every single time I restart the
>>> namenode, I get this crash:
>>>
>>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
>>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>
>>> I am following best practices here, as far as I know.  I have the
>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>> have the exact same files in them.
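>>>
>>> The relevant hdfs-site.xml entry looks roughly like this (the actual
>>> paths differ; these are just placeholders):
>>>
>>>   <property>
>>>     <name>dfs.name.dir</name>
>>>     <value>/data/1/dfs/name,/data/2/dfs/name,/mnt/nfs/dfs/name</value>
>>>   </property>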
>>>
>>> I also run a secondary checkpoint node.  This one appears to have
>>> started failing a week ago, so checkpoints have *not* been taken since
>>> then.  Thus I can get the NN up and running, but with week-old data!
>>>
>>>  What is going on here?  Why does my NN data *always* wind up causing
>>> this exception over time?  Is there some easy way to get notified when the
>>> checkpointing starts to fail?
>>>
>>
>>
>>
>> --
>>
>> Robert Dyer
>> rdyer@iastate.edu
>>
>
>
>
> --
>
> Robert Dyer
> rdyer@iastate.edu
>
