hadoop-common-user mailing list archives

From Jeremy Hansen <jer...@skidrow.la>
Subject Re: IMAGE_AND_EDITS Failed
Date Wed, 07 Sep 2011 16:39:46 GMT
The problem is that fsimage and edits are no longer being updated, so if I restart, how could
it replay those?

-jeremy
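
One way to confirm that fsimage and edits have stopped being updated is to compare their modification times against the current time. A minimal sketch in Python, assuming the name directory is passed on the command line; the function names and the 24-hour threshold are illustrative and not part of Hadoop:

```python
import os
import sys
import time
from typing import Dict, List, Optional


def file_ages_hours(directory: str, now: Optional[float] = None) -> Dict[str, float]:
    """Map each regular file in `directory` to its age in hours (by mtime)."""
    now = time.time() if now is None else now
    ages: Dict[str, float] = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip subdirectories
            ages[name] = (now - os.path.getmtime(path)) / 3600.0
    return ages


def stale_files(directory: str, max_age_hours: float,
                now: Optional[float] = None) -> List[str]:
    """Return the names of files older than `max_age_hours`."""
    ages = file_ages_hours(directory, now)
    return [name for name, age in ages.items() if age > max_age_hours]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical threshold: flag anything untouched for more than a day.
    for name in stale_files(sys.argv[1], max_age_hours=24.0):
        print("stale:", name)
```

Run it against the directory that holds fsimage and edits. It only reads file metadata, so it should be safe to use while the NN is up.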


On Sep 7, 2011, at 8:48 AM, Ravi Prakash wrote:

> Actually I take that back. Restarting the NN might not result in loss of
> data. It will probably just take longer to start up because it would read
> the fsimage, then apply the fsedits (rather than the SNN doing it).
> 
> On Wed, Sep 7, 2011 at 10:46 AM, Ravi Prakash <ravihadoop@gmail.com> wrote:
> 
>> Hi Jeremy,
>> 
>> Couple of questions:
>> 
>> 1. Which version of Hadoop are you using?
>> 2. If you write something into HDFS, can you subsequently read it?
>> 3. Are you sure your secondarynamenode configuration is correct? It seems
>> like your SNN is telling your NN to roll the edit log (move the journaling
>> directory from current to .new), but when it tries to download the image
>> file, it's not finding it.
>> 4. I wish I could say I haven't ever seen that stack trace in the logs. I
>> was seeing something similar (not the same, quite far from it actually):
>> https://issues.apache.org/jira/browse/HDFS-2011
>> 
>> If I were you, and I felt exceptionally brave (mind you I've worked with
>> only test systems, no production sys-admin guts for me ;-) ) I would
>> probably do everything I can, to get the secondarynamenode started properly
>> and make it checkpoint properly.
>> 
>> Methinks restarting the namenode will most likely result in loss of data.
>> 
>> Hope this helps
>> Ravi.
>> 
>> 
>> 
>> 
>> On Tue, Sep 6, 2011 at 7:26 PM, Jeremy Hansen <jeremy@skidrow.la> wrote:
>> 
>>> 
>>> I happened to notice this today and being fairly new to administering
>>> hadoop, I'm not exactly sure how to pull out of this situation without data
>>> loss.
>>> 
>>> The checkpoint hasn't happened since Sept 2nd.
>>> 
>>> -rw-r--r-- 1 hdfs hdfs        8889 Sep  2 14:09 edits
>>> -rw-r--r-- 1 hdfs hdfs   195968056 Sep  2 14:09 fsimage
>>> -rw-r--r-- 1 hdfs hdfs   195979439 Sep  2 14:09 fsimage.ckpt
>>> -rw-r--r-- 1 hdfs hdfs           8 Sep  2 14:09 fstime
>>> -rw-r--r-- 1 hdfs hdfs         100 Sep  2 14:09 VERSION
>>> 
>>> /mnt/data0/dfs/nn/image
>>> -rw-r--r-- 1 hdfs hdfs    157 Sep  2 14:09 fsimage
>>> 
>>> I'm also seeing this in the NN logs:
>>> 
>>> 2011-09-06 16:48:23,738 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.10.10.11
>>> 2011-09-06 16:48:23,740 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException
>>>       at org.apache.hadoop.hdfs.server.namenode.FSImage.getImageFile(FSImage.java:219)
>>>       at org.apache.hadoop.hdfs.server.namenode.FSImage.getFsImageName(FSImage.java:1584)
>>>       at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:75)
>>>       at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:70)
>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>       at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:70)
>>>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>       at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>       at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:824)
>>>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>>       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>>>       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>>>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>>>       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>>       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>>       at org.mortbay.jetty.Server.handle(Server.java:326)
>>>       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>>>       at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>>>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>>>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>>>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>> 
>>> On the secondary name node:
>>> 
>>> 2011-09-06 16:51:53,538 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.FileNotFoundException: http://ftrr-nam6000.chestermcgee.com:50070/getimage?getimage=1
>>>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>       at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>       at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1360)
>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>       at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1354)
>>>       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1008)
>>>       at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:183)
>>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$3.run(SecondaryNameNode.java:348)
>>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$3.run(SecondaryNameNode.java:337)
>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:337)
>>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:422)
>>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:313)
>>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:276)
>>>       at java.lang.Thread.run(Thread.java:619)
>>> Caused by: java.io.FileNotFoundException: http://ftrr-nam6000.las1.fanops.net:50070/getimage?getimage=1
>>>       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1303)
>>>       at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2165)
>>>       at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:175)
>>>       ... 10 more
>>> 
>>> Any help would be very much appreciated.  I'm scared to shut down the NN.
>>> I've tried restarting the 2NN.
>>> 
>>> Thank You
>>> -jeremy
>>> 
>> 
>> 
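
For readers following the thread: the startup sequence Ravi describes (read the fsimage, then apply the edits on top of it) can be illustrated with a toy model. This is only a sketch in Python under loud assumptions: the namespace is a plain set of paths, and the ("create" / "delete", path) record shape is invented for illustration. The real fsimage and edits files are binary and carry far more state.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical edit record: (operation, path). Not the real HDFS edit format.
Edit = Tuple[str, str]


def replay(image: Iterable[str], edits: List[Edit]) -> Set[str]:
    """Apply `edits`, in order, to a copy of the checkpointed namespace."""
    namespace = set(image)  # start from the last checkpoint
    for op, path in edits:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


if __name__ == "__main__":
    checkpoint = {"/a", "/b"}
    log = [("create", "/c"), ("delete", "/a")]
    print(sorted(replay(checkpoint, log)))  # prints ['/b', '/c']
```

In this model the result depends only on the image and on every edit logged since it was written, which is why the replay takes longer the more edits have accumulated since the last checkpoint.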

