hadoop-hdfs-user mailing list archives

From Terry Healy <the...@bnl.gov>
Subject Re: Unable to start NN after rack assignment attempt
Date Fri, 18 May 2012 18:28:46 GMT
Todd-

Thanks for your reply. I went out on a limb, started digging in the
source code, and figured the problem was the FSImage. So I saved the
existing image, copied over the copy from my checkpoint directory, and
got running again.
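
The manual recovery described above can be sketched roughly as follows. This is only an illustration, not the exact procedure used in this thread: the function name and all three paths are hypothetical, and it assumes that both dfs.name.dir and the checkpoint directory keep their metadata in a `current` subdirectory. It should only be run while the NameNode is stopped, and anything written to HDFS after the last checkpoint would be lost.

```python
import shutil
import time
from pathlib import Path


def restore_from_checkpoint(name_dir, checkpoint_dir, backup_root):
    """Back up name_dir/current, then replace it with checkpoint_dir/current.

    All three arguments are hypothetical placeholders for the values of
    dfs.name.dir, fs.checkpoint.dir and a scratch backup location.
    Returns the backup path, or None if there was nothing to back up.
    """
    name_current = Path(name_dir) / "current"
    ckpt_current = Path(checkpoint_dir) / "current"
    if not ckpt_current.is_dir():
        raise FileNotFoundError(f"no checkpoint found at {ckpt_current}")

    backup = None
    if name_current.exists():
        # Keep the suspect image: it may be needed to diagnose the truncation.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = Path(backup_root) / f"name-current-{stamp}"
        shutil.copytree(name_current, backup)
        shutil.rmtree(name_current)

    # Replace the damaged metadata with the last checkpoint.
    shutil.copytree(ckpt_current, name_current)
    return backup
```

Keeping the damaged copy rather than overwriting it matters here, since it is the only evidence of how the image was truncated.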

I ran a few jobs to test, then went back to getting a problem new node
running. Once again it looks like I will have to manually force an exit
from safe mode to run fsck -move.
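
For readers following along, the safe mode and fsck steps mentioned above correspond to the commands below. This is a minimal sketch: the helper names are made up for illustration, and it assumes the `hadoop` launcher is on the PATH. It defaults to a dry run that only prints the commands.

```python
import subprocess


def recovery_commands(path="/"):
    """Return the argv lists for leaving safe mode and then running fsck -move."""
    return [
        ["hadoop", "dfsadmin", "-safemode", "get"],    # report the current state
        ["hadoop", "dfsadmin", "-safemode", "leave"],  # force the exit
        ["hadoop", "fsck", path, "-move"],             # move corrupted files to /lost+found
    ]


def run_recovery(path="/", dry_run=True):
    """Print each command; execute it only when dry_run is False.

    fsck may exit non-zero when it finds problems, so return codes are
    collected instead of being treated as errors.
    """
    codes = []
    for argv in recovery_commands(path):
        print(" ".join(argv))
        if not dry_run:
            codes.append(subprocess.run(argv).returncode)
    return codes
```

Calling `run_recovery()` with the defaults prints the three commands without touching the cluster; pass `dry_run=False` only once the output looks right.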

I sent mail to Harsh earlier - I think I must migrate to CDH as I fear
my manual hacking with configs and such has caused the fragile state
that the cluster is in now.

Thanks,

Terry

On 05/18/2012 12:34 PM, Todd Lipcon wrote:
> Hi Terry,
> 
> It seems like something got truncated in your FSImage... though it's
> unclear how that might have happened.
> 
> If you're able to share your logs and your dfs.name.dir contents, feel
> free to contact me off-list and I can try to take a look to diagnose
> the issue and try to recover the system. Of course whenever any
> corruption issue occurs we take it seriously and want to get at a root
> cause to prevent future occurrences!
> 
> Thanks
> -Todd
> 
> On Fri, May 18, 2012 at 6:57 AM, Terry Healy <thealy@bnl.gov> wrote:
>> Sorry, forgot to attach the trace:
>> <code>
>> 2012-05-18 09:54:45,355 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 128
>> 2012-05-18 09:54:45,379 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
>> java.io.EOFException
>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
>>        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>> 2012-05-18 09:54:45,380 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
>>        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>
>> 2012-05-18 09:54:45,380 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>> /************************************************************
>> SHUTDOWN_MSG: Shutting down NameNode at abcd/1xx.1xx.2xx.3xx
>> ************************************************************/
>>
>> </code>
>>
>>
>>
>> On 05/18/2012 09:51 AM, Terry Healy wrote:
>>> Running Apache Hadoop 1.0.2, ~12 datanodes
>>>
>>> Ran fsck / beforehand -> OK; everything was running as expected.
>>>
>>> Started trying to use a script to assign nodes to racks, which required
>>> several stop-dfs.sh / start-dfs.sh cycles (with some stop-all.sh /
>>> start-all.sh too, if that matters).
>>>
>>> Got past errors in script and data file, but dfsadmin -report still
>>> showed all assigned to default rack. I tried replacing one system name
>>> in the rack mapping file with its IP address. At this point the NN
>>> failed to start up.
>>>
>>> So I commented out the topology.script.file.name property statements in
>>> hdfs-site.xml
>>>
>>> NN still fails to start; trace below indicating EOF Exception, but I
>>> don't know what file it can't read.
>>>
>>> As always your patience with a noob appreciated; any suggestions to get
>>> started again? (I can forget about the rack assignment for now)
>>>
>>> Thanks.
>>>
>>>
>>
>>
> 
> 
> 
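
For reference, the rack-assignment script discussed in the first message of this thread follows a simple contract: Hadoop invokes the program named by topology.script.file.name with one or more node addresses as arguments and reads one rack path per argument from standard output. The sketch below is a hypothetical example of such a script, not the one used on this cluster; the host names, IP addresses and rack names are invented. Because the arguments may arrive as IP addresses rather than host names, the mapping lists both forms.

```python
#!/usr/bin/env python
import sys

DEFAULT_RACK = "/default-rack"

# Hypothetical mapping; every node appears under both its host name and
# its IP address, since the script may be called with either form.
RACK_MAP = {
    "node01.example.org": "/rack1",
    "10.0.0.11": "/rack1",
    "node02.example.org": "/rack2",
    "10.0.0.12": "/rack2",
}


def resolve_racks(hosts, rack_map=RACK_MAP):
    """Return one rack path per argument, in the same order.

    Unknown nodes fall back to the default rack so that every argument
    always gets an answer.
    """
    return [rack_map.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__" and len(sys.argv) > 1:
    # One whitespace-separated rack path per argument.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Running the script by hand with a datanode's host name and then with its IP address is a quick way to check that both resolve to the same rack before restarting the NameNode.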

-- 
Terry Healy / thealy@bnl.gov
Cyber Security Operations
Brookhaven National Laboratory
Building 515, Upton N.Y. 11973
