hadoop-common-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: ERROR dfs.NameNode - java.io.EOFException
Date Sun, 06 Jul 2008 22:48:02 GMT
Hi,
(lohit has been helping me with this and asked me to send this detailed message to the list
for others to see and possibly help)

Right, I saw the permission reference, too.  I assume(d) the permission
information is stored inside the edits file and that that part of the
edits is corrupt and causing the exception.
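
FWIW, the owner/group strings do seem to be sitting in the edits file (the strings output further
down shows otis/supergroup entries), so I think the permission info is at least in there.  A quick,
rough check from my setup:

## count owner/group ("supergroup") strings in the binary edits file
$ strings /mnt/nutch/filesystem/name/current/edits | grep -c supergroup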

I'm using 0.16.2
(with Nutch).  I'm not upgrading.  Hadoop died while I wasn't watching
it, while running the usual Nutch MapReduce jobs.  I'm running
everything with the same user.  I'm really not making any changes. 
Hadoop just died and when I tried restarting it, NN would not start.

The first errors from the logs when things went bad look like this:

Task task_200806101759_0344_r_000001_0 failed to report status for 601 seconds. Killing!
Task task_200806101759_0344_r_000009_0 failed to report status for 600 seconds. Killing!
Task task_200806101759_0344_r_000007_0 failed to report status for 602 seconds. Killing!
Task task_200806101759_0344_r_000008_0 failed to report status for 601 seconds. Killing!
Error initializing task_200806101759_0344_r_000009_1:
org.apache.hadoop.util.DiskChecker$DiskErrorException:
Could not find any valid local directory for
taskTracker/jobcache/job_200806101759_0344/job.xml
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:639)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)
...
...
From the task name, it looks like it was started on 2008-06-10.  But the errors above
happened on June 23rd - 13 days later, and I am quite sure my jobs were not taking 13 days to complete!
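
One more thought on that DiskErrorException above: it's the TaskTracker complaining that none of its
mapred.local.dir directories could take the job.xml, which smells like the local dirs filling up.
Something like the following would show how big the leftover jobcache dirs are on each node; the
/mnt/hadoop/mapred/local path is just a guess at where mapred.local.dir points on my boxes and would
need to be checked against hadoop-site.xml:

## size of leftover per-job dirs under mapred.local.dir (base path below is a guess)
$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h 'du -sh /mnt/hadoop/mapred/local/taskTracker/jobcache' 2>/dev/null; done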

Then a little later in the log I see:

java.io.IOException: task_200806101759_0352_r_000005_0The reduce copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:260)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)


And then a little later I see this:

FSError: java.io.IOException: No space left on device
task_200806101759_0370_m_000040_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.TaskRunner).
task_200806101759_0370_m_000040_0: log4j:WARN Please initialize the log4j system properly.
FSError: java.io.IOException: No space left on device
FSError: java.io.IOException: No space left on device
task_200806101759_0370_m_000076_0: Exception in thread "SortSpillThread" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:171)
task_200806101759_0370_m_000076_0:      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
task_200806101759_0370_m_000076_0:      at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
task_200806101759_0370_m_000076_0:      at java.io.DataOutputStream.write(DataOutputStream.java:90)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:339)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:141)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:124)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:58)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:36)
task_200806101759_0370_m_000076_0:      at java.io.DataOutputStream.writeInt(DataOutputStream.java:183)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.io.SequenceFile$Writer.sync(SequenceFile.java:927)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:954)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:987)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:555)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:497)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:264)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.run(MapTask.java:439)
task_200806101759_0370_m_000076_0: Caused by: java.io.IOException: No space left on device
task_200806101759_0370_m_000076_0:      at java.io.FileOutputStream.writeBytes(Native Method)
task_200806101759_0370_m_000076_0:      at java.io.FileOutputStream.write(FileOutputStream.java:260)
task_200806101759_0370_m_000076_0:      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:169)
task_200806101759_0370_m_000076_0:      ... 17 more

The weird thing is that none of the nodes in the cluster are out of disk space!
(But maybe that's because one of the nodes really ran out of disk temporarily, and then
some files got cleaned up, so now the disks are no longer full.)

$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h df -h | grep mnt; done;
localhost ---- this is the NN
/dev/sdb              414G  384G  9.1G  98% /mnt
/dev/sdc              414G   91G  302G  24% /mnt2
10.252.93.155
/dev/sdb              414G  389G  3.7G 100% /mnt  // but it has 3.7GB free!
/dev/sdc              414G   93G  300G  24% /mnt2
10.252.239.63
/dev/sdb              414G  388G  4.5G  99% /mnt
/dev/sdc              414G   90G  303G  23% /mnt2
10.251.235.224
/dev/sdb              414G  362G   32G  93% /mnt
/dev/sdc              414G   92G  301G  24% /mnt2
10.252.230.32
/dev/sdb              414G  189G  205G  48% /mnt
/dev/sdc              414G  183G  210G  47% /mnt2
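
To figure out what is actually eating the nearly-full /mnt disks (DFS block data vs. MapReduce
spill/temp files), a per-directory breakdown like this might help.  I'm assuming everything lives
under /mnt/nutch/filesystem the way it does on the NN; the exact layout on the slaves would need checking:

## rough per-directory usage on each node (layout under /mnt is an assumption)
$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h 'du -sh /mnt/nutch/filesystem/*' 2>/dev/null; done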


The error when I try starting NN now is an EOFException.  Is there something that tells Hadoop
how many records to read from the edits file?  If so, and that number is greater than the number
of records actually left in the edits file, then I don't think I'll be able to fix the problem by
removing lines from the edits file.  No?
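
For reference, this is roughly how one could peek at the raw end of the binary edits file to see
where the last record gets cut off (vi isn't great for this since the file isn't really text):

## hex/char dump of the last 256 bytes of the edits file
$ tail -c 256 /mnt/nutch/filesystem/name/current/edits | od -c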

I don't mind losing *some* data in HDFS.  Can I just remove the edits file completely and assume
that NN will start (even though it won't know about some portion of the data in HDFS, which I
assume I would then have to find and remove/clean up somehow myself)?
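
Whatever I try (trimming the last record or blowing edits away), I'll follow lohit's earlier advice
and keep a copy of the whole current directory first, and then run fsck once NN is up to see which
files were affected.  Roughly (with the NN stopped):

## back up the image + edits before touching anything
$ cp -a /mnt/nutch/filesystem/name/current /mnt/nutch/filesystem/name/current.bak
## once the NN is back up, check for damaged files
$ bin/hadoop fsck /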

I've been running Hadoop for several months now.  The first records in the "edits" file seem to be
from 2008-06-10, and most of the records seem to be from June 10, while I started seeing errors in
the logs on June 23.  Here are some details:

## 74K lines
[otis@domU-12-31-38-00-E8-D1 logs]$ wc -l /mnt/nutch/filesystem/name/current/edits
74403 /mnt/nutch/filesystem/name/current/edits

## 454K lines of "strings"
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | wc -l
454271

## Nothing from before June 10
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 2008060 | wc -l
0

## 139K lines from June (nothing from before June 10)
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 200806 | wc -l
139524

## most of the records are from June 10, seems related to those problematic tasks from 20080610
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 200806 | grep -c 20080610
130519

## not much from June 11 (12th, 13th, and so on)
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 20080611 | wc -l
1834

## the last few non-June-10th lines with the string "200806" in them.  I think the 27th is when Hadoop completely died.
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 200806 | grep -v 20080610 | tail
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00012
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00013
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00014
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00011
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00005
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00004
F/user/otis/crawl/segments/20080627171831/crawl_generate/_temporary
;/user/otis/crawl/segments/20080627171831/crawl_generate
;/user/otis/crawl/segments/20080627171831/crawl_generate
7/user/otis/crawl/segments/20080627171831/_temporary

## the last few lines
[otis@domU-12-31-38-00-E8-D1 logs]$ strings /mnt/nutch/filesystem/name/current/edits | tail
67108864
otis
supergroup
f/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_000004_0/crawl_fetch
1214687074224
otis
supergroup
q/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_000004_0/crawl_fetch/part-00004
1214687074224
otis

Sorry for the long email.  Maybe you will see something in all this.  Any help would be greatly
appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:36:37 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> Hm, tried it (simply edited with vi, removed the last line).  I did it with both 
> edits file and fsimage file (I see references to FSEditLog.java and FSImage.java 
> in the stack trace below), but that didn't seem to help.  Namenode just doesn't 
> start at all.  I can't see any errors in logs due to a failed startup, I just 
> see that exception below when I actually try using Hadoop.
> 
> What's the damage if I simply remove all of their content? (a la "> edits" and "> fsimage")
> 
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
> > From: lohit 
> > To: core-user@hadoop.apache.org
> > Sent: Saturday, July 5, 2008 10:08:57 AM
> > Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> > 
> > I remember dhruba telling me about this once.
> > Yes, Take a backup of the whole current directory.
> > As you have seen, remove the last line from edits and try to start the NameNode. 
> > 
> > If it starts, then run fsck to find out which file had the problem. 
> > Thanks,
> > Lohit
> > 
> > ----- Original Message ----
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, July 4, 2008 4:46:57 PM
> > Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> > 
> > Hi,
> > 
> > If it helps with the problem below -- I don't mind losing some data.
> > For instance, I see my "edits" file has about 74K lines.
> > Can I just nuke the edits file or remove the last N lines?
> > 
> > I am looking at the edits file with vi and I see the very last line is very 
> > short - it looks like it was cut off, incomplete, and some of the logs do 
> > mention running out of disk space (even though the NN machine has some more free space).
> > 
> > Could I simply remove this last incomplete line?
> > 
> > Any help would be greatly appreciated.
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > 
> > ----- Original Message ----
> > > From: Otis Gospodnetic 
> > > To: core-user@hadoop.apache.org
> > > Sent: Friday, July 4, 2008 2:00:58 AM
> > > Subject: ERROR dfs.NameNode - java.io.EOFException
> > > 
> > > Hi,
> > > 
> > > Using Hadoop 0.16.2, I am seeing the following in the NN log:
> > > 
> > > 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> > >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > >         at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> > >         at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> > >         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> > >         at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> > >         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> > >         at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> > >         at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> > >         at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> > >         at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> > >         at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> > >         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> > >         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> > >         at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> > >         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> > > 
> > > The exception doesn't include the name and location of the file whose reading
> > > is failing and causing EOFException :(
> > > But it looks like it's the fsedit log (the "edits" file, I think).
> > > 
> > > There is no secondary NN in the cluster.
> > > 
> > > Is there any way I can revive this NN?  Any way to "fix" the corrupt "edits" file?
> > > 
> > > Thanks,
> > > Otis

