nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Persistent Crawldb Checksum error
Date Mon, 05 Dec 2011 13:44:38 GMT
I'm not sure that will work. A broken CrawlDB will always result in an error
for jobs that read it. This is a Hadoop problem.

You can try removing the part-xxxx file that throws the error and hope the 
rest works fine.
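
If you want to salvage whatever is still readable before deleting anything,
something along the following lines may work. This is only a rough sketch
against the Hadoop 0.20-era API that Nutch 1.x builds on; the paths are taken
from your error message and the class name is made up, so adjust both to your
setup. It copies the readable <url, CrawlDatum> records from the broken data
file into a fresh SequenceFile and stops at the first record it cannot read:

  // Rough salvage sketch, not tested -- back up the crawldb directory first.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  public class SalvageCrawlDbPart {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.getLocal(conf);
      // Skip CRC verification so reading can get past the bad checksum chunk;
      // it may still fail at the bytes that are really damaged.
      fs.setVerifyChecksum(false);

      // Paths taken from the error below; change them to match your layout.
      Path in = new Path("/home/nutch/nutchexec/runs/fr4/crawldb/current/part-00000/data");
      Path out = new Path("/home/nutch/nutchexec/runs/fr4/crawldb-salvaged/part-00000/data");

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, CrawlDatum.class);

      Text key = new Text();
      CrawlDatum value = new CrawlDatum();
      long saved = 0;
      try {
        while (reader.next(key, value)) {
          writer.append(key, value);
          saved++;
        }
      } catch (Exception e) {
        // Stop at the first unreadable record; everything before it is kept.
        System.err.println("Stopped after " + saved + " records: " + e);
      } finally {
        reader.close();
        writer.close();
      }
    }
  }

The part directories are MapFiles, so you would still have to recreate the
index for the salvaged part (MapFile.fix can help there) or merge it back into
a fresh crawldb before Nutch will accept it, and you may lose everything after
the corrupt offset.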

On Monday 05 December 2011 14:38:58 Lewis John Mcgibbney wrote:
> Hi Danicela,
> 
> Have a look here [1]. Although your problem is not directly linked to
> fetching, the symptoms and the subsequent solution are the same.
> 
> Unfortunately this is quite a messy one but will hopefully get you going in
> the right direction again.
> 
> [1]
> http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F
> 
> On Mon, Dec 5, 2011 at 1:06 PM, Danicela nutch <Danicela-nutch@mail.com> wrote:
> > Hi,
> > 
> >  I was running indexes at the same time as updates, and after some
> > successful indexes I think the crawldb became corrupted. Since then, all
> > generate, update and index runs fail at the end of the process with the
> > same error:
> >  2011-12-03 05:47:44,017 WARN mapred.LocalJobRunner - job_local_0001
> > 
> >  org.apache.hadoop.fs.ChecksumException: Checksum error:
> > file:/home/nutch/nutchexec/runs/fr4/crawldb/current/part-00000/data at
> > 3869358080
> > 
> >  at
> >  org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:278)
> >  at
> > 
> > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java
> > :242)
> > 
> >  at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
> >  at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
> >  at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
> >  at java.io.DataInputStream.readFully(DataInputStream.java:178)
> >  at
> > 
> > org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:
> > 63)
> > 
> >  at
> >  org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> >  at
> >  org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
> >  at
> >  org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
> >  at
> > 
> > org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecord
> > Reader.java:76)
> > 
> >  at
> > 
> > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.j
> > ava:192)
> > 
> >  at
> > 
> > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:17
> > 6)
> > 
> >  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> >  at
> > 
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> > 
> >  2011-12-03 05:47:44,509 FATAL crawl.CrawlDb - CrawlDb update:
> > java.io.IOException: Job failed!
> > 
> >  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
> >  at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
> >  at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
> >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >  at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
> >  
> >  
> >  
> >  All attempts fail at the same offset (3869358080) in the 'data' file,
> > which is why I think the crawldb has a problem.
> > 
> >  What can I do to 'repair' the crawldb, if that is indeed the problem?
> >  
> >  Thanks.

-- 
Markus Jelsma - CTO - Openindex
