nutch-user mailing list archives

From "Danicela nutch" <Danicela-nu...@mail.com>
Subject Persistent Crawldb Checksum error
Date Mon, 05 Dec 2011 13:06:41 GMT
Hi,

 I was running index jobs at the same time as updates, and after several successful indexes I
think the crawldb became corrupted; since then, every generate, update and index fails at the
end of the process with the same error:


 2011-12-03 05:47:44,017 WARN mapred.LocalJobRunner - job_local_0001
 org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/nutch/nutchexec/runs/fr4/crawldb/current/part-00000/data at 3869358080
 at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:278)
 at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:242)
 at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:177)
 at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:194)
 at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
 at java.io.DataInputStream.readFully(DataInputStream.java:178)
 at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
 at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2062)
 at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 2011-12-03 05:47:44,509 FATAL crawl.CrawlDb - CrawlDb update: java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
 at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)



 Every attempt fails at the same offset (3869358080) of the 'data' file, which is why I think
the crawldb is corrupted.

 What can I do to 'repair' the crawldb, assuming that is indeed the problem?
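
 One possible repair route (a sketch, not a confirmed fix from this thread) is to rebuild the
crawldb by merging it into a fresh one while telling Hadoop's SequenceFile reader to skip past
bad checksum chunks instead of aborting. Hadoop honors the `io.skip.checksum.errors` property
for this; records inside the corrupted chunk are lost, but the rest of the db is recovered.
The paths below are taken from the error message; the exact invocation depends on your Nutch
version (the `-D` option only works if the tool runs through Hadoop's ToolRunner — otherwise
set the property in conf/nutch-site.xml):

 # 1. Back up the damaged crawldb first.
 cp -r /home/nutch/nutchexec/runs/fr4/crawldb /home/nutch/nutchexec/runs/fr4/crawldb.bak

 # 2. Merge the old crawldb into a new one, skipping chunks that fail checksum
 #    verification (io.skip.checksum.errors makes the reader seek to the next
 #    sync point instead of throwing ChecksumException).
 bin/nutch mergedb -Dio.skip.checksum.errors=true \
     /home/nutch/nutchexec/runs/fr4/crawldb_repaired \
     /home/nutch/nutchexec/runs/fr4/crawldb

 # 3. If the merge succeeds, move crawldb_repaired into place of crawldb.

 Alternatively, the same property can go into conf/nutch-site.xml so every job tolerates the
bad chunk:

 <property>
   <name>io.skip.checksum.errors</name>
   <value>true</value>
   <description>Skip SequenceFile chunks that fail CRC verification instead of failing the job.</description>
 </property>

 Expect some URLs to disappear from the crawldb either way, since the records in the corrupted
chunk cannot be read back.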

 Thanks.
